Starting is Half the Battle: Collecting as the First Step to Software Preservation

Quick note: This blog post will also be posted on The Future of Information Alliance (FIA) website.

Example of the types of software developed at the National Library of Medicine

As the National Digital Stewardship Resident (NDSR) at the National Library of Medicine, I am currently devising a software preservation pilot strategy. This strategy entails the repository ingest of software materials held on obsolete media, the description of those materials, and the creation or digitization of contextual materials. Complicating this project is the fact that there has not been a comprehensive collection strategy for software at NLM, and many documents and copies of software have been lost over time. With this in mind, the first, and perhaps most important, step is for institutions to include software and software documentation as a part of their pre-existing collection strategies.

Software preservation is, quite simply, the attempt to make software usable many years in the future. Although it is possible to save the bytes of a piece of software, providing access to it and making it usable is more difficult. Because software relies on complex technical infrastructures in order to operate properly, future users may not be able to interact with software in a meaningful way if an institution only saves the bytes. For a software program to function, it needs to be installed on the correct operating system, on the correct hardware, and with any necessary ancillary programs or code libraries also installed. For a preservationist, this can be a nightmare.
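To make the dependency problem concrete, here is a minimal sketch of what recording a program's environment might look like. The field names and values are hypothetical illustrations, not a real NLM schema; the point is that a future emulation effort needs the operating system, hardware, and ancillary dependencies written down, not just the bytes.

```python
# Hypothetical sketch: recording the environment a program needs to run.
# Field names and values are illustrative, not an actual preservation schema.

environment_record = {
    "title": "Example Tutorial Program",
    "media": "5.25-inch floppy disk",
    "operating_system": "Apple DOS 3.3",
    "hardware": "Apple II",
    "dependencies": ["BASIC interpreter in ROM"],
    "documentation": ["user manual", "installation guide"],
}

def missing_fields(record):
    """Return any environment fields a future emulator would need but lacks."""
    required = {"operating_system", "hardware", "dependencies"}
    return sorted(required - record.keys())

print(missing_fields(environment_record))  # prints [] (nothing missing here)
```

A record like this is cheap to create at the time of acquisition and nearly impossible to reconstruct decades later, which is why it belongs in the collection strategy itself.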

Emulation is one way to deal with these complex dependencies, and recent attempts to make emulation an easier option for libraries and archives have made immense gains. A variety of services – EaaS and The Olive Archive among them – help people emulate computing environments and assist libraries and museums with the vast technical dependencies of software programs. While this technology is a big step forward for the field and for access to software-based materials, it does not constitute a complete response to the need for software preservation. Before getting to the point of emulating materials, an institution needs to address how it will collect software.

The process of devising and implementing a collection strategy for software and related materials can be daunting even for institutions whose collections are already closely aligned with software, computing, and technology history. Regardless, a comprehensive collection strategy is the first and most important step to preserving software. Without a collection strategy, it is more likely than not that the software will be lost before an institution has managed to devise and implement larger strategies for future users to access that software, either through emulation or another tool.

A proper collection strategy for software-based materials should reflect the larger collection goals of the institution. It does not make sense for an art library to begin collecting scientific software, but it does make sense for an art library to collect software that artists created or used as part of their creative process. Just as adding A/V materials to a collection strategy does not mean that an institution needs to collect all A/V materials, collecting software does not mean collecting all software. The same care and attention paid to the wider collection strategy needs to be taken when considering software acquisitions.

Presenting about software preservation at the National Library of Medicine; Photo by Ben Petersen

Part of this collection strategy should include contemporaneous documentation and manuals. Throughout my project at NDSR, I’ve relied on manuals and documentation in order to get software running and to understand what the software is meant to do. Without this documentation, having copies of the software would prove of limited value. Furthermore, the marketing and packaging material for software can have historical importance itself. Any collection strategy for born digital materials should consider what analog material also needs to be collected so that future historians, archivists, and researchers can properly contextualize and assign meaning to a piece of software.

It is important that the collection strategy is well-documented and added to the wider collection strategy documents at the institution. Communicating the importance of collecting software or software-based materials is an essential part of beginning a software preservation program. An institution relies on many employees to acquire, care for, and provide access to its materials. Creating accessible documentation about software collecting and engaging in an open dialogue about collecting software is an important aspect of creating a sustainable program.

After software is added to an institution’s collection strategy as is applicable, there are many other questions and issues to attend to. Moving forward, it will be vital for institutions to get software into a digital repository and off of volatile tangible media like floppy disks and CD-ROMs. If the first step is to collect software, the second step is to save the bytes. However, simply adding software to the collection strategy will ensure that the institution, in the future, will have materials to preserve and showcase in whatever manner it decides best suits its resources, audiences, and needs. Without a comprehensive collection strategy, future actions, projects, and programs will be severely limited.

Writing for The Signal

So, I wrote a longer blog post for The Signal, and it went up today. You can view it here.

It reviews some of what I’ve written on this blog, but generally expands on the process of inventorying and describing software.



Technological Optimism & Software Preservation

I’ve been going through Phillip Rogaway’s recent paper “The Moral Character of Cryptographic Work,” and the section on technological optimism was particularly striking. He writes, “Technological optimists believe that technology makes life better. According to this view, we live longer, have more freedom, enjoy more leisure. Technology enriches us with artifacts, knowledge, and potential. Coupled with capitalism, technology has become this extraordinary tool for human development. At this point, it is central to mankind’s mission.”

Later, he continues, “If you’re a technological optimist, a rosy future flows from the wellspring of your work. This implies a limitation on ethical responsibility. The important thing is to do the work, and do it well. This even becomes a moral imperative, as the work itself is your social contribution…a normative need [for the ethic of responsibility] vanishes if, in the garden of forking paths, all paths lead to good…. Unbridled technological optimism undermines the basic need for social responsibility.”

My first thought was a series of jokes about Max Weber. Someone more thoughtful than I needs to write a piece on the Protestant work ethic, capitalism, and the current shape of the tech industry. Maybe I will just write it when I am feeling more thoughtful, but that is for another day. As much as it would be fun to discuss how long-standing American mores affect the current shape of innovation, industry, and ethics in Silicon Valley, I am going to focus today on a topic that more closely relates to my current work at the National Library of Medicine.

One may not think that there is a direct connection between software preservation and the current prevalence of technological optimism. Preservation, of software or any other material, is simultaneously obsessed with the past and the future. A preservationist works for a future client, trying to ensure accessibility to those who may not be born yet. At the same time, the preservationist concerns herself with the past. Currently, I am digging through a pile of 5 1/4″ floppy disks from the private papers of a researcher. I use 30-year-old manuals to try to fix hardware from the 1980s. I’m sitting in a pile of historic refuse, trying to ensure it will be usable in 2040. My relationship with time is a strange one – my thoughts, actions, and goals straddle the past and the future but barely touch the present.

The technological optimist, on the other hand, exists only (or at least mostly) in the present. If their work, as Rogaway argues, becomes a moral imperative in itself, then the past and future of that work does not have as much weight as the current status of the work itself. If all technology helps mankind and technological work is good, then why is it necessary to understand past technologies? Why is it necessary to understand the future ramifications of our current technology? Outside of improving present technology for future use, the present tense becomes the dominant focus when technology becomes its own moral good. Our understandings of time and how we fit into larger historical narratives affect how we relate to the world and, most importantly, the people in it. If we assume constant positive technological innovation, we assume many things about the lives of people all around the world. If the iPhone improves my life, does the existence of the iPhone improve all lives? How will we weigh these inevitably variant experiences?

Of course, the technological optimist, as Rogaway understands her, does not concern herself with these types of questions. Technology is progress and progress is good. Yet, as people begin to look more closely at the ethical and philosophical issues intrinsic in technological invention and use, there will need to be a historical record of technology. Software preservation is a key aspect of that historical record. Understanding software as an important historical object encourages the preservationist to preserve both software and the variety of documentation that software development and use rely on. Without documenting our technological history, we are far more likely to remain entranced with a world view that fails to consider the ramifications and nuances of technological innovation.

To be clear, this is not simply a question of understanding history so you are not doomed to repeat it. There is truth to that statement, but it is not the point. It is easy to disregard documenting software because software is intended for use. We use software until the next thing comes along and then we throw it away. However, with each piece of software we throw away, we also throw away ideas and innovations that may not have fit into contemporaneous understandings of technological progress. While a new version of a piece of software may be available, it does not mean that all aspects of that new version are actually improvements. We lose intellectual histories by disregarding software. With the loss of those histories, we also lose the ability to better discern what constitutes true growth and the role technology plays in that growth. If we are to create an understanding of technology free of the technological optimism that Rogaway describes, we need to preserve our software.

Brief NDSR Project Update

Over the past couple of months, I’ve been chipping away at my NDSR project, and I figure it is time to give an explicit update on my progress.

First, I’ve got a working Apple II and I’ve been running interactive tutorial software on it. Being that I’m not a medical doctor, I’ve failed every tutorial – apparently I have no idea how hyperthermia works. But the other librarians and I have had fun competing anyway. Below, you’ll see our current reigning champion:


Next, I’m nearing a completed version of my white paper on the current status of software preservation. While the paper is available to the NLM community for comment, it will not be publicly available until comments have been received and considered. So, you can look forward to that!

I presented a poster at iPRES 2015 and had a wonderful time doing so. The conference was incredibly informative, and I enjoyed being able to contribute. Here, you’ll find an image of the poster I presented. If you have any questions about it, please don’t hesitate to get in touch.

Finally, I’m preparing a presentation at NLM to show off some of the materials we’ve acquired in the project so far (including RFPs from the 1960s, legacy software, and educational materials from the 1970s forward). For the presentation, I also prepared a list of suggested further reading and other projects to check out. While some of the articles are more in-depth than others, I think many of these links will be of interest to a wider audience. So, please, take a peek:

Articles, Research Reports, and Recorded Lectures

Ongoing Projects

  • Olive Executable Archive, ongoing research project on emulation and software preservation at Carnegie Mellon University
    • Currently in testing at Stanford University Libraries
  • EaaS: Emulation as a Service, ongoing project on emulation at the University of Freiburg
    • Currently in use at Yale University Libraries
  • Video Game Play Capture Project at The Strong National Museum of Play
  • Recomputation.Org, project for the University of St. Andrews that works to preserve software in order to ensure the ability to recompute computational science experiments over at least a 20 year period
  • Astrophysics Source Code Library (ASCL), project at the University of Maryland that registers academic source code from astrophysicists and provides unique IDs for citation; working on the possibility of archiving the code as well

Super Fun Online Examples

  • Theresa Duncan CD-ROMs, project from Rhizome that makes interactive CD-ROMs from the artist Theresa Duncan available online through EaaS.
  • Agent Ruby, early web AI by Lynn Hershman Leeson commissioned by the San Francisco Museum of Modern Art as part of a larger art project on computers and romance, including the film Teknolust
  • The Internet Archive’s Software Collection, includes early games as well as other types of software run through an emulator (JSMESS)

Creating Intellectual Boundaries in Complex Computing Environments: A Small Way in Which Copyright May Actually Help Software Preservation

Early on in my software preservation project, I wrote a blog post about the difficulties of researching software history. One of the issues I listed is establishing the boundaries for a particular piece of software. As I have inventoried the software developed at the National Library of Medicine (NLM), I have been forced to establish what constitutes a single piece of software each time I create a new record. Software goes through many versions, so naturally the preservationist needs to create temporal boundaries. In other words, they need to decide which version or versions of the software need to be preserved. However, it is far more difficult and just as necessary to decide what constitutes one piece of software within a complex system that relies on many moving parts in order to function properly over time.

At NLM, for example, developers experimented with creating a “coach” for Grateful Med. Grateful Med was a user-friendly front-end system that searched NLM’s networked databases. Named Coach Metathesaurus Browser, this software was designed to hook into Grateful Med and provide assistance to the user when the user’s search queries returned inadequate responses. This piece of software was fully developed and tested in NLM’s Reading Room version of Grateful Med, but, for a variety of reasons, it was not adopted widely. While conducting my inventory, I decided that Coach Metathesaurus Browser constituted a separate piece of software and therefore justified its own record. My reasoning was that it had a separate institutional history from Grateful Med and that this history needed to be acknowledged. But, if Coach Metathesaurus Browser had been implemented, users may have interpreted it as a feature of Grateful Med and not as its own entity. What this means is that if Coach Metathesaurus Browser had been implemented, I would have been required to prioritize either the experience of the developer or the experience of the user in my inventory. Considering the goals of my current project, I would have made the same decision and inventoried Coach Metathesaurus Browser separately, but it is important to note that this is a decision with intellectual ramifications.

The problem of boundaries offers an unexpected view of the role of copyright in software preservation. Generally, copyright is only viewed as an obstacle for an institution or individual that wishes to preserve software. It limits what can be done to a piece of software without the consent of the copyright holder and presents a serious issue for the long term access to the cultural heritage inherent in software. Yet, copyright may benefit software preservation projects in one way. Whereas I am working with materials that are not under copyright, a preservationist working with copyrighted materials already has boundaries imposed on a piece of software. The legal structure of copyright requires a clear definition of what is in an individual piece of software and therefore protected by the law. What this means is that boundaries around a piece of software are created at the time that the software is developed and by someone affiliated with the software development project.

Because NLM is a government entity, its software is not under copyright. As I create records in my inventory, I establish what constitutes an individual piece of software, and I draw the intellectual boundaries around that software. While the history of NLM and the history of software development inform these decisions, there is not always a clear right answer. If NLM’s in-house developed software were copyrighted, I could rely on the logic of copyright and the individuals who held that copyright in order to draw these boundaries. In other words, researching the copyright of a particular piece of software may remove the need to make a decision that could cause inaccuracies or anachronisms in a record.

When compared to the obstacles that copyright creates for most software preservation work, this one possible benefit is almost negligible. Yet, highlighting this unexpected aspect of the relationship between copyright and software preservation demonstrates the ways in which an archivist is reliant on context in order to decide what belongs in a collection and how it ought to be described. Where contextual information is scarce, as is frequently the case for software development and use, copyright can provide necessary information. Although this observation is not pertinent to my current project, I will continue to make decisions about what constitutes an individual software project very carefully because I understand how these decisions may affect future perceptions of historic software and computing behavior.

Oral Histories of NLM’s Software Development: The Not Always Smooth Process of Technological Change

As a part of my National Digital Stewardship Residency (NDSR) project, I’m interviewing staff members about their experiences developing and using software at the National Library of Medicine (NLM). Doing these interviews may be my favorite part of the job right now. Each of these staff members has had invaluable insight into the cultural and technical aspects of software development and into the history of NLM itself. Not to mention, I’ve heard a funny story or two.

Most recently, my mentor and I drove to Frederick, MD to interview a retired staff member about her work on NLM’s software products. Before I delve into the interview, I need to thank my mentor for joining me on this Thursday morning adventure and for BBQ after. My mentor is a total mensch.

We drove to Frederick to interview Rose Marie Woodsmall, who worked at NLM for thirty years and was instrumental in AIM-TWX, Grateful Med, and innumerable other projects. Throughout the first two months of my project, her name kept popping up in interviews, in documentation, and even in casual conversations with staff members. Organizing the interview, however, was a bit difficult as Woodsmall had retired and started a sheep farm in rural Maryland. Perhaps my only regret from this interview was that I did not get to meet any of the sheep. Even without the sheep’s input, the interview provided amazing background and context.

The interview was filled with anecdotes, including how Grateful Med got its name. (You’ll have to wait for a later post with that story. I have a long blog post planned about naming conventions at NLM and how that affects research and institutional culture.) Perhaps the most interesting aspect of the interview, however, was how Woodsmall was able to outline the lived experience of major technological shifts in computing and how the technological changes were not always smooth. Woodsmall frequently acted as a liaison between software users and developers on development projects, and she had to navigate relationships between many stakeholders.

The overall push of software development through the years that Woodsmall was a staff member at NLM was towards more user-friendly systems. MEDLARS I, the earliest of NLM’s computerized catalog search systems, required an expert user. Implemented in 1964, it was only searched by specially trained medical librarians, and the training was two weeks long! These medical librarians worked in libraries and hospitals around the world and helped facilitate access to NLM’s resources, even at a geographic distance. It was a big step forward in medical librarianship and medicine in general.

After MEDLARS I, however, came MEDLARS II which allowed for online search capabilities and was slightly less cumbersome to use, although not ‘user-friendly’ or ‘online’ in the ways we understand those terms today. After implementation in 1971, MEDLARS II still required specialized training, and although it was ‘online’ and allowed remote searches of databases in real time, it was not connected to the Internet. In fact, the Internet had not yet been invented. MEDLARS II relied on dedicated phone lines to establish its network connection, and computers that were connected to MEDLARS II were not necessarily connected to any other networks. Some machines were dedicated to simply searching the NLM databases. These machines did not look like computers today. Some were teletype machines, like the one pictured below.

Image provided by the Providence Public Library

Although MEDLARS II sounds very limited considering current search capabilities on the Internet, it was another step forward for medical librarianship. It seems like common sense to assume that everyone welcomed the switch from batch-processing to online search. After all, why not make life easier? However, as Woodsmall pointed out, the switch was not universally applauded. Some of the search analysts, accustomed to their particular place within their individual institutions, were suspicious of the new technology and were concerned about losing their sense of prestige. One librarian even hid the teletype machine used to connect to the network in a closet. She did not want her patrons to use it, and she did not want them to see her use it. Eventually, people saw its utility and the transition became smoother.

Woodsmall talked about similar issues when NLM began to implement Grateful Med widely. Many librarians, even up to the 1990s, were nervous about allowing end-users to search a database without their assistance. In a Letter to the Editor in the Bulletin of the Medical Library Association published in January 1994, Catherine J. Green, a librarian at Bethesda Memorial Hospital in Boynton Beach, Florida, quotes a recent lecture arguing that allowing end-users to search the database on their own “is not only a dumb idea, it is a dangerous idea!” Green and others were worried that doctors would inefficiently or inaccurately search for medical information using Grateful Med, thus impacting the quality of patient care. This concern is not unfounded, and although Grateful Med is now frequently praised for helping democratize access to medical information, this concern at the time of implementation is an important aspect of the technology’s history.

The point is that change is not always universally heralded, and it is possible to forget that fact when those changes are later determined to be technological progress. The moves from batch processing to online search and then to end-user search capabilities are all seen as positive advancements in medicine and medical librarianship now. But, when these changes were implemented, they were not always seen as progress in medical librarianship. Disagreements and long discussions about the viability of these technologies cannot be overlooked because they are an integral part of the way that technological change occurred.

Interviewing someone like Woodsmall is helpful because she was able to highlight those discussions and the more contentious aspects of software development. Implementing new technologies frequently requires advocacy, both within the organization and with users. As I continue to research the history of software development at NLM, interviews will remain important as they help contextualize the technology in the disagreements, debates, and compromises that occur throughout development and implementation.

Data-Centric Decision Making: Assessing Success and Truth in the Age of the Algorithm

The New York Times published an article about the state of white-collar workers at Amazon last weekend, and everyone has been abuzz about it since. I know I’m a bit late to this discussion, but I do have a day job. Anyway, the article portrays a toxic environment, where workers are pressured to work through health crises and are encouraged to tattle on each other through online apps. There is a lot to talk about in the article, and honestly, I’m not sure how much new ground I am going to cover. However, I feel the need to take this opportunity to talk about data-centric decision making, what it means for labor, and what it means for our understandings of truth, fact, and assessment. That is, of course, a lot to discuss, and I will not be able to adequately cover all of these ideas, but with the article in the public sphere right now, it seems like a good time to discuss what ‘data’ can mean in relation to preexisting power structures.

First and foremost, Amazon employs data-centric management. Productivity is calculated and shared. An employee’s performance is directly tied to quantifiable actions – the number of Frozen dolls bought, the number of items left unpurchased in the cart, etc. Being good at your job, in this situation, isn’t just about being competent; it is about constantly proving competence through data. This idea is not terrible at its root. For a couple of reasons, I kind of dig it. According to this 2005 study, creating quantifiable criteria can help managers avoid discriminating against women in hiring practices. Creating specific data points, in this situation, can help managers ensure that they view candidates in an equitable manner. I am sure you could find more examples where deferring to ‘data’ or quantifiable criteria for job performance helps people who have been historically discriminated against in the workplace.

But, what happens when you stop paying attention to how those criteria are quantified? What happens when ‘data’ becomes code for ‘objective truth’ rather than what it is – a human-constructed method for measuring reality that can fall prey to discriminatory practices in the same way as other modes of assessment? An algorithm is not a God-given measurement of truth; it is subject to the same prejudices and flaws as any other human invention.

I will give you an example to demonstrate how this phenomenon can occur. Safiya Umoja Noble, a professor in UCLA’s Information Studies Department, has written extensively on how search engines portray women of color. In an article for Bitch Magazine, she describes what happened when her students searched for ‘black girls’ online. A pornographic website was the first result. To be clear, the students did not mention porn in their search. This website was the first result for the simple query, ‘black girls.’ She describes similar results for ‘Latina girls’ and other women of color. How could this happen? Should porn really be the first result for ‘black girls’? What about Google’s algorithm determined porn to be the most relevant search result?

After a quick thought, the answer is obvious. Google’s algorithm seems to take popularity into consideration when sorting results, meaning that if more people click on porn, then porn is higher in the results. Of course, Google’s algorithm is proprietary and secret, but this assumption does not seem to be outlandish. There is a lot to be said about what using popularity as a criterion for search engine results means for the representation of minorities, but it is best to read Noble’s work to get a thorough discussion of those matters. You’ll learn a lot more that way than if I try to summarize her work for you. Instead, I would like to make a simple point: the search engine is not infallible. It is a human-designed device that reflects preexisting human priorities. When these priorities are problematic, so are the search results.
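To make the point concrete, here is a toy sketch of a popularity-weighted ranker. This is emphatically not Google’s actual algorithm (which, again, is secret); the function, field names, and weights are all hypothetical. It simply shows how a designer’s choice of one number, the popularity weight, can let heavily clicked pages outrank more relevant ones.

```python
# Toy sketch, not any real search engine's algorithm: a ranker that
# blends text relevance with past click counts. The popularity_weight
# parameter is a human design choice, and when it dominates, heavily
# clicked pages rise regardless of what the searcher actually meant.

def rank(results, popularity_weight=0.8):
    """Sort results by a blend of relevance and normalized click counts."""
    max_clicks = max(r["clicks"] for r in results) or 1
    def score(r):
        relevance_part = (1 - popularity_weight) * r["relevance"]
        popularity_part = popularity_weight * (r["clicks"] / max_clicks)
        return relevance_part + popularity_part
    return sorted(results, key=score, reverse=True)

results = [
    {"url": "educational-site", "relevance": 0.9, "clicks": 1_000},
    {"url": "heavily-clicked-site", "relevance": 0.3, "clicks": 90_000},
]

# With popularity weighted at 0.8, the less relevant but far more
# clicked page takes the top spot.
print(rank(results)[0]["url"])  # prints heavily-clicked-site
```

Nothing in the math is dishonest; the bias lives in the weight someone chose, which is exactly why ‘the data said so’ is never the end of the conversation.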

In the ‘information age,’ or whatever we are calling it at this point, what this means is that preexisting priorities can be amplified in a way that they were not before. More people see these results, and more people can be influenced by them. The search engine does not necessarily provide truth, accuracy, or expertise. Instead, it can provide a magnified version of the problematic, inaccurate, and hurtful representations that have been created and enforced over time. The algorithm is not truth; the algorithm is media.

Back to Amazon. It may feel like I’ve ventured from that New York Times article, but I haven’t really. As with a search engine, the methods of measuring worker productivity are as subject to human fallibility as other ways of measuring output. As humans create new ways to measure success, those measurements will reflect preexisting notions of what success means and what a successful person looks like. When this phenomenon is hidden behind a perceived objectivity in numerical assessment, it is more difficult to argue against. When a manager can simply point to the ‘data’ rather than having an in-depth conversation about what worker output should look like, the workers themselves are left at a loss. In order to participate in negotiations at this level, workers either need to have a high-level understanding of how the criteria for success are quantified or they need to excel within the manager’s data-centric assessment system. And, clearly, excelling in a manager-designed assessment system may not best serve the needs of the worker.

There is a lot more to talk about here, but it will have to wait for another day. I’ll just leave you with one suggestion: it is time we stop asking to see the numbers and start asking to see the math.