The Society of American Archivists (SAA) 2012 annual meeting, Beyond Borders, began Monday, August 6 in San Diego with strong pre-conference sessions. I attended Digital Forensics for Archivists (DFA), a course focusing on specific tools and services that archivists need to use for their work with digital archives.

This is one of the many courses offered by SAA in its Digital Archives Specialist (DAS) Certificate Program.


I understand if the emerging partnership between law enforcement and the archival enterprise seems unusual. But consider: digital forensics has established principles, technologies and methods for extracting data and associated metadata that closely parallel archival repositories’ best practices.

In other words, this class is not for the faint of heart.

Enter our hero, instructor Dr. Cal Lee, Associate Professor at the University of North Carolina at Chapel Hill. Prior to class, he distributed two illustrative papers:

  • Digital Forensics and Born-Digital Content in Cultural Heritage Collections by Matthew G. Kirschenbaum, Richard Ovenden and Gabriela Redwine with research assistance from Rachel Donahue, and
  • his own Extending Digital Repository Architectures to Support Disk Image Preservation and Access, collaboratively written with Kam Woods and Simson Garfinkel

which we dutifully read. Obligation became pleasure as the first treatise unfolded; however, at 109 pages it’s a bit of a tome. In ten pages the second article summarizes the first (let’s hear it for brevity!). I recommend them both.

Motivation and Scope

Dr. Lee opened his commentary with thoughts on motivation. “Archivists are often responsible for acquiring or helping others access materials on removable storage media,” he said. “Often information is neither packaged nor described as one would hope. Information professionals must extract whatever useful information resides on the medium, while avoiding the accidental alteration of data or metadata.”

He defined digital forensics as “the process of identifying, preserving, analyzing and presenting digital evidence in a manner that is legally acceptable.” The practice involves multiple methods of discovering digital data and recovering deleted, encrypted or damaged file information. He presented compelling points as to why archivists should care.

“Two streams of activity show great promise for informing the practices of archivists:
  • a handful of innovative projects of collecting institutions exploring the application of digital forensics to acquisition, and
  • vendors and academic programs providing digital forensics training.”

He spoke reverently of several digital forensics projects: Stanford’s SULAIR, the Bodleian Library futureArch and the British Library in London.

Technical Background

Dr. Lee explained that digital objects are sets of instructions for future interaction. “Digital objects are useless if no one can interact with them. Interactions depend on numerous technical components.” He outlined eight levels of representation, numbered from Level 7 down to Level 0:

  • Level 7: aggregation of objects. A set of objects that form an aggregation that is meaningful when encountered as an entity.
  • Level 6: object or package. An object composed of multiple files, each of which could also be encountered as individual files.
  • Level 5: in-application rendering. As rendered and encountered within a specific application.
  • Level 4: file thru filesystem. Files encountered as a discrete set of items with associated paths and file names.
  • Level 3: file as “raw” bitstream. Bitstream encountered as a continuous series of binary values.
  • Level 2: sub-file data structure. Discrete “chunk” of data that is part of a larger file.
  • Level 1: bitstream thru I/O equipment. A series of 1s and 0s as accessed from the storage media using input/output hardware and software.
  • Level 0: bitstream on physical medium. A set of physical properties of the storage medium that are interpreted as bitstreams at Level 1.
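
To make the middle of that list concrete, here is a minimal Python sketch (mine, not from Dr. Lee's course materials) that looks at one small file at Levels 4, 3 and 2. The file name and its contents are made up for illustration, and treating "the first line" as the sub-file structure is only one possible reading of Level 2.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "letter.txt")
with open(path, "wb") as f:
    f.write(b"Dear archivist,\nPlease keep this.\n")

# Level 4: the file as the filesystem presents it (a name, a path, metadata).
info = os.stat(path)
level4 = {"name": os.path.basename(path), "size": info.st_size}

# Level 3: the same file as a raw bitstream, with no name or path attached.
with open(path, "rb") as f:
    level3 = f.read()

# Level 2: a sub-file data structure, here simply the first line of the text.
level2 = level3.split(b"\n")[0]

print(level4)            # what a file browser would show
print(level3[:8].hex())  # the first eight bytes, as hexadecimal
print(level2)            # one chunk inside the larger file
```

The same bytes sit underneath all three views; what changes is how much context (name, path, internal structure) travels with them.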

He cited interaction examples of each level. “Archivists fear three complicating factors the most,” said Dr. Lee. “Medium failure/bit rot, obsolescence, and volatility.” He launched into a detailed overview of where and how a computer stores information: the computer memory hierarchy, sectors, clusters, magnetic disk, hard drive structure, caching, configuration/log files and the increasingly popular solid-state drive as well as areas designed to store temporary data.

He defined representation information according to the OAIS. I was particularly intrigued by the challenges presented by the consistent rendering of fonts and character encoding. We quickly interrupted the section with an exercise in bitstream corruption and then we were back on track discussing how computers store and manage files (in other words, volumes versus partitions).
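
The class exercise itself isn't reproduced here. As a rough sketch of the underlying idea, the Python below flips a single bit in a short byte string and compares SHA-256 checksums before and after. The sample text and the choice of SHA-256 are my assumptions, not details from the course.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"

# Flip the lowest bit of the first byte: "T" (0x54) becomes "U" (0x55).
damaged = bytearray(original)
damaged[0] ^= 0x01
corrupted = bytes(damaged)


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Count how many bits differ between the two bitstreams.
bits_changed = sum(bin(a ^ b).count("1") for a, b in zip(original, corrupted))

print(corrupted[:9])
print("bits changed:", bits_changed)
print("checksums match:", sha256_hex(original) == sha256_hex(corrupted))
```

One changed bit out of several hundred is enough to produce a different checksum, which is why recording a checksum at acquisition gives you a way to detect later alteration.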

The conversation shifted slightly to outline file system examples for PC and Mac. He defined the file systems archival institutions are most likely to encounter (ext, ext2, ext3, FAT16, FAT32, HFS, HFS+, ISOFS and NTFS). Next, Dr. Lee offered the pros and cons of two main technical options for curation of digital data over time:

  • capture periodic snapshots of full state information
  • capture an audit trail of changes that have occurred along the way
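
A small Python sketch may help contrast the two options. It records two snapshots as path-to-checksum maps and then derives audit-trail style entries by comparing them. The function names, file names and data are all hypothetical; this is an illustration of the trade-off, not a tool discussed in class.

```python
import hashlib


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Record full state: one SHA-256 digest per path."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changes_between(old: dict[str, str], new: dict[str, str]) -> list[tuple[str, str]]:
    """Derive audit-trail style entries by comparing two snapshots."""
    events = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            events.append(("added", path))
        elif path not in new:
            events.append(("removed", path))
        elif old[path] != new[path]:
            events.append(("modified", path))
    return events


# Hypothetical collection state at two points in time.
monday = {"draft.doc": b"version 1", "notes.txt": b"call donor"}
friday = {"draft.doc": b"version 2", "photo.jpg": b"\xff\xd8\xff"}

trail = changes_between(snapshot(monday), snapshot(friday))
print(trail)
```

Comparing snapshots only reveals the net change between captures: if draft.doc was edited three times during the week, the comparison still shows a single "modified" entry. A true audit trail would record each edit, but only if it was already running when the changes happened.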

He concluded:

“Technical strategy depends upon the raw size of the files, how often changes occur, how easy changes are to detect, the underlying structure of content, how tightly coupled the different files are, and what you’re actually trying to document. Consider the changes in the past ten years to the applications that we use today. They were prompted by communication from end users to software designers. Embedded PID GUID was abandoned in Office 2000. ‘Fast change’ was turned off by default in Word 2000. No more accidental dumping of RAM slack. We owe the e-discovery and forensics fields a debt.”

Conclusion

The agenda included three more partitions:

  • Ensuring the completeness and evidential value of data when acquired
  • Gleaning information from what you have (working down from the file level), and
  • Preserving and providing access to data from forensic disk images

We even conducted an in-class lab exercise in curation of unidentified files!

We sprinted through this rich class. SAA, I think it should be two days. As with your other DAS classes that I’ve attended this year, I think the workshop is missing a high-level pictorial to describe how the agenda meets class objectives. Those are my only critiques. Colleagues, if you need a brush-up on your skills or you’re concerned about long-term data retention in your information ecology, register and attend.

The Annual Meeting of the Society of American Archivists continues through Saturday, August 11, in San Diego.

Image courtesy of mkabakov (Shutterstock)

Editor's Note: To read more of Mimi Dionne's thoughts from past SAA meetings:

--  The Future is Now: New Tools to Address Archival Challenges #saa11