The 75th Annual Meeting of the Society of American Archivists (SAA) offered insights into new tools that address archival challenges. Here's a look.

Picking up where I left off in my last article (What Happens After 'Here Comes Everybody': An Examination of Participatory Archives), the discussions continued with the following chairperson and panelists:

Chairperson Mark Conrad, National Archives and Records Administration
Panelist Kenton McHenry, National Center for Supercomputing Applications
Panelist Maria Esteva, Texas Advanced Computing Center
Panelist William E. Underwood, Jr., Georgia Tech Research Institute

Mark Conrad, National Archives and Records Administration

Conrad opened the session with commentary on the importance of new tools to help archivists carry out their functions from appraisal to access.

Kenton McHenry, National Center for Supercomputing Applications

McHenry presented The ISDA Tools: Computationally Scalable File Migration Services to Keep Your Files Current.

Many different formats can save the same information. Will companies still be able to read their proprietary formats in the years to come? Will the specifications still be there? Were the specifications ever available to begin with?

For example, consider 3D files. Software vendors create their own formats to save them. Many are proprietary (in fact, there are over 140 proprietary 3D formats). The recommended solution is to convert files to an open/standardized format -- with the understanding that some information may be lost. Is there third-party software available to do this?

One answer is Polyglot, a service created in 2009 to provide an extensible, scalable, and quantifiable means of converting between formats.

The system is extensible in that new conversion software can easily be incorporated, scalable in that the workload can be distributed among parallel machines, and quantifiable in that it has a built-in framework for measuring information loss across conversions.

McHenry described "imposed code reuse," or wrapping third-party compiled software with a programmable interface. Operations are captured through a GUI scripting language (e.g., "AutoHotKey") to create a simple workflow, to compare files before and after the conversion, and to measure information loss. Polyglot can also create an elegant input/output (I/O) graph that illustrates the relationships from proprietary to open file types. The result: conversion work can be distributed across multiple machines and accessed through the Web.
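
To make the idea concrete, here is a minimal sketch of what wrapping a GUI application behind a programmable interface might look like, written in Python. The AutoHotKey executable path, the automation script, and the size-based loss estimate are all illustrative assumptions, not part of the ISDA tools.

    import subprocess
    from pathlib import Path

    # Hypothetical paths: adjust to the local AutoHotKey install and to a
    # script that opens a file in the target application and saves it out.
    AHK_EXE = r"C:\Program Files\AutoHotkey\AutoHotkey.exe"
    CONVERT_SCRIPT = r"scripts\open_save_as.ahk"

    def convert(input_path: str, output_path: str) -> None:
        """Impose a programmable interface on a GUI application via a scripted workflow."""
        subprocess.run([AHK_EXE, CONVERT_SCRIPT, input_path, output_path], check=True)

    def crude_loss_estimate(before: str, after: str) -> float:
        """Very rough information-loss proxy: relative change in file size."""
        b, a = Path(before).stat().st_size, Path(after).stat().st_size
        return abs(b - a) / max(b, 1)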

ISDA File Migration Tools

Tool #1: Conversion Software Registry (CSR), for conversions between formats. From the website:

The work is motivated by a community need for finding file format conversions inaccessible via current search engines and by the specific need to support systems that could actually perform conversions, such as NCSA Polyglot. In addition, the value of the CSR is in complementing the existing file format registries and introducing software quality information obtained by content-based comparisons of files before and after conversions. The contribution of this work is in the CSR data model design that includes file format extension based conversion, as well as software automation scripts, software quality measures and test file specific information for evaluating software quality.

Tool #2: Software Servers. From the website:

Software servers are founded on a notion of "imposed code reuse" or "software reuse". Analogous to traditional "code reuse", "software reuse" involves re-using functionality within compiled code. Sticking with our analogy to "code reuse" this involves re-imposing an API on the software so that the functionality can be called and used within new code.

The interface is consistent across all software; it's simple, widely accessible, and capable of being programmed against. In fact, software servers share software functionality over the Web in a manner that is somewhat analogous to the way web servers share data. (Hint: software servers make desktop applications cloud-based. But this is another article for another time.)
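
A small sketch may help picture this. Assuming a software server exposes its wrapped applications over HTTP, a client could request a conversion with nothing more than an upload; the host, port, and endpoint below are hypothetical placeholders rather than the actual ISDA API.

    import requests

    SERVER = "http://localhost:8182"          # hypothetical software server address

    def remote_convert(path: str, target_format: str) -> bytes:
        """Upload a file to a software server and ask for it back in target_format."""
        with open(path, "rb") as f:
            # The endpoint path is an assumption for illustration only.
            resp = requests.post(f"{SERVER}/convert/{target_format}", files={"file": f})
        resp.raise_for_status()
        return resp.content                   # bytes of the converted file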

McHenry then exercised the software functionality sharing layer. Focusing on software server robustness, he demonstrated the throughput of software on a software server and outlined the results.

Tool #3: Polyglot using Software Servers. From the website:

The Polyglot service, focused on conversions, utilizes the "open", "save", "import", and "export" operations provided by a collection of distributed software servers. From these operations an input/output graph is constructed which stores formats at its vertices and conversions between input and output formats using a particular piece of software as its edges. In order [to] perform a conversion between a given input and output format we search this graph for a shortest path between the formats, identifying applications capable of performing the conversion and then calling the corresponding software server operations to carry it out.

In other words, Polyglot listens for software server broadcasts on the network, catalogs the available input/output operations, identifies conversion paths between them, and then proceeds with the (extensible) service.
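
As a rough illustration of the I/O-graph idea, the sketch below builds a tiny graph whose vertices are formats and whose edges are conversions offered by some application, then finds a shortest conversion path with a breadth-first search. The formats and application names are made up for the example; Polyglot's real graph is discovered from live software servers.

    from collections import deque

    # Hypothetical edges: (input format, output format) -> application offering it.
    EDGES = {
        ("stp", "obj"): "AppA",
        ("obj", "x3d"): "AppB",
        ("stp", "igs"): "AppC",
    }

    def shortest_conversion_path(src: str, dst: str):
        """Breadth-first search for the shortest chain of conversions from src to dst."""
        graph = {}
        for (a, b), app in EDGES.items():
            graph.setdefault(a, []).append((b, app))
        queue, seen = deque([(src, [])]), {src}
        while queue:
            fmt, path = queue.popleft()
            if fmt == dst:
                return path                          # list of (application, in, out) steps
            for nxt, app in graph.get(fmt, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(app, fmt, nxt)]))
        return None                                  # no conversion path exists

    print(shortest_conversion_path("stp", "x3d"))    # [('AppA', 'stp', 'obj'), ('AppB', 'obj', 'x3d')]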

Tool #4: Versus

A Java library/framework for comparing file content. It is still under development, but it already has the following (a rough sketch of the underlying pattern follows the list):

  • framework/API designed
  • distributed architecture
  • RESTful web interface
  • adding extractors, measures
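
The extractor-plus-measure pattern is the core idea; a minimal Python illustration follows, assuming a byte-histogram extractor and a normalized distance measure. This is not the Versus Java API, only a sketch of the kind of comparison such a framework automates.

    from collections import Counter

    def extract_byte_histogram(path: str) -> Counter:
        """Extractor: reduce a file to a byte-value histogram (a simple content signature)."""
        with open(path, "rb") as f:
            return Counter(f.read())

    def histogram_distance(a: Counter, b: Counter) -> float:
        """Measure: normalized L1 distance between histograms (0.0 means identical mixes)."""
        keys = set(a) | set(b)
        total = sum(a.values()) + sum(b.values()) or 1
        return sum(abs(a[k] - b[k]) for k in keys) / total

    # Usage: estimate information loss by comparing a file before and after conversion.
    # loss = histogram_distance(extract_byte_histogram("before.stp"),
    #                           extract_byte_histogram("after.x3d"))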

McHenry concluded his presentation with further references to ISDA’s research.

Maria Esteva, Texas Advanced Computing Center

Esteva’s presentation, Mapping Archival Practices to Visualization, asked:

  • How can an archivist examine and process large electronic records collections?
  • What are the current opportunities and limitations?
  • What are the services and infrastructure needed?

Her presentation covered visual analytics: data analysis methods, visualization and interactivity; discoveries and inferences; the archival perspective; and design aspects.

“The key aspect,” Esteva said, “is making decisions about how to narrow information down so the results are meaningful.”

Using keywords, imagine searching across multiple finding aids from multiple repositories. Each repository and its classes of records are color-coded (a throwback to paper files). The repositories and their finding aids are searched simultaneously. Instead of spreadsheet results, she sees color-coded boxes of various sizes. Instead of searching one directory, path by path, she sees all of the high-level directories and knows from the size of each square how many images are available.
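
A minimal sketch of what drives those size-coded squares: walk a collection and count the files under each high-level directory, so each directory can be drawn as a rectangle scaled to its contents. The root path is a placeholder, and TACC's actual pipeline extracts far more than counts.

    import os
    from collections import Counter

    def files_per_top_level_dir(root: str) -> Counter:
        """Count files under each high-level directory of a collection."""
        counts = Counter()
        for dirpath, _dirnames, filenames in os.walk(root):
            rel = os.path.relpath(dirpath, root)
            top = rel.split(os.sep)[0] if rel != "." else "."
            counts[top] += len(filenames)
        return counts

    # Usage: counts = files_per_top_level_dir("/path/to/collection")
    # Each (directory, count) pair can then be rendered as a square sized by count.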

While more than one record group can be represented by squares describing huge result sets, visualization makes correlations in the data easy to discern. Using the tools created by the Texas Advanced Computing Center (TACC), the researcher can determine which subcollections have more pertinent records.

The TACC's project automatically associates broader tags from the corresponding HTML pages. A specific tag equals a specific image. HTML pages are parsed to give further classification. They're also matched to file extension type. She can select the features she wants for a more dynamic visualization of collection content. Her research results narrow from 36,000 images to the one or two most representative, even if the records are complex.

It's not just for researchers. For repositories, the project provides a behind-the-scenes framework (a minimal sketch of the first steps follows the list):

  • metadata extraction;
  • organization in an RDBMS;
  • knowledge organization (classes, categories);
  • queries, aggregations, data mining, statistical calculations, regular expression matching;
  • data transfer, computing, display systems; and,
  • visual representation as pixel-based rendering.
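
The first two items, metadata extraction and organization in a relational database, might look something like the sketch below. The schema and queries are illustrative assumptions, not the TACC framework itself.

    import os
    import sqlite3

    def index_collection(root: str, db_path: str = "records.db") -> None:
        """Extract basic file metadata and organize it in an RDBMS (SQLite here)."""
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS files
                       (path TEXT PRIMARY KEY, ext TEXT, size INTEGER, mtime REAL)""")
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                st = os.stat(p)
                con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                            (p, os.path.splitext(name)[1].lower(), st.st_size, st.st_mtime))
        con.commit()
        # Aggregation example: file counts by extension, ready to feed a visualization.
        for ext, n in con.execute("SELECT ext, COUNT(*) FROM files GROUP BY ext"):
            print(ext, n)
        con.close()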

An archival repository's staff can use this tool to analyze the structure and characterization of records as well as to detect errors.

Plus, although it's still a prototype, TACC has built a multi-touch display for working with the data. The audience only saw pictures, but it somewhat resembles Microsoft's Surface.

Esteva concluded her presentation with these thoughts about the project:

  • It’s a wonderful research package that provides analysis, visual representation, an interactive display and infrastructure support;
  • It uses inductive methods to go from massive data to meaningful points;
  • It integrates various layers of information instantaneously;
  • It honors archivists' expertise: metadata, ontologies, ideas and design; and,
  • It allows staff to decide the form/shape of archival systems.

William E. Underwood, Jr., Georgia Tech Research Institute

Underwood began his presentation, Tools for File Types and Records Type Identifications, with his research motivation. Archivists need the capability to identify formats for ensuring compliance with the record transmittal agreement. Viewing and playing files, conversion to current or standard file formats, archive extraction, password recovery and decryption, repair of damaged files -- these are the issues that have occupied him during his 25+ years of professional experience.

Metadata extraction is a critical aspect of the ingestion of textual e-records into digital archives and libraries. Metadata is needed to support description of individual e-records and aggregations of these records and to support search and retrieval of records.

But first, he reviewed definitions.

  • A file format is a set of rules for encoding and decoding data or computer instructions in a file;
  • A file type is a class of files with the same file format;
  • A file format signature is invariant data in a file format that can be used to identify the file type (or format) of a file;
  • The magic number is the concept of an internal file format signature.

External file format identifiers come in the form of file name extensions or metadata stored by the operating system (think Multipurpose Internet Mail Extensions (MIME) media types or PRONOM unique identifiers (PUIDs)). The Unix (and now Linux) file command, with its magic file, is probably the most widely used tool for file format identification, he advised the audience. The file command applies tests for magic numbers contained in files. But there are limitations:

  • Difficult to update the tests for magic numbers
  • Tests that may give conflicting results must be properly sequenced
  • Tests for magic numbers are not one-to-one with file types
  • Tests output metadata as well as file type
  • Tests for character set and language of text files need refinements
  • Only a few tests exist for MS Windows file types
  • Tests for magic numbers have not been rigorously tested

Underwood demonstrated a magic test for Broadcast Wave Format V1 (for an introduction to the Broadcast Wave Format, see here).
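
For readers who want a feel for what such a test checks, here is a minimal sketch, assuming the commonly documented Broadcast Wave layout: a RIFF container ("RIFF" at offset 0, "WAVE" at offset 8) carrying a Broadcast Audio Extension ("bext") chunk. A production test like the one demonstrated would parse the chunk structure and read the bext version field rather than scan raw bytes.

    def looks_like_broadcast_wave(path: str) -> bool:
        """Crude internal-signature check for a Broadcast Wave file."""
        with open(path, "rb") as f:
            header = f.read(12)
            if len(header) < 12 or header[0:4] != b"RIFF" or header[8:12] != b"WAVE":
                return False
            return b"bext" in f.read()   # presence of the Broadcast Audio Extension chunk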

After the demonstration he continued to define terms for the audience. The documentary form consists of intellectual form and physical form. Intellectual elements are the terms or semantic categories that are common to a document type. Intellectual form comprises the rules that characterize the possible combinations of intellectual elements. Physical elements are the physical attributes of the intellectual elements. Physical form comprises the rules that characterize the possible physical layouts.

So, he posed to the audience: what is the best method for recognizing document forms and extracting metadata?

Document types are physical. They have specific grammars (for example, a memorandum) augmented with semantic rules. Once the parse tree and semantics of a document are established, metadata is extracted for item description and indexing. Underwood's team wrote grammars and semantics for 14 documentary forms. Then they compared the same 14 grammars, converted them to text, and ran them through data and documentary extractions.
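
A minimal sketch of the recognize-then-extract idea for a single documentary form, a memorandum, follows, using regular expressions in place of a formal grammar. The field labels are illustrative assumptions; Underwood's team used grammars augmented with semantic rules, not this simplified pattern matching.

    import re

    # Hypothetical intellectual elements of a memorandum, matched by label.
    MEMO_FIELDS = {
        "date":    re.compile(r"^DATE:\s*(.+)$",    re.IGNORECASE | re.MULTILINE),
        "to":      re.compile(r"^TO:\s*(.+)$",      re.IGNORECASE | re.MULTILINE),
        "from":    re.compile(r"^FROM:\s*(.+)$",    re.IGNORECASE | re.MULTILINE),
        "subject": re.compile(r"^SUBJECT:\s*(.+)$", re.IGNORECASE | re.MULTILINE),
    }

    def extract_memo_metadata(text: str):
        """If all elements are present, return them as descriptive metadata; otherwise None."""
        found = {}
        for name, pattern in MEMO_FIELDS.items():
            match = pattern.search(text)
            found[name] = match.group(1).strip() if match else None
        return found if all(found.values()) else None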

In summary: the intellectual elements of documentary forms can be defined in terms of keywords and semantics.

Underwood is eager to ask the next set of questions.

  • Can the intellectual elements of documentary forms be learned without a teacher?
  • Can grammatical induction be used with examples of particular document types to induce a grammar automatically?
  • Can the recognition method be extended to include the physical elements of documentary form and grammatical definition of the physical layout?

See Underwood’s team site here. See the latest research publication here.

Also, check out DROID: it is the first in a planned series of tools developed by The National Archives under the umbrella of its PRONOM technical registry service.

Editor's Note: You may also be interested in reading these other articles by Mimi Dionne: