The question of creating access to information is not a new one for archivists, but how some archivists are answering that question when handling large quantities of hard copy data in a digital environment is.

The Society of American Archivists (SAA) 2012 annual meeting, “Beyond Borders,” concluded Saturday, August 11, 2012 in San Diego.

My conference was especially enjoyable because I heard Mark Conrad of the National Archives and Records Administration moderate a panel featuring:

  • Richard Marciano (University of North Carolina, Chapel Hill),
  • Weijia Xu (The University of Texas at Austin, Texas Advanced Computing Center), and
  • Kenton McHenry (University of Illinois, Urbana-Champaign, National Center for Supercomputing Applications).

Unifying Access to Digitized and Born-Digital Collections

Richard Marciano

For background on the CI-BER project’s visual representation of geographic data, see this beautifully written essay “A System for Scalable Visualization of Geographic Records,” by Jeff Heard and Richard Marciano.

Cultural content is big data. As of last week, thanks to crowdsourcing, 130 million names have been indexed. In the cultural data space, really interesting things are happening using community as an asset. Think about the recently released 1940s census: the amount of data is staggering, 6 million images. Extracting content at the cell and column level means transforming 7 billion cells from paper to digital. That presents access and visualization challenges, right? “In the same vein as those eHarmony algorithms, we’ve built tools to determine what’s applicable in cultural space,” said Marciano.

Enter the Internet Archive’s North Carolina historical city directories, a huge test bed of 100 million records. Team Marciano applied optical character recognition (OCR), ran XSLT scripts, extracted content, and fully indexed the city directories database at a size of 5 terabytes; they constructed prototypes, trawled the collection for geospatial records, and extracted them from inside the files. The results are impressive. He demoed a treemap interface, its colored blocks representing numbers of records -- so dense that it resembled the footprints of the records themselves.
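
The extraction step described above turns OCR'd text into structured records. As a hedged illustration only (the project's actual patterns and entry format are not shown in the talk), a minimal sketch of pulling fields out of a hypothetical city-directory line might look like this:

```python
import re

# Hypothetical sketch: pull name, occupation, and address out of one
# OCR'd city-directory line. Real entries vary far more than this
# single pattern allows, so production code would need many patterns
# plus correction of OCR errors.
ENTRY = re.compile(
    r"^(?P<name>[A-Z][\w'.-]+(?: [A-Z][\w'.-]+)*),\s*"
    r"(?P<occupation>[a-z][\w .-]*?),\s*"
    r"(?P<address>\d+ .+)$"
)

def parse_entry(line):
    """Return a dict of fields, or None if the line doesn't match."""
    m = ENTRY.match(line.strip())
    return m.groupdict() if m else None

record = parse_entry("Smith John, carpenter, 412 Main St")
```

At 100 million records, a parser like this would be run in bulk across the OCR output before the results are loaded into the index.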

“Our goal is to bridge the paper collection. Provenance doesn’t matter, not really -- we ask large, integrative questions. This project is not for the faint of heart. For example, scholars may want to know: what are the patterns of integration and discrimination across cities? Consider it: soon we will be able to link 1940s census data to the 1960s. These are unique questions, and a lot of research can be done here. Essentially we’re asking, ‘How do you extract content at scale, blend it, and make federated national collections available?’”

Supporting Dynamic Access and Analysis of Large Scale Digital Collections

Weijia Xu

The archival enterprise is moving in the right direction. In 2001, we downloaded, printed, and curated, then provided access to hard copy. Beginning in 2005, we crawl and compress; we curate the digital collection; we provide access to the digital data directly; we search by keywords; we retrieve web pages at a specific time by URL.

“How do we unify access?” began Xu. “Because data keeps growing, the problem -- in the case of library science -- is getting more complicated because of different formats, and because of our dependency on humans to index them.”

Take, for example, the LAGDA web archive:

  • Files archived to date: 68 million
  • Data archived on disk: 5.8 TB
  • Archive growing at 900 GB/year
  • Files crawled each quarter:
    • HTML pages: 1.6 million per crawl
    • PDF documents: 260,000 per crawl

“Assume a student needs 5 minutes to review and index a document. To process 2 million documents would require that person to work for 69 weeks.”

Xu’s mission is to integrate frameworks to access digital collections dynamically.

“We need a more unified way to handle different types of data,” said Xu. The general problem: the majority of data access requests retrieve only a small subset of the collection:

  • Access through keyword based search
  • Access through predefined categorization
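
Keyword-based access of this kind is conventionally backed by an inverted index. The sketch below is illustrative only (not LAGDA's actual implementation, which would use a scalable search engine and real tokenization), but it shows the core data structure:

```python
from collections import defaultdict

# Minimal inverted-index sketch for keyword-based access: map each
# term to the set of document ids containing it, then answer a
# multi-term query by intersecting the posting sets.
docs = {
    1: "annual report on water quality",
    2: "state water board meeting minutes",
    3: "annual budget summary",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The point of the structure is exactly the "small subset" property above: a query touches only the posting sets for its terms, not the whole collection.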

Meanwhile, we have an increasing need to incorporate access with data analysis:

  • As the size of the collection increases, the categorization may become insufficient over time
  • As the size of the collection increases, the criteria used for collection access may change over time
  • The collection may contain data that does not share the same unified properties

Some of the challenges in accessing big data?

  • Need scalable methods for analysis
  • Distributed data storage and management

Data must be accessible and converted to knowledge. The proposed framework is simple: enter the MapReduce programming model, developed by Google and commonly used in cloud computing for large-scale data analysis. The goal: to automatically attach potential labels to documents in a large collection based on a given set of training documents. The challenge: the content of documents may vary greatly. For example, each document may cover broad subject areas, while the training documents may contain “local features” within sub-groups.

Progress so far?

“Traditional classification methods based on majority words in the document do not work well. Ongoing projects include synthesizing results from multiple features, exploring metadata, and identifying themes in large document collections,” said Xu.

Xu’s conclusion: for large data collections, data access may need more dynamic and flexible solutions that require more computation. Seamless access covering both digitized and born-digital data requires an integrated framework that can meet the access needs of different types of data. MapReduce is a sound candidate for this.

Search, the Neglected Aspect of Digitization

Kenton McHenry

“We sampled a test bed of 1930s census data,” began McHenry. “Big spreadsheets. A lot of data. There are 3.6 million images in the 1930s census data.”

He demoed a folder structure of JPEG2000 images -- microfilm rolls with geospatial data.

He continued,

“We’re not sympathetic to computers. You see this grid of pixel values, the grey levels? While humans can evaluate intensity levels that show us the letter ‘a’, computers see a basic grid of numbers. This is at the heart of the big data problem: there are 3.6 million high-resolution images -- 7 billion units of information. We have to process all of this while managing user expectations and fast queries. Today, we transcribe manually, thanks to thousands of people. Weigh the drawbacks, logistics, costs -- and we’re still not 100% accurate. So, we outsource to countries where labor is cheap. How do we automate? This is the cotton gin for digital search.”

OCR of typewritten characters has an accuracy rate between 71 and 98 percent. But handwriting is mostly connected characters, with varying slants, shapes, and degrees of sloppiness. A computer copes by constraining the lexicon and controlling variability.

The word-spotting method doesn’t transcribe words or characters; it computes a distance between images of words and retrieves similar images using content-based image retrieval (CBIR). It looks at the pixels in an image, assumes they form handwriting, and extracts features; it then measures the feature distance against every image in the database. Word-spotting accuracy can be measured in a number of ways: the percentage of top-ranked returned images matching the query, average precision, or the percentage of queries matching within the top N images.
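
In other words, word spotting is nearest-neighbor retrieval over word images. A hedged sketch, with made-up four-number feature vectors standing in for the real features (projection profiles, gradients, and the like) that an actual system would extract:

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query_features, collection):
    """Return image ids ordered from most to least similar."""
    return sorted(collection,
                  key=lambda img: distance(query_features, collection[img]))

# Hypothetical feature vectors for three word images; no transcription
# happens anywhere -- similar handwriting yields nearby vectors.
collection = {
    "img_smith_1": [0.9, 0.1, 0.4, 0.7],
    "img_jones_1": [0.2, 0.8, 0.6, 0.1],
    "img_smith_2": [0.7, 0.3, 0.5, 0.5],
}
ranking = rank([0.85, 0.15, 0.45, 0.65], collection)
```

The expensive part at census scale is exactly this: computing and comparing features for every word image in the database, which is where the CPU-hour figures below come from.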

Initially, word spotting on the test bed of 1930s census data came at an expense of 49,331 CPU hours, with indexing at 21,511,276 CPU hours.

McHenry compromised: he broke up the indexing activity by state. The plus side: users could select which state they wanted to search. Suddenly the indexing portion dropped to 2 million CPU hours.

“If you perform a Google-like search, for example a US census search, one server is generating images and fonts. This is important for the human-in-the-loop aspect. A user types in text, and we’re mining images as passive crowdsourcing (think of crowdsourcing that’s actively OCR’ing a book). The hardest is the name field, but it does work. Hidden crowdsourcing takes 213 days, or 7 months, of processing by a supercomputer. The biggest piece is indexing.”

McHenry demoed examples to support his argument.

The project framework is freely available. The great news: it can be used to provide searchable access to digitized content right now! It’s perfect for small archives working with small research grants.

Editor's Note: To read more of Mimi's reports from the SAA Conference:

-- Partnerships New and Old: Preservation in the 21st Century #saa12