The Social Networks and Archival Context (SNAC) project is an ambitious one that seeks to locate records of historical importance across repositories and make them available to patrons on a massive scale. Our panel updated us on its fascinating progress. Look at what we records and information management professionals can do.
The Society of American Archivists (SAA) 2012 annual meeting, “Beyond Borders,” concluded Saturday, August 11, 2012, in San Diego.
Tammy Peters of the Smithsonian Institution introduced her panel:
- Ray R. Larson (University of California, Berkeley)
- Daniel Pitti (University of Virginia, Institute for Advanced Technology in the Humanities)
- Jerry Simmons (National Archives and Records Administration).
The Social Networks and Archival Context Project: Status Report
Ray R. Larson
Mr. Larson delivered an update on SNAC. Officially, the goals of the project are to further the transformation of archival description and to separate the description of records from the description of the people documented in them. Translation: the project is meant to make available records of historical importance and
- enhance access to archives resources, through all cultural heritage resources; and
- enhance understanding of those resources.
We’re talking big data: a sample of 150,000 EAD-encoded finding aids contributed from around the world by national libraries and others, including:
- Library of Congress
- National Archives and Records Administration
- Smithsonian Institution
- British Library
- Archives nationales (France)
- Bibliothèque nationale de France
- OCLC WorldCat and VIAF
- Getty Vocabulary Program.
Institutes like the Getty Vocabulary Program have contributed a union list of artist names (make that: 293,000 personal and corporate names).
The problem: a proliferation of name forms (for example, different people with the same name). EAD records are full of family names, and the structure notes the creator of the archive (typically with a complete biography). This biography is extracted into an Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) record.
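To make the extraction step concrete, here is a minimal sketch of pulling a creator name and biographical note out of an EAD finding aid. It assumes EAD 2002 markup without namespaces, where creators sit in `<origination>` and the biography in `<bioghist>`. The sample document, the `extract_creators` function, and the dictionary output are illustrative assumptions, not the SNAC project's actual code, and a real pipeline would write EAC-CPF XML rather than dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD 2002 finding aid (no namespace).
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Jane Example papers</unittitle>
      <origination label="Creator">
        <persname>Example, Jane, 1900-1980</persname>
      </origination>
    </did>
    <bioghist>
      <p>Jane Example was a hypothetical botanist.</p>
    </bioghist>
  </archdesc>
</ead>
"""

# EAD name elements mapped to the entity types used by EAC-CPF.
ENTITY_TYPES = {
    "persname": "person",
    "corpname": "corporateBody",
    "famname": "family",
}


def extract_creators(ead_xml):
    """Return one simplified record per creator named in the finding aid."""
    root = ET.fromstring(ead_xml.strip())
    archdesc = root.find("archdesc")
    if archdesc is None:
        return []

    # Join all biography paragraphs and collapse runs of whitespace.
    paragraphs = archdesc.findall("bioghist/p")
    biography = " ".join(
        " ".join("".join(p.itertext()).split()) for p in paragraphs
    )

    records = []
    for origination in archdesc.findall("did/origination"):
        for child in origination:
            if child.tag in ENTITY_TYPES:
                records.append({
                    "entity_type": ENTITY_TYPES[child.tag],
                    "name": " ".join("".join(child.itertext()).split()),
                    "biography": biography,
                })
    return records


records = extract_creators(SAMPLE_EAD)
```

Each record is a stand-in for one entity description; the same creator appearing in many finding aids produces many such records, which is why the matching and merging described next is needed.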
“We’re given names, sometimes multiple names. Identical names mean a complete Library of Congress record with attributes is available. If it’s an exact match, it’s marked. But marking doesn’t work for everything. Abbreviations are troublesome; think transliteration of non-Roman characters. We take names where we didn’t get an exact match, then test against library authority files. Do we find an exact match? We flag it as a potential merge. Is nothing matched by this stage? We create overlapping segments of three characters. Finally, we take all flagged as potential matches, do a find, make sure these are the ones we want. With the authoritative form of the name, we combine all EAC-CPF records. To give you an idea of volume, a recent test merged 93,033 person names from 114,639 person records,” said Larson.
In other words, the names are extracted from EAC-CPF and from existing EAD. If the EAC-CPF records match against one another and against existing authority records (for example, VIAF), then prototypes of historical resources and accessibility are created.
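The "overlapping segments of three characters" step is commonly called trigram matching. The sketch below shows one way such a fallback could work after an exact match fails. It simplifies the stages Larson described to two (exact match, then trigrams). The Jaccard similarity measure, the 0.5 threshold, and the function names are assumptions for illustration; the talk did not specify how SNAC scores or thresholds its matches.

```python
from __future__ import annotations


def trigrams(name: str) -> set[str]:
    """Return the set of overlapping three-character segments of a name."""
    # Lowercase and collapse whitespace so trivial differences are ignored.
    normalized = " ".join(name.lower().split())
    # Pad so that the start and end of the name form their own segments.
    padded = f"  {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two names' trigram sets (an assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_name(name: str, authority_names: list[str], threshold: float = 0.5):
    """Try an exact match first, then fall back to trigram comparison."""
    if name in authority_names:
        return ("exact", name)
    best = max(
        authority_names,
        key=lambda candidate: trigram_similarity(name, candidate),
        default=None,
    )
    if best is not None and trigram_similarity(name, best) >= threshold:
        # Flagged only; a later review step decides whether to merge.
        return ("potential", best)
    return ("none", None)


authority = ["Larson, Ray R.", "Pitti, Daniel"]
result = match_name("Larson, Ray", authority)
```

Here `result` flags "Larson, Ray R." as a potential match rather than merging it outright, mirroring the final check Larson described, in which flagged candidates are confirmed before records are combined.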
The most recent extraction results:
- Total: 175,637 EAC-CPF from 30,496
- corporate body: 47,189
- person: 12,554
- family: 2,894
“What’s important to Randy the Researcher: we’re creating standardized personas for target audiences. This is meant to be really elegant, with enhanced search that merges information from multiple sources and multiple fields from finding aids. Our future plans include conducting an assessment of activities involving members of target audiences to establish mental models. We’re going to scale the interface to millions of names; we will create visualizations that are both useful and integrated; we’ll create stable URLs between batches; provide social and personalization features; and integrate with local systems.”
Establishing a National Archival Authorities Cooperative: Developing a Blueprint
Funded by the Institute of Museum and Library Services, the objective of the NAAC is to realize archival authority description at last, because archival authority control needs to be cooperative. “Imagine consistent use of names for the same entity across descriptions. The need to maintain only a single set of shared authority records and the economic benefits of cooperation outweigh the effort,” asserted Pitti.