While archiving web pages may not spring to mind as one of the duties of an archivist, organizations are increasingly recognizing the value of keeping records of websites and social media sites to provide snapshots of their public face and context for other assets.

The Society of American Archivists (SAA) 2012 annual meeting, Beyond Borders, began Monday, August 6 in San Diego with strong pre-conference sessions.

On Tuesday I attended Web Archiving: Selection, Capture, Preservation, and Marketing, a workshop taught by practitioners: Kelly Eubank, Electronic Records Manager of the North Carolina Department of Cultural Resources; Lynda Schmitz Fuhrig, Electronic Records Archivist of the Smithsonian Institution Archives; and Jennifer Wright, Archives and Information Management Team Leader of the Smithsonian Institution Archives.

Our instructors explained the theory and best practices behind websites’ status as records, appraisal, social media sites as potential records, capture methodologies, in-house versus third-party vendor solutions, and lessons learned from their own projects.

Why Save?

"Why save websites as records?" our instructors asked. Because they’re an organization’s public face, they provide context for records and the relationships among those records, and they capture a picture of an organization’s presence. However, the ladies freely admitted, archivists can’t save everything.

Consider the numbers:

  • 500-600 million websites
  • 4.8 billion social media profiles
  • 1 billion Facebook posts per day
  • 11 new Twitter accounts per second
  • 1 hour of video posted to YouTube every minute
  • 70 million WordPress blogs

Highlights from a recent internal survey of the Smithsonian Institution’s social media usage statistics include:

  • Facebook. October 2010: 70+; April 2012: 140
  • Twitter. October 2010: 60+; April 2012: 90+
  • YouTube. October 2010: 30+; April 2012: 60+

You Have to Crawl Before You Can Walk

They shared another set of statistics, from an April 2012 Georgetown Law Library and Chesapeake Digital Preservation Group study: out of a sample of 579 URLs, admittedly mostly government and organizational websites, 38 percent experienced link rot in the past five years. The study reinforced what archivists have said for years: archival repositories must create and enforce crawling policies.

Hence, archives staff should hold one or more collection policy discussions to determine the best fit for their repository. Once the focus is clarified, however, how do archivists collect sites? Appraisal through a favorite tool: crawling.

"Crawling results in a large amount of data that needs to be preserved or stored," the team shared. "Be aware of the following caveats:

  • How much content duplicates other formats of records?
  • Does website design allow duplicate content to be excluded from crawl?
  • Do you have ability to crawl without express permission?
  • Will crawling external links provide significant added value?
  • What is appropriate for you to collect and provide access to?
  • Is there significant value added in capturing the content of links?
  • Are comments, retweets, friend lists, followers and “likes” significant?
  • Does capture method allow content to be included/excluded?
  • What tweaking will the archivist need to apply to the crawler?"
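The include/exclude questions in the list above can be made concrete with a small scope filter. What follows is only a minimal Python sketch under stated assumptions, not the configuration format of any particular crawler or vendor; the function name `in_scope`, the pattern lists and the example.org URLs are all hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical scope rules for one site; patterns use shell-style wildcards,
# where "*" matches any run of characters, including "/".
INCLUDE = ["example.org/*"]
EXCLUDE = ["example.org/calendar/*", "example.org/*.pdf"]


def in_scope(url, include=INCLUDE, exclude=EXCLUDE):
    """Return True if the URL matches an include pattern and no exclude pattern."""
    parts = urlparse(url)
    # Compare on host plus path; treat a bare host as its root page.
    target = parts.netloc + (parts.path or "/")
    if any(fnmatchcase(target, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(target, pattern) for pattern in include)


for candidate in [
    "http://example.org/about",
    "http://example.org/calendar/2012/08",
    "http://other.org/page",
]:
    print(candidate, in_scope(candidate))
```

Exclude rules are checked first, so content that duplicates other record formats (here, assumed to be a calendar and PDF files) can be left out even when it sits inside an included site.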

A repository’s crawling policy may differ on an individual website basis. The ladies advised, “It’s best to set a crawl cycle. Crawl before and after any redesign. Crawl on the day of a major event.” To track a crawling project, the reconciliation process can be administered with a tool as modest as an Excel spreadsheet with columns for website name and URL, contact information, last and next crawl dates, and next expected redesign.
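The tracking spreadsheet described above could also be kept as a plain CSV file and checked with a few lines of Python. This is a minimal sketch, assuming the sheet is exported with the column names shown and ISO-formatted dates; the column names, the sample rows and the `sites_due` helper are hypothetical, not something the instructors presented.

```python
import csv
import io
from datetime import date

# Hypothetical tracking sheet exported as CSV; columns mirror the ones
# suggested in the workshop (names, dates and contacts are made up).
SHEET = """\
website_name,url,contact,last_crawl,next_crawl,next_redesign
Main site,http://example.org/,webmaster@example.org,2012-05-01,2012-08-01,2012-12-01
Blog,http://blog.example.org/,editor@example.org,2012-07-15,2012-10-15,
"""


def sites_due(sheet_file, today):
    """Return the names of sites whose next crawl date is on or before `today`."""
    due = []
    for row in csv.DictReader(sheet_file):
        if date.fromisoformat(row["next_crawl"]) <= today:
            due.append(row["website_name"])
    return due


print(sites_due(io.StringIO(SHEET), date(2012, 8, 7)))  # prints ['Main site']
```

In practice the in-memory `SHEET` string would be replaced by an opened CSV file, and a similar check on the redesign column could flag sites that need a crawl before and after a redesign.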

Appraise Early, Appraise Often

Our moderators shared two schools of thought regarding social media in government: either state government agency sites are collected in their entirety, or local government requires that sites be evaluated by their content (and retention schedules apply to the content, not the medium).