Web Archiving: Selection, Capture, Preservation, Marketing #saa12


While archiving web pages may not spring to mind as one of the duties of an archivist, organizations are increasingly recognizing the value of keeping records of websites and social media sites to provide snapshots of their public face and context for other assets.

The Society of American Archivists (SAA) 2012 annual meeting, Beyond Borders, began Monday, August 6 in San Diego with strong pre-conference sessions.  

On Tuesday I attended Web Archiving: Selection, Capture, Preservation, and Marketing, a workshop taught by practitioners: Kelly Eubank, Electronic Records Manager of North Carolina Department of Cultural Resources; Lynda Schmitz Fuhrig, Electronic Records Archivist of the Smithsonian Institution Archives; and Jennifer Wright, Archives and Information Management Team Leader of the Smithsonian Institution Archives.

Our instructors explained the theory and best practices behind websites’ status as records, appraisal, social media sites as potential records, capture methodologies, in-house versus third party vendor solutions, and lessons learned from their own projects.

Why Save?

“Why save websites as records?” our instructors asked. Because they’re an organization’s public face, they provide context to records and the relationships among those records, and they capture a picture of an organization’s presence. However, the ladies admitted freely, archivists can’t save everything.

Consider the numbers:

  • 500-600 million websites
  • 4.8 billion social media profiles
  • 1 billion Facebook posts per day
  • 11 new Twitter accounts per second
  • 1 hour of video posted to YouTube every minute
  • 70 million WordPress blogs

Highlights from a recent internal survey of the Smithsonian Institution’s social media usage statistics include:

  • Facebook. October 2010: 70+; April 2012: 140
  • Twitter. October 2010: 60+; April 2012: 90+
  • YouTube. October 2010: 30+; April 2012: 60+

You Have to Crawl Before You Can Walk

They shared another set of statistics, from an April 2012 study by the Georgetown Law Library and the Chesapeake Digital Preservation Group: out of a sample of 579 URLs, admittedly mostly government and organizational websites, 38 percent experienced link rot in the past five years. The study reinforced what archivists have said for years: archival repositories must create and enforce crawling policies.

Hence, archives staff should hold one or more collection policy discussions to determine the best fit for their repository. Once the focus is clarified, however, how do archivists collect sites? Appraisal through a favorite tool: crawling.

“Crawling results in a large amount of data that needs to be preserved or stored,” the team shared. “Be aware of the following caveats:

  • How much content duplicates other formats of records?
  • Does the website design allow duplicate content to be excluded from the crawl?
  • Do you have the ability to crawl without express permission?
  • Will crawling external links provide significant added value?
  • What is appropriate for you to collect and provide access to?
  • Is there significant value added in capturing the content of links?
  • Are comments, retweets, friend lists, followers and “likes” significant?
  • Does the capture method allow content to be included/excluded?
  • What tweaking will the archivist need to apply to the crawler?”

A repository’s crawling policy may differ on an individual website basis. The ladies advised, “It’s best to set a crawl cycle. Crawl before and after any redesign. Crawl on the day of a major event.” To track a crawling project, the reconciliation process can be administered with a tool as modest as an Excel spreadsheet with columns for website name and URL, contact information, last and next crawl dates, and next expected redesign.
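
The tracking spreadsheet described above can be sketched in a few lines of Python. This is only an illustration: the column names, the sample rows, and the 30-day "redesign is coming" window are assumptions made for the example, not details given in the workshop.

```python
import csv
import io
from datetime import date

# Hypothetical tracking sheet; column names and rows are illustrative only.
SHEET = """name,url,contact,last_crawl,next_crawl,next_redesign
Main site,https://example.org/,web@example.org,2012-05-01,2012-08-01,2012-09-15
Blog,https://blog.example.org/,blog@example.org,2012-07-20,2012-10-20,
"""

DATE_COLUMNS = ("last_crawl", "next_crawl", "next_redesign")


def load_sites(text):
    """Parse the sheet into a list of dicts, converting date columns."""
    sites = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in DATE_COLUMNS:
            # Empty cells (e.g. no redesign planned) become None.
            row[col] = date.fromisoformat(row[col]) if row[col] else None
        sites.append(row)
    return sites


def due_for_crawl(sites, today, redesign_window_days=30):
    """Return names of sites whose next crawl date has arrived, or whose
    expected redesign falls within the window (crawl before a redesign)."""
    due = []
    for site in sites:
        next_crawl = site["next_crawl"]
        crawl_due = next_crawl is not None and next_crawl <= today
        redesign = site["next_redesign"]
        redesign_soon = (
            redesign is not None
            and 0 <= (redesign - today).days <= redesign_window_days
        )
        if crawl_due or redesign_soon:
            due.append(site["name"])
    return due


sites = load_sites(SHEET)
print(due_for_crawl(sites, date(2012, 8, 7)))  # prints ['Main site']
```

A spreadsheet exported as CSV could be read the same way from a file; the point is only that the reconciliation step needs nothing more than dates and a comparison.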

Appraise Early, Appraise Often

Our moderators shared two schools of thought regarding social media in government: either state government agency sites are collected entirely or local government requires that sites be evaluated by the content on the site (and retention schedules apply to the content, not the media).

“Appraise each account often,” our instructors recommended.

One of the attendees asked, “What is important to document?”

“Think about it: how is the account used? What is its look and feel, its functionality? Your staff should determine what is appropriate: screen shots, crawls, or a combination of the two,” the speakers responded.

Save Early, Save Often

Crawling isn’t always effective. Some providers have their own export tool. Third party export tools are superior to the painful exercise of capturing PDF/A screenshots. Most social media accounts are frequently updated; check the terms of service for clauses related to how long data will be kept or remain easily accessible. In short: set a reasonable capture cycle (and be sure to capture just before an account is closed).

Policies, Statutes, Rules

At this time, very few legal decisions impede the progress of web archiving. Web archiving is generally considered fair use except when the robots.txt file is ignored or the page contains a “noarchive” meta tag in its header. To maintain a repository’s defensibility, read Terms of Service carefully because they change frequently.
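
As a rough illustration of the two signals mentioned above, the following sketch checks a URL against the text of a robots.txt file and looks for a "noarchive" directive in a page's robots meta tag. It uses only the Python standard library; the user agent name `example-archiver` and the sample robots.txt and HTML are made up for the example, and a real crawler would handle many more cases.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",")
            )


def has_noarchive(html):
    """Return True if the page asks not to be archived."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noarchive" in finder.directives


# Illustrative inputs, not taken from any real site.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = (
    '<html><head><meta name="robots" content="noindex, noarchive">'
    "</head><body>Hello</body></html>"
)

print(allowed_by_robots(ROBOTS, "example-archiver", "https://example.org/about.html"))
print(allowed_by_robots(ROBOTS, "example-archiver", "https://example.org/private/x.html"))
print(has_noarchive(PAGE))
```

Whether to honor these signals is a policy decision for the repository and its counsel; the code only shows that both are easy to detect before a capture.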

In 2009, the US General Services Administration negotiated federal agreements with social media sites, starting with Flickr, Vimeo, blip.tv, and YouTube. Some of the amendments, such as the one with Tumblr, allow for capture of content.

Advice: when managing public expectations, add disclaimers or link to disclaimers related to any policy beyond the social media provider’s terms of service. Add the disclaimer to the site. General Counsel is the archivist’s best friend; consult with them when designing the crawling policy. Consult with the owner of external websites prior to capturing. Work closely with the donor to honor their preferences in the donor agreement.

Recommended Web Archiving Tools

Based on their experience, the ladies recommended a list of tools:

Hint: take a look at the container format known as WARC, or Web ARChive file format. Published as the international standard ISO 28500:2009, it is an extension of the ARC format in use since 1996.
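
To give a feel for what a WARC container holds, here is a minimal sketch that serializes a single "resource" record by hand: a version line, a handful of named header fields, a blank line, the captured bytes, and two trailing line breaks. The function name and sample values are invented for illustration, and in practice a crawler or an established library would write these files rather than hand-rolled code.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type, when=None):
    """Serialize one WARC 'resource' record as bytes.

    `when` must be a timezone-aware datetime; it defaults to the current time.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is followed by two CRLFs.
    return head + payload + b"\r\n\r\n"


record = build_warc_resource_record(
    "https://example.org/",
    b"<html>hello</html>",
    "text/html",
    when=datetime(2012, 8, 7, 12, 0, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

A WARC file is a sequence of such records, which is why one file can hold an entire crawl along with metadata about how it was captured.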

Third Party Vendors

Finally, the ladies offered a series of tips and tricks to follow when working with third party vendors:

  • Determine what URLs you will use.
  • Determine how long you want the crawl to run.
  • Consider automated crawling.
  • Run tests first.
  • Monitor the crawl closely.
  • Third party tools may offer a series of helpful reports:
    • Host report
    • Seed-source report
    • Seed status report
    • PDF report
    • Mime type report
    • Video report 
  • Eliminate the capture of inappropriate content.
  • Your crawl will have a budget. Manage it appropriately.
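
For readers who have not seen such reports, the sketch below approximates a MIME type report and a simple budget check from a list of captured files. The capture log, its fields, and the idea of measuring the budget in bytes are all assumptions for illustration; a vendor's reports and budget units may differ.

```python
from collections import Counter

# Hypothetical capture log: (URL, MIME type, size in bytes) per captured file.
CAPTURES = [
    ("https://example.org/", "text/html", 12_000),
    ("https://example.org/about", "text/html", 8_000),
    ("https://example.org/report.pdf", "application/pdf", 450_000),
    ("https://example.org/logo.png", "image/png", 30_000),
]


def mime_type_report(captures):
    """Return (mime type, file count, total bytes) rows, largest first."""
    counts = Counter()
    sizes = Counter()
    for _url, mime, size in captures:
        counts[mime] += 1
        sizes[mime] += size
    rows = [(mime, counts[mime], sizes[mime]) for mime in counts]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def within_budget(captures, byte_budget):
    """Return True if the total captured size fits the (assumed) byte budget."""
    return sum(size for _url, _mime, size in captures) <= byte_budget


for mime, count, total in mime_type_report(CAPTURES):
    print(f"{mime}: {count} file(s), {total} bytes")
print(within_budget(CAPTURES, 1_000_000))
```

A report like this makes it easy to spot where a crawl's budget is going, for example a handful of large PDFs or videos outweighing thousands of HTML pages.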


SAA speakers are very good at mentoring in-class discussions and we had excellent fun discussing web archiving scenarios and fair use. Three cheers to our three instructors, who obviously walk the talk. Some of us were baffled that this class isn’t a part of SAA’s DAS certificate course; there must be an excellent reason. Next year, this class should be a hands-on lab. I’d attend a three-day pre-conference version if it were offered.

My compliments to SAA -- the two preconference sessions I attended this year for Beyond Borders were immediately useful.

The Annual Meeting of the Society of American Archivists continues through Saturday, August 11, in San Diego.

Editor's Note: You can read more about Mimi's experiences at #saa12 here:

-- Digital Forensics for Archivists #saa12


About the author

Mimi Dionne

Mimi Dionne is a records and information management project manager and Consultant/Owner of Mimi Dionne Consulting. She is a Certified Records Manager, a Certified Archivist, a Certified Document Imaging Architect, a Certified Information Professional, and a Project Management Professional.