Quick Start Guide to Archiving Documents in the Cloud

What is an Archive?

An “archive” is a system that at minimum:

  1. Securely stores documents (here I use “documents” as shorthand for user content, including email, docs, social media and web pages)
  2. Retains the documents as long as needed
  3. Purges documents when they are no longer needed for legal, compliance or business purposes
  4. Provides authorized users (internal and external) with access to the documents for various purposes (e.g. for business processes, customer service, customer or agent self-service, and discovery)

In service of the above requirements, archives typically include deduplication, indexing and some e-discovery capabilities.

If you look at the way your company and your peers' companies have done archiving in the past, you can see how it has evolved. For unstructured data or content (“documents”), most archiving was historically focused on fixed content like system-generated output (such as statements, EOBs, correspondence). Images were also included.

When email became an enterprise concern due to its volume and risk, it was addressed as a content type requiring archiving. Email archiving has not been an unqualified success -- and I tell that story below. More recently -- again because of the entailed volume and risk -- the chaotic swamp of dynamic documents (like Microsoft Office docs), web content and collaboration content started to be archived, along with other forms of e-communications like instant messages.

Which brings us to today, when most large companies are interested in archiving all of the above, plus transactional, structured data from business systems, plus rich media like audio and video, plus on occasion entire applications.

A Lesson from the History of Email Archiving

It’s important to understand what you want your archive to do, since there are lots of options out there and you need a good fit.

Let me tell you the story of email archiving to put things in perspective. In the early 2000s a lot of vendors from the ECM space tried to move into the email and related archiving space. How hard could email management be? So they tried to use their general ECM capabilities for archiving and add RM capabilities -- thus providing more features and functions than the less fancy pure-play archives were offering. But the ECM vendors couldn’t do the basic blocking and tackling for email archiving. They failed at all four points above:

  • They couldn't scale to handle the numbers of users and mailboxes (1)
  • They failed to provide reliable, fast access to users who wanted to find and retrieve older emails and attachments (4)
  • Some of them “lost” attachments (1, 2 and 4)
  • They failed to provide reliable disposition -- because users defected and squirrelled away emails, not trusting the enterprise archive to do its advertised job (3)

So many organizations dumped their ECM-based archive approaches and went back to the archive specialists, who were able to scale, etc.

Archiving now offers many more options than it did 12 years ago. You can archive everything from social media chats to web pages to movies to old-fashioned email and mainframe print streams. You can use the archive for compliance, for active use in complex and demanding business processes, for beyond-the-firewall customer access and participation, and for rigorous e-discovery.

These are all very different scenarios with different requirements. And -- in a nod to this article’s focus -- you can do it in-house or via the cloud. So you have to be clear about what you want the archive for.

What Should Your Archive Do?

Start with these key general requirements for archiving. You will weight these according to your situation, and will probably insert additional, more specialized requirements, such as compliance supervision (e.g. for financial services), advanced e-discovery, focus on particular file types (IM, GroupWise, video, web page archiving, Salesforce.com), etc. The most important high-level requirements for enterprise document archiving are:

  1. Scalability and Performance
  2. Accessibility and Availability
  3. Security and Protection
  4. Retention and Integrity
  5. Disposition
  6. Integration

Let’s address each briefly in turn.

1. Scalability and performance

The archive should handle the volumes of ingestion within the time windows necessary to provide your business with access to relevant documents when you need them within your business processes. In addition, the archive should provide reasonable response times for document search and retrieval, and the solution should have the ability to perform ingestion and archive functions without negatively impacting overall system performance for users.

2. Accessibility and availability

The archive should provide a mechanism for authorized users to search for and retrieve documents. In addition, the archive should provide the ability for certain external users to retrieve documents, such as e-presentment for customers or agents.

This requirement is very important -- not just for the obvious reason that you want to get the right information to the right (authorized) persons at the right time -- but because messing this up will sink your hopes for using the archive for defensible disposal. If you don’t provide fast (enough), reliable access to documents, your users will defect and squirrel away their emails, social media objects and other items. And not only will you be unable to implement a defensible purge strategy -- you’ll also have the very difficult challenge of winning the defectors back once you’ve lost them.

3. Security and protection

The archive should have the ability to restrict access to documents, such as documents that are private, confidential, privileged, secret or essential to business continuity. This may include requirements for encryption of stored content. Some vendors are getting sophisticated about this, providing double-blind key architectures, with keys held only by the customer for enhanced data privacy and security in the cloud.

4. Retention and integrity

This may seem obvious, but the archive should be able to retain documents for defined periods of time, taking into account legal, regulatory, fiscal, operational and historical requirements. In addition, the archive should provide a suitable guarantee of authenticity. And finally (if this applies to you), the archive should provide the ability to retain information on an unalterable storage platform when needed (e.g., WORM storage for SEC Rule 17a-4 compliance).
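
One common way to back up an authenticity guarantee is a fixity check: record a cryptographic digest when a document is ingested, then recompute and compare it later. The sketch below is only an illustration under my own assumptions, not a description of any particular archive product. The names ArchivedDocument, ingest and verify are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedDocument:
    """Hypothetical record: stored bytes plus the digest taken at ingestion."""
    doc_id: str
    content: bytes
    sha256_at_ingest: str


def ingest(doc_id: str, content: bytes) -> ArchivedDocument:
    # Record the digest once, at the moment the document enters the archive.
    digest = hashlib.sha256(content).hexdigest()
    return ArchivedDocument(doc_id, content, digest)


def verify(doc: ArchivedDocument) -> bool:
    # Recompute the digest and compare it with the one recorded at ingestion.
    return hashlib.sha256(doc.content).hexdigest() == doc.sha256_at_ingest
```

A digest like this only shows that the bytes have not changed since ingestion. It complements unalterable (WORM) storage rather than replacing it.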

5. Disposition

The complement of ensured retention is defensible disposition: the archive should support purging when your documents hit their defined retention periods -- and these should be both time- and event-based.

Event-based retention periods are much more complex than time-based retention periods. When you start doing electronic records management -- even with a more sophisticated ECM or RM system -- you'll find that you'll have to dramatically simplify your retention schedule if it’s going to work. This is often a 10-fold reduction -- from insane 2,000-series schedules to 200 series or fewer. Then try to reduce the event-based triggers. One best practice is to combine related time- and event-based triggers and associate them with a single long-term event. You will be retaining some records longer than you would in an ideal world, but you'll get the job done.

This advice applies to archives, especially when you start out. Start with big general categories and then ratchet up the granularity. In addition, the archive should support a formal approval process before purging, and it should support override of purging in cases where documents are under legal hold. Finally, the archive should enable authorized staff to periodically review and potentially modify retention periods.
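
The disposition rules described above (time- and event-based retention, approval before purging, and legal hold override) can be sketched as a simple eligibility check. This is a minimal illustration under stated assumptions, not how any specific archive implements disposition. The Record class, its field names, and the functions retention_expiry and can_purge are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Record:
    """Hypothetical archive record; every field name here is an assumption."""
    doc_id: str
    retention_days: int
    created: date
    event_based: bool = False             # True: the clock starts at a trigger event
    trigger_event: Optional[date] = None  # None: the event has not occurred yet
    legal_hold: bool = False
    purge_approved: bool = False          # set by the formal approval process


def retention_expiry(rec: Record) -> Optional[date]:
    """Return the date retention ends, or None if the clock has not started."""
    if rec.event_based:
        if rec.trigger_event is None:
            return None
        start = rec.trigger_event
    else:
        start = rec.created
    return start + timedelta(days=rec.retention_days)


def can_purge(rec: Record, today: date) -> bool:
    # A legal hold overrides everything else.
    if rec.legal_hold:
        return False
    expiry = retention_expiry(rec)
    # Not purgeable while the retention period is open or has not started.
    if expiry is None or today < expiry:
        return False
    # Even an expired record waits for formal approval.
    return rec.purge_approved
```

Note that an event-based record whose trigger has not occurred is never eligible, which is one reason a schedule with many event-based triggers is harder to run than a time-based one.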

6. Integration

The archive should have a standards-based architecture and open API that allows integration with other systems or middleware components, including existing legacy systems in use at your company. Cloud-based archiving may provide you with two challenges with respect to integration that you should be aware of.

First, you have to figure out how you’re going to integrate with the off-prem archive. And second, if you are going with a service provider, the archive technology under the hood may not be visible to you. You should dig into it and see what they’re using -- it’s a big deal for most of the requirements I’ve discussed, such as scalability.

On Premises vs. Cloud-based Archiving

We've outlined the general requirements for archiving. Now for a quick look at the general pros and cons of on-premises versus cloud-based archiving.

On-Prem Archiving Solutions

Many of the on-prem solutions are purpose-built for archiving, and most of them focus on archiving files and email for “defensive” and IT purposes. But some of them came from the mainframe output archive and retrieval space, or the enterprise application document and data archiving space.

Pros

  • Many of these solutions are very mature and robust. I sometimes believe that when all human life has vanished from the earth, some of these systems will still be running, ensuring retention and defensibly disposing on schedule.
  • The best ones can also scale wonderfully and can address very complex scenarios -- with lots of integrations with upstream and downstream systems, more than 100k users, and more than a petabyte of objects.

Cons

  • There’s a lot of what we call vendor and product risk: many of the vendors are from an older era and are clinging to what they know without sufficient footprint to support product development. Many started with one kind of archiving -- like mainframe output -- and have spread to others (like all types of content, or e-discovery, or social content) but they aren't particularly good at it. This echoes the email archiving story above.
  • The in-house solutions typically require a lot of resources to get rolling. You’re investing in all the infrastructure, and many of them are complex. Some of this complexity is good in that it allows them to address your complex archiving needs, but some stems from bad or antiquated design, or is overkill if your needs are modest or you’re not a big company.
  • Finally, some in this category don’t have the granular RM capabilities you may want to grow into. To get any event-based retention, for example, you might have to create a Frankenstein solution with other products and vendors.

Cloud-based Archiving Solutions

These are hosted solutions for archiving large volumes of content, under a subscription-based model that may include charges based on volumes stored, numbers of users, retrieval volumes, etc.

Pros

  • There’s a good business case for them, both in terms of upfront costs and savings over the next several years.
  • It’s a growing market. By no means is it consolidating -- the influx of newer entries is still outpacing acquisitions and disappearances.
  • It’s less resource-intensive to implement and maintain -- and because of this, there’s less risk of implementation failure. I believe most organizations and vendors wildly underestimate the likelihood and negative impact of failed deployments, so I find this advantage very important.
  • They can provide you with scalability and flexibility -- when all things are considered, and even though you can’t customize your archive like an arts and crafts project.

Cons

  • There are some disadvantages, though many are no longer true, particularly for the better solutions. Security has gotten much better, for example, and is often better than whatever you may do onsite.
  • Other disadvantages apply to the on-prem approach as well, but are more dangerous for cloud-based archiving because it’s newer and sometimes less clear what you’re getting into. For example, some solutions won’t meet your security and accessibility requirements (despite what I just said) -- and determining whether the provider can do so will require more diligence than with the other approaches because there isn’t a clear track record.
  • Many may not meet your functional requirements: as with the on-prem archiving vendors, many of these vendors are spreading beyond general archiving or moving into other areas (like email management, e-discovery, social) -- so again, dig to make sure they have a proven track record of providing your kind of archiving in full production.

Archiving is the oldest and most mature ECM technology to be offered off premises. It dates back to the 1990s, and we have many clients who have been doing it successfully for more than 10 years. It’s definitely worth considering -- but be sure to assess your specific archiving requirements before pursuing this approach.
