Managing large volumes of documents and their metadata presents a challenge for any enterprise, but organizations such as financial institutions, which must retain huge numbers of documents for audit and compliance purposes, have it especially tough.

That’s because organizations that are bound by legal and regulatory requirements face a double burden: They must ensure that documentation is collected and shared properly in the first place, while also creating and applying a system of document descriptors that will be robust enough to guarantee that any given item can be accessed again — in a timely manner — from myriad boxes of papers and forms stored in multiple locations.

Oh, and did we mention that the necessary information might also be archived in the form of emails, webpages or social media posts?

But hey, no pressure.

You Say Tomato and I Say Tomato

To succeed at such a demanding yet open-ended task, organizations must develop a consistent system of metadata or document descriptors.

But that turns out to be a lot easier said than done.

The problem, highlighted in a landmark 1987 study by Bell Communications Research, is that “in information retrieval systems, the keywords that are assigned by indexers are often at odds with those tried by the searchers.”

In fact, in the Bell study, people used the same or similar search terms less than 20 percent of the time. What’s more, when study subjects assigned labels to information based on what lexicon made sense to them, others in the study failed to access the information up to 90 percent of the time.

The Catalog Approach Can’t Keep Up

Multiply that confusion by two decades and thousands of petabytes of new document-based information and the shortcomings of an employee-based catalog approach to curating metadata become painfully clear: Today’s enterprise staffers don’t have time, they don’t have a consistent approach, they’re not accurate and the metadata is changing too fast for them to master anyway.

What’s an organization to do?

Technology to the Rescue

Today’s technology holds great promise for automating both the initial design and the ongoing maintenance of document organization tasks in four key ways:

  1. Automated classification uses advanced machine learning to take on the burden of document organization.
  2. Natural language processing and other linguistic techniques enhance access to document content for purposes of metadata attribution.
  3. Faceted search helps users understand information hierarchies.
  4. Interactive search refines results by asking users clarifying questions such as “did you mean apple the fruit or Apple the company?”

Rightsize Your Chosen Solution

Yet the task of selecting a “just right” technology that neither overshoots nor undershoots your organization’s document management requirements can be as difficult as the problem itself.

Fortunately, though, organizations, especially financial institutions, can follow four steps to gain a clearer sense of the size and scope of the task ahead.

Get a Grip on the Problem

Start by understanding the key business processes and what documents are involved in them by conducting interviews on a department-by-department basis.

Next, rank your organization’s various processes in terms of how critical they are to your daily business operations. Then, inventory which key documents are used to support your most business-critical processes and what vital information needs to be captured from each document.

Integrating these results will give you a high-level process map that lists your organization’s document classes, prioritizes them and details how they are used.

Gather Department-Level Lexicons

Work with stakeholder departments to understand the terms they use for different information. Chances are, two departments will refer to the same document in different terms.

If your company has search software, ask your IT department for a list of the most commonly used search terms and the document links users click most often.

You’re aiming to find the “best possible names” for each document based on frequency of use. Collect as many “best possible names” as you can, with the goal of describing between 60 and 80 percent of your searches.
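To make the coverage goal concrete, here is a minimal Python sketch of one way to pick names by frequency of use. It assumes your search software can export a simple log with one entry per search; the function name `names_covering`, the sample log and the 60 percent target are illustrative assumptions, not features of any particular product.

```python
from collections import Counter


def names_covering(search_log, target=0.6):
    """Return the most frequent terms that together account for at
    least `target` of all logged searches, plus the share covered."""
    # Normalize case and whitespace so "W-9" and "w-9" count as one term.
    counts = Counter(term.strip().lower() for term in search_log)
    total = sum(counts.values())
    chosen, covered = [], 0
    for term, n in counts.most_common():
        if total == 0 or covered / total >= target:
            break
        chosen.append(term)
        covered += n
    return chosen, (covered / total if total else 0.0)


# Hypothetical search log, one entry per search.
log = ["W-9", "w-9", "loan application", "W-9", "kyc form",
       "loan application", "w-9", "account opening", "kyc form", "W-9"]

terms, coverage = names_covering(log, target=0.6)
print(terms, coverage)
```

Raising `target` toward 0.8 pulls in rarer terms, which is a quick way to see how many names you would need to maintain at each coverage level.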

Get Semantic

Use a thesaurus to augment the list you generated in your department-level research with synonyms that can be used to find the same information. The more related words you can find to describe documents, the better.

The objective is to generate a solid list of keywords within each department that also maps to variants that might be used, eliminating the need to interview everyone in every department of your organization.
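The mapping from variants to preferred keywords can be sketched as a simple lookup table. The following Python example is only an illustration under assumed data: the thesaurus entries and the helper names `build_variant_index` and `resolve` are hypothetical, and a real deployment would draw its synonyms from your own department research.

```python
def build_variant_index(thesaurus):
    """Map every preferred name and each of its variants
    (lowercased) back to the preferred name."""
    index = {}
    for preferred, variants in thesaurus.items():
        for term in [preferred, *variants]:
            index[term.strip().lower()] = preferred
    return index


def resolve(query, index):
    """Return the preferred name for a query, or None if unknown."""
    return index.get(query.strip().lower())


# Hypothetical department thesaurus: preferred name -> known variants.
thesaurus = {
    "loan application": ["credit application", "borrowing request"],
    "account statement": ["monthly statement", "statement of account"],
}

index = build_variant_index(thesaurus)
print(resolve("Credit Application", index))  # a variant of "loan application"
```

Queries that resolve to `None` are worth logging, since they point to terms people use that your lexicon does not yet cover.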

Size Up Your Current Needs

If your organization already has an enterprise content or document management system, you probably already have metadata fields and search options. Getting a handle on how your organization uses documents and how it references them during a given business process will make your next steps more obvious.

For example, if your document types are sufficiently controlled and your processes well enough established on the input side, all that’s needed may be a straightforward document capture and indexing solution. On the other hand, if your information varies widely and defies easy classification, you’ll know that a more advanced document classification capability will be necessary in order to handle the variance.

Similarly, on the retrieval side, you’ll be able to gauge whether your existing document hierarchy is highly structured with well-defined processes. If so, it can likely be managed with faceted search that provides document retrieval tools to help a user select the correct descriptions. For more sophisticated or broad-based retrieval needs, however, you may need a more advanced natural language processing (NLP) indexing and search platform capable of translating your user input into semantic queries.
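For readers who have not seen faceted search up close, the idea can be sketched in a few lines of Python. This is a toy illustration, not how any specific search platform works: the metadata fields (`department`, `doc_type`, `year`) and the sample records are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical document records with structured metadata fields.
documents = [
    {"title": "Mortgage application", "department": "Lending",
     "doc_type": "Application", "year": 2015},
    {"title": "Auto loan application", "department": "Lending",
     "doc_type": "Application", "year": 2016},
    {"title": "Loan agreement", "department": "Lending",
     "doc_type": "Contract", "year": 2016},
    {"title": "Audit report", "department": "Compliance",
     "doc_type": "Report", "year": 2016},
]


def facet_counts(docs, facet):
    """Count how many documents fall under each value of a facet."""
    return Counter(d[facet] for d in docs if facet in d)


def filter_by(docs, **selected):
    """Keep only documents matching every selected facet value."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in selected.items())]


# Show the user the available departments, then narrow step by step.
print(facet_counts(documents, "department"))
lending = filter_by(documents, department="Lending")
print(facet_counts(lending, "doc_type"))
```

Each selection narrows the result set and recomputes the remaining counts, which is what lets users pick the correct description from a list instead of guessing the right keyword.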

Title image "Storage" (CC BY 2.0) by bark