Recent news about surveillance of communications metadata has propelled this normally arcane part of the information world to the top of the stack.
If you’re interested in learning more about metadata, there is no lack of resources: a Google search yields more than 100 million results and even Wikipedia uses 14 pages and 35 references just to explain metadata.
But if you’re an information creator, manager or publisher, all of this admittedly valuable information can seem like an impenetrable maze, especially if you’re facing the need to design your information resources and technology: lots about metadata but not so much about how to design and manage it for your particular need. The currently favored description of metadata as “data about data” is nominally correct but not very illuminating.
Metadata for Content Creators
Let me suggest that metadata may be viewed in two major functional categories, each one important in the design and implementation of information automation efforts:
- Metadata as data about the physical structure and processing of information files, for the most part ignoring the intellectual content of the files. Let’s call this “structural metadata.”
- Metadata as data about the intellectual content of information files with particular emphasis on ways to locate the content. Let’s call this category “finding aid metadata.”
Structural metadata is the more analyzed and described of the two categories, perhaps because it is the more tightly integrated with the technology used to create and process files. In truth, virtually everyone creates and uses this type of metadata. Word processors, for example, create and keep data about the dates, authors, revisers and accesses of every document they generate. MS Office even uses XML as its underlying data markup, for content as well as metadata and other properties. While this won’t help much in making your content usable, it makes the metadata in MS Office files easily accessible via standard XML software tools.
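To make the last point concrete, here is a minimal sketch in Python of reading that kind of metadata with nothing more than standard zip and XML tools. It assumes the document is an Office Open XML package (a .docx, for instance) that stores its core properties in a part named docProps/core.xml. The sample package built here is a simplified stand-in, not a complete Word file, and the function names are illustrative only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Simplified stand-in for the core.xml part (assumption: real files carry more fields).
SAMPLE_CORE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>A. Author</dc:creator>'
    '<cp:lastModifiedBy>R. Reviser</cp:lastModifiedBy>'
    '<dcterms:created>2013-06-01T12:00:00Z</dcterms:created>'
    '<dcterms:modified>2013-07-02T09:30:00Z</dcterms:modified>'
    '</cp:coreProperties>'
)


def make_sample_package():
    """Build an in-memory zip holding only the sample core-properties part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", SAMPLE_CORE_XML)
    buffer.seek(0)
    return buffer


def read_core_properties(package_file):
    """Return author, reviser and date fields from the package's core properties."""
    with zipfile.ZipFile(package_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    fields = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    # findtext returns None for any field the document does not carry.
    return {name: root.findtext(path, namespaces=NS) for name, path in fields.items()}


if __name__ == "__main__":
    for name, value in read_core_properties(make_sample_package()).items():
        print(f"{name}: {value}")
```

Passing the path of a real .docx file to read_core_properties in place of the sample package should work the same way, provided the file contains that part.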
What’s important about structural metadata is that because it is mostly computer-to-computer communication, users have less control over it when designing or planning for content creation. Perhaps the most important aspect of structural metadata for the content creator and manager is the selection of software that does a good job of creating it, both in open formats like XML and with a robust set of values, making the resulting files more transparent and valuable to downstream software and users.
Finding Aid Metadata
For content creators, managers and publishers, this is the meat of metadata. It has been said that a key component of information’s value lies in being able to find it. Melvil Dewey understood this when in the 1870s he devised a numerical method of classifying books and monographs by subject, a design sufficiently robust that we still use it -- the Dewey Decimal Classification (DDC), popularly known as the Dewey Decimal System.
While technology has changed the ground rules, the concept of finding aids is no less important than it was then, perhaps becoming even more so as the sea of content in which we live has grown exponentially. Finding a needle in the haystack, after all, becomes increasingly difficult as the haystack grows.
Historically, library finding aids depended on the creation of external cataloging and catalog cards or printed lists to lead users to the desired content. Books could be shelved in gross subject matter sections, but anything beyond that required a visit to the card or book catalog to find the proper Dewey subject classification and books that fell within it.
With the dawn of the Internet and electronic display of content, we could begin to merge the finding aids with the content itself: first by sequentially searching entire files, then by developing search engines capable of building and rapidly searching inverted word lists for entries that matched the user’s query.
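The inverted word list mentioned above can be sketched in a few lines of Python. This is an illustration of the general idea only, not a description of how any particular search engine works; the sample documents and function names are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


if __name__ == "__main__":
    documents = {
        "a": "Metadata is data about data",
        "b": "Finding aids lead users to content",
        "c": "Metadata helps users find content",
    }
    index = build_inverted_index(documents)
    print(sorted(search(index, "users content")))
```

The index is built once, so each query only looks up the words it contains instead of rereading every file, which is what made this approach so much faster than sequential search.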
As the sheer size of the Internet grew, however, search engines needed more detailed and specific information to perform, sometimes including terminology not present in the actual content. To deal with this, concepts like keywords in files, external search terminology, usage pattern analysis and other techniques came into use.
At the same time, users, dealing with massive search hit lists, began agitating for more direct ways of finding what they wanted. In short, they wanted to navigate to their desired information rather than searching, however efficiently, against entire collections of content. Concepts like “faceted” cataloging and search came into use as a means to organize content to allow users to navigate down defined paths or “facets” to the content they are seeking.
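A rough sketch of faceted navigation, again in Python, may help here. The catalog records and the facet names (kind, topic, year) are hypothetical; the point is only that each record carries finding aid metadata that lets a user narrow the collection one step at a time.

```python
from collections import Counter

# Hypothetical catalog: every record carries its own facet values.
CATALOG = [
    {"title": "Pump Installation Guide", "kind": "manual", "topic": "installation", "year": 2012},
    {"title": "Pump Safety Bulletin", "kind": "bulletin", "topic": "safety", "year": 2013},
    {"title": "Valve Installation Guide", "kind": "manual", "topic": "installation", "year": 2013},
    {"title": "Valve Safety Manual", "kind": "manual", "topic": "safety", "year": 2013},
]


def narrow(records, **facets):
    """Keep only the records whose metadata matches every chosen facet value."""
    return [
        record for record in records
        if all(record.get(name) == value for name, value in facets.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in records if facet in record)


if __name__ == "__main__":
    manuals = narrow(CATALOG, kind="manual")
    print(facet_counts(manuals, "topic"))
    for record in narrow(manuals, topic="installation", year=2013):
        print(record["title"])
```

Showing the counts at each step is what lets users see where a path leads before they take it, rather than wading through a long hit list afterward.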
In all of this, metadata grew in importance.
These evolving techniques ask a lot of the content creator, often well beyond what might be expected. So if you’re one of those organizations with a need to make your content easier and quicker to find, then the questions remain: “how do I do it?” and “what will it cost me?”
What follows are some thoughts on those questions, from an operational perspective.
Content: How Important is it, Really?
Many technologists view content as raw material for those miraculous technological tools, spending little time on what makes the best and most valuable content resource. XML, for example, is normally referred to as “semi-structured” if not downright “unstructured,” and software processes that can make acceptable content out of word processing and raw text files often ignore the fact that “acceptable” usually isn't good enough outside the laboratory.
But content is often the only repository of the collective information value within the organization, and what is not captured initially may be severely marginalized. So the approach to content for information creators should be first, to decide what is valuable in the organization’s information environment; then, how best to capture it so that it may be used to its full extent down the line; and, finally, to select or design a content format, including metadata, capable of translating that value into information products that fully justify their cost.
In a previous column, I laid out an approach to this kind of content-centered effort, and I can recommend it from experience.
Finding Aids: Where to Put Them
If we accept the primacy of content in the value proposition for most information-based organizations, we have by definition accepted the importance of finding aids in harvesting that value.
Broadly speaking, finding aid data can be stored in one of two ways: embedded within the content itself, or created separately and then either inserted at the top of the files or linked to them. With word processing or binary media files, the separate approach is often the only path open to the creator, with finding aids stored in external databases that point to their content. But more and more, content is being represented in non-binary forms like XML -- capable of embedding the needed finding aid data at their points of relevance within the files themselves. This type of file architecture effectively marries the information value and the paths to its easy location.
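As a small illustration of the embedded approach, the Python sketch below parses an XML document in which each section carries its own finding aid terms and builds a lookup from term to section. The element names (manual, section, meta, term) are made up for the example and do not come from any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical content file: finding aid terms sit inside the sections they describe.
SAMPLE = """<manual>
  <section id="s1">
    <meta><term>installation</term><term>safety</term></meta>
    <title>Mounting the pump</title>
    <para>Bolt the pump to the base before connecting power.</para>
  </section>
  <section id="s2">
    <meta><term>maintenance</term></meta>
    <title>Replacing the seal</title>
    <para>Drain the housing, then remove the old seal.</para>
  </section>
  <section id="s3">
    <meta><term>safety</term></meta>
    <title>Lockout procedure</title>
    <para>Disconnect power before opening the housing.</para>
  </section>
</manual>"""


def collect_finding_aids(xml_text):
    """Map each embedded term to the ids of the sections that carry it."""
    root = ET.fromstring(xml_text)
    lookup = {}
    for section in root.iter("section"):
        for term in section.findall("meta/term"):
            lookup.setdefault(term.text.strip(), []).append(section.get("id"))
    return lookup


if __name__ == "__main__":
    for term, section_ids in collect_finding_aids(SAMPLE).items():
        print(term, section_ids)
```

Because the terms travel with the content, the same file can feed a search index, a navigation tree or a printed index without consulting a separate database.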
So the content creator/manager should organize its efforts to create content with the maximum possible information value and the easiest path for its delivery to prospective users, including receptivity to metadata as a major design consideration.
Standards: A Mixed Blessing?
What’s not to like about standards?
As with most human execution of laudable concepts, all is not rosy with metadata standards. A well-known information professional often said that “the greatest thing about standards is that there are so many to choose from.” Unfortunately, there is little general agreement among industry and public sector groups on what content, including metadata, should look like, and the metadata standards communities have tended to focus on their particular sets of needs.
While any standard, however broad, must support the specifics of each community’s needs, if those specifics begin at too high a level, no common platform develops to allow adopters to work in a common, although customizable, environment. In the end, a standard is valuable less for its elegance than for the breadth of its adoption, and the information graveyard is littered with standards that, although brilliant in their construction, just didn't catch on.
So creators and managers of information content should look very closely at the standards for both their content and metadata -- and there are a number.
Standardize if You Can, but be Prepared for Change
If there is a widely used standard relevant to the kind of content being developed, by all means adopt it. But remember, your metadata standard should be compatible with your content standard so the two don’t end up fighting each other. Moreover, always assume that the standards will -- as they always do -- evolve, and choose your solutions so that this evolution won’t invalidate what you have put in place.
Where possible, adopt standards that use the most open forms: XML is a worthy goal because it is easy to generate, flexible and efficient to process with low-cost tools.
To the greatest extent possible, involve your content creators in the finding aid metadata development process, using standards to build as much value into your content as you possibly can. Valuable and easily processed content is like money in the bank.
Despite the temptation in some parts of the technological community to view metadata as the main course, it is best dealt with as a valuable seasoning to be added to the real entrée: the content and its intellectual value. If we can keep this recipe in the right order and use the ingredients carefully, we can serve up a truly delectable information meal.
Failing that, it could be all fast food.
Editor's Note: Hungry for more of Barry's insights? Read his Bridging the Content Provider / IT Divide