Ah, metadata and taxonomies: what’s not to love? Ask your content creators and you’ll be told loud and clear: having to apply said metadata and taxonomies to content through tagging.
Although rich and consistent indexing makes content more findable and easier to evaluate, nobody relishes the idea of tagging multiple fields, and it can take considerable effort to make it happen. The struggle is always to keep up with the ever-increasing flow of content. And even if you have a team of indexers, you still have issues of indexing consistency and scalability. One solution is to turn to auto-classification.
Taxonomies and thesauri are the foundation of an auto-classifier. They provide the vocabulary against which rules are built and “teach” the machine how to “understand” and categorize content. Most systems require a rich thesaurus to function.
There are two broad types of auto-classification: statistical and rules-based.
Rules-based systems rely on simple Boolean (AND, OR, NOT) categorization rules to find either positive or negative evidence of a match to a category. For example, a rule for content about astronomy might require that the keyword “Mercury” is matched, but NOT in proximity to the words “car” or “chemical.”
The success of rules-based systems is heavily dependent on the richness of the taxonomy and the collection of synonyms/keywords associated with each category. Weighting can be used to adjust how the system interprets the relevance of a specific keyword to a specific category.
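To make the Mercury rule and the idea of weighting concrete, here is a minimal sketch in Python. It is not how any particular product works: the rule format, the keyword weights, the five-word proximity window and the threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical rule: keyword weights, proximity exclusions and threshold
# are illustrative values, not taken from any real tool.
ASTRONOMY_RULE = {
    "category": "Astronomy",
    "keywords": {"mercury": 1.0, "planet": 0.5, "orbit": 0.5},
    "not_near": {"mercury": {"car", "chemical"}},
    "threshold": 1.0,
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(text, rule, window=5):
    """Sum the weights of keywords that occur at least once without an
    excluded word within `window` tokens on either side."""
    tokens = tokenize(text)
    total = 0.0
    for keyword, weight in rule["keywords"].items():
        blocked = rule["not_near"].get(keyword, set())
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            nearby = tokens[max(0, i - window): i + window + 1]
            if blocked.isdisjoint(nearby):
                total += weight
                break  # count each keyword only once
    return total

def classify(text, rule):
    """Return True when the weighted evidence reaches the rule's threshold."""
    return score(text, rule) >= rule["threshold"]

planet_text = "Mercury is the closest planet to the sun."
car_text = "The Mercury car dealership closed last year."
```

With these assumed weights, `classify(planet_text, ASTRONOMY_RULE)` is true, while `classify(car_text, ASTRONOMY_RULE)` is false because “car” sits next to “Mercury.” Raising or lowering a keyword's weight is the knob that changes how strongly it counts as evidence for the category.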
The challenge is in creating and/or tweaking the rules for each category and keyword combination -- which can be onerous for large taxonomies. There are, however, many tools that auto-create a rule set based on your taxonomy, which you can then edit.
Statistical systems use word frequency and location to determine important or useful concepts. You have to train a statistical system using a representative sample set of documents (usually 50+) for each concept in your taxonomy. You feed it some example texts that you would classify in that category, and it analyzes the text to extract statistically significant keywords and patterns. In the Mercury example, the system would deduce that content about mercury the chemical should not be classified under astronomy, because none of the astronomy training documents mentioned chemicals.
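As a rough illustration of the training idea, the sketch below builds a term-frequency profile from example documents and compares new text to it using cosine similarity. This is only one simple statistical technique standing in for whatever a real product uses; it ignores word location, the three training sentences are far short of the 50+ documents mentioned above, and the stopword list and 0.4 cut-off are assumptions.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "is", "and", "to", "of", "an", "a", "at", "or", "every", "in"}

def terms(text):
    """Lowercase, tokenize and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def train(documents):
    """Build a category profile: total term counts across the training set."""
    profile = Counter()
    for doc in documents:
        profile.update(terms(doc))
    return profile

def similarity(text, profile):
    """Cosine similarity between a document and the category profile."""
    doc = Counter(terms(text))
    dot = sum(count * profile[term] for term, count in doc.items())
    doc_norm = math.sqrt(sum(c * c for c in doc.values()))
    profile_norm = math.sqrt(sum(c * c for c in profile.values()))
    norm = doc_norm * profile_norm
    return dot / norm if norm else 0.0

# Illustrative training set for a hypothetical "Astronomy" category.
astronomy_profile = train([
    "Mercury is the smallest planet and orbits closest to the sun",
    "The planet Mercury completes an orbit of the sun every 88 days",
    "Astronomers observe Mercury near the sun at dawn or dusk",
])

THRESHOLD = 0.4  # assumed cut-off; a real system would tune this
planet_score = similarity("Mercury is a planet that orbits the sun", astronomy_profile)
chemical_score = similarity("Mercury is a toxic chemical used in old thermometers", astronomy_profile)
```

Here the planet sentence scores well above the assumed threshold and the chemical sentence well below it, because the only term the latter shares with the training set is “mercury.” It also hints at why troubleshooting is hard: the outcome depends on the whole training set rather than on one visible rule.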
This method can be challenging for a few reasons: it requires a lot of effort to create the training set, it relies on the availability of keyword-rich text, and when there is a problem with the classification results, it can be tricky to identify the source.
Auto-Classification & Taxonomy Management Tools
Auto-classification (or auto-indexing) is often found within larger toolsets, such as document management systems and search engines (e.g., Fast, Endeca, Exalead). However, these often only allow you to import a taxonomy and apply it to the content. You cannot manage your taxonomy in these systems, so you end up needing a second tool for vocabulary management.
Many of the popular taxonomy management suites include auto-classification modules. This allows you to manage the categories and keywords/synonyms against which you’ll be classifying together with the associated rules or training sets. With few exceptions, taxonomy tools with auto-classifiers are rules-based systems. They automatically create a set of rules based on the taxonomy stored within, creating or updating rules each time a term is added or changed in the vocabulary. This reduces the effort of rule management.
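The idea of deriving rules from the vocabulary can be sketched in a few lines of Python. The taxonomy structure, the `build_rules` helper and the “match any label or synonym” rule shape are all hypothetical; commercial tools generate richer rules than this.

```python
import re

# Hypothetical taxonomy: preferred label -> synonyms.
taxonomy = {
    "Astronomy": {"synonyms": ["planetary science", "astrophysics"]},
    "Automobiles": {"synonyms": ["cars", "motor vehicles"]},
}

def build_rules(taxonomy):
    """Generate one OR-rule per category from its label and synonyms."""
    rules = {}
    for category, entry in taxonomy.items():
        phrases = {category.lower()}
        phrases.update(s.lower() for s in entry.get("synonyms", []))
        rules[category] = {"any_of": sorted(phrases)}
    return rules

def classify(text, rules):
    """Return categories whose rule matches a whole word or phrase in the text."""
    lowered = text.lower()
    matched = []
    for category, rule in rules.items():
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in rule["any_of"]):
            matched.append(category)
    return sorted(matched)

rules = build_rules(taxonomy)
before = classify("A stargazing guide for beginners", rules)

# Editing the vocabulary and regenerating picks up the new term automatically.
taxonomy["Astronomy"]["synonyms"].append("stargazing")
rules = build_rules(taxonomy)
after = classify("A stargazing guide for beginners", rules)
```

In this sketch `before` is empty and `after` contains “Astronomy”: the only maintenance step was adding a synonym to the taxonomy, which is the effort saving described above.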
There are a number of tools on the market that fully merge taxonomy management and auto-classification, including:
- Smartlogic’s Semaphore (rules-based)
- Data Harmony’s MAIstro suite (rules-based)
- SAS Enterprise Content Categorization & Ontology Management
- Wordmap (statistical, mostly SDK approach)
Automated vs. Assisted
Auto-classification sounds good on paper -- why wouldn’t we want to have machines do our dirty work? The catch is accuracy, which typically ranges between 60% and 85%. It can get higher with very exhaustive training sets or rule bases, but the amount of investment required to achieve this accuracy diminishes your ROI. If higher accuracy is a requirement, you can also turn to “assisted” classification, where the system prompts human indexers with classification suggestions that can be reviewed and then accepted, rejected or enriched. This results in fewer errors (either incorrect categories or omissions) and faster indexing times, as some of the work is done by the machine and only needs to be validated.
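The assisted workflow can be pictured as two steps: the machine proposes, the indexer disposes. The Python sketch below assumes the classifier returns a score per category; the `propose` and `apply_review` helpers, the scores and the 0.3 minimum are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    category: str
    score: float

def propose(scores, min_score=0.3):
    """Turn raw classifier scores into ranked suggestions for a human indexer."""
    candidates = [Suggestion(c, s) for c, s in scores.items() if s >= min_score]
    return sorted(candidates, key=lambda s: s.score, reverse=True)

def apply_review(suggestions, rejected=(), added=()):
    """Final tags: suggestions the indexer kept, plus any categories they added."""
    kept = [s.category for s in suggestions if s.category not in rejected]
    return kept + [c for c in added if c not in kept]

# Hypothetical classifier output for one document.
scores = {"Astronomy": 0.82, "Chemistry": 0.41, "Automobiles": 0.05}
suggestions = propose(scores)

# The indexer rejects one suggestion and enriches the record with another tag.
final_tags = apply_review(suggestions, rejected={"Chemistry"}, added=["Space exploration"])
```

The indexer sees “Astronomy” and “Chemistry” as suggestions, rejects the second and adds a tag the machine missed, leaving `final_tags` as “Astronomy” and “Space exploration.” The human still decides, but starts from a shortlist instead of a blank form.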
If you are interested in an auto-classification solution, consider addressing both your taxonomy and auto-classification needs with a single tool. Whether you choose a rules-based or statistical system, make sure you get the vendor to help you understand the level of effort required to build and tweak the rule base or training set (which can be anywhere from one hundred to thousands of hours) and budget accordingly.
My next article will be on auto-classification options for SharePoint.