Auto-Classification: Friend or Foe of Taxonomy Management?

Ah, metadata and taxonomies: what’s not to love? Ask your content creators and you’ll be told loud and clear: having to apply said metadata and taxonomies to content through tagging.

Although rich and consistent indexing makes content more findable and easier to evaluate, nobody relishes the idea of tagging multiple fields, and it can take considerable effort to make it happen. The struggle is always to keep up with the ever-increasing flow of content. And even if you have a team of indexers, you still have issues of indexing consistency and scalability. One solution is to turn to auto-classification.

Auto-Classification Systems

Taxonomies and thesauri are the foundation of an auto-classifier. They provide the vocabulary against which rules are built and “teach” the machine how to “understand” and categorize content. Most systems require a rich thesaurus to function.

There are two broad types of auto-classification: statistical and rules-based.

Rules-based

Rules-based systems rely on simple Boolean (and, or, not) categorization rules to find either positive or negative evidence of a match to a category. For example, a rule for content about astronomy might require that the keyword “Mercury” is matched, but NOT in proximity to the words “car” or “chemical.”
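As a rough illustration of the Mercury rule, here is a minimal Python sketch. The function names, the five-word proximity window and the tokenizer are assumptions made for this example; they do not reflect the rule syntax of any particular product.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def near(tokens, index, blockers, window):
    """Return True if any blocker word occurs within `window` tokens of `index`."""
    start = max(0, index - window)
    end = index + window + 1
    return any(tok in blockers for tok in tokens[start:end])

def matches_astronomy(text, window=5):
    """Hypothetical rule: "mercury" AND NOT near ("car" OR "chemical")."""
    tokens = tokenize(text)
    blockers = {"car", "chemical"}
    for i, tok in enumerate(tokens):
        if tok == "mercury" and not near(tokens, i, blockers, window):
            return True  # positive evidence with no negative evidence nearby
    return False

print(matches_astronomy("Mercury is the closest planet to the sun."))  # True
print(matches_astronomy("Mercury is a toxic chemical element."))       # False
```

A wider window makes the rule more cautious (more content is excluded); a narrower one lets more content through.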

The success of rules-based systems is heavily dependent on the richness of the taxonomy and the collection of synonyms/keywords associated with each category. Weighting can be used to adjust how the system interprets the relevancy of a specific keyword to a specific category.
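Weighting could look something like the following sketch, in which each keyword contributes a score and a category is assigned only when the total reaches a threshold. The keywords, weights and threshold are invented for illustration.

```python
import re

# Hypothetical category definition: each keyword or synonym carries a weight
# reflecting how strongly it signals the category.
ASTRONOMY_KEYWORDS = {
    "planet": 1.0,
    "orbit": 1.0,
    "telescope": 1.5,
    "mercury": 0.5,  # ambiguous term, so it counts for less
}

def score(text, weighted_keywords):
    """Sum the weights of every keyword occurrence in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(weighted_keywords.get(tok, 0.0) for tok in tokens)

def classify(text, weighted_keywords, threshold=2.0):
    """Assign the category only when the evidence reaches the threshold."""
    return score(text, weighted_keywords) >= threshold

print(classify("Mercury has the shortest orbit of any planet.", ASTRONOMY_KEYWORDS))
print(classify("Mercury thermometers are being phased out.", ASTRONOMY_KEYWORDS))
```

Here the ambiguous keyword alone is not enough to trigger the category; it needs support from stronger terms.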

The challenge is in creating and/or tweaking the rules for each category and keyword combination -- which can be onerous for large taxonomies. There are, however, many tools that auto-create a starter set of rules based on your taxonomy, which you can then edit.
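The idea of generating starter rules from a taxonomy and then editing them by hand can be sketched as follows. The taxonomy, the rule format ("any_of" / "none_of") and the function names are hypothetical, and the sketch handles single-word terms only.

```python
import re

# Hypothetical taxonomy: each category maps to its label plus synonyms/keywords.
TAXONOMY = {
    "Astronomy": ["astronomy", "planet", "telescope"],
    "Chemistry": ["chemistry", "chemical", "element"],
}

def generate_rules(taxonomy):
    """Build one starter rule per category: match ANY term, exclude nothing yet."""
    return {
        category: {"any_of": sorted({t.lower() for t in terms}), "none_of": []}
        for category, terms in taxonomy.items()
    }

def apply_rules(text, rules):
    """Return the sorted categories whose rule matches the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    matched = []
    for category, rule in rules.items():
        has_positive = any(term in words for term in rule["any_of"])
        has_negative = any(term in words for term in rule["none_of"])
        if has_positive and not has_negative:
            matched.append(category)
    return sorted(matched)

rules = generate_rules(TAXONOMY)
# Manual tweak after generation: keep chemistry content out of Astronomy.
rules["Astronomy"]["none_of"].append("chemical")
print(apply_rules("A chemical element detected on a distant planet.", rules))
# prints ['Chemistry']
```

Without the manual tweak, the same sentence would match both categories, which is the kind of result that rule editing is meant to correct.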

Statistical

Statistical systems use word frequency and location to determine important or useful concepts. You have to train a statistical system using a representative sample set of documents (usually 50+) for each concept in your taxonomy. You feed it some example texts that you would classify in that category, and it analyzes the text to extract statistically significant keywords and patterns. In the Mercury example, the system would deduce that content about mercury the chemical should not be classified under astronomy because none of the astronomy training documents mentioned chemicals.
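To make the training idea concrete, here is a toy word-frequency classifier in Python. It uses a simple naive Bayes style score with add-one smoothing, which is a stand-in chosen for this example and not the algorithm of any commercial tool. The training texts are invented, and a real training set would need far more than two documents per category.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """examples maps category -> list of training texts. Returns (counts, vocab)."""
    counts = {cat: Counter() for cat in examples}
    for cat, docs in examples.items():
        for doc in docs:
            counts[cat].update(tokenize(doc))
    vocab = set()
    for words in counts.values():
        vocab.update(words)
    return counts, vocab

def classify(text, model):
    """Pick the category whose training word frequencies best fit the text."""
    counts, vocab = model
    best_cat, best_score = None, -math.inf
    for cat, words in counts.items():
        total = sum(words.values())
        score = 0.0
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((words[tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Hypothetical training set, far smaller than the 50+ documents usually needed.
model = train({
    "astronomy": ["Mercury is the planet closest to the sun",
                  "The planet Mercury completes an orbit of the sun every 88 days"],
    "chemistry": ["Mercury is a liquid metal element",
                  "Mercury is a toxic chemical used in old thermometers"],
})
print(classify("Mercury is a toxic liquid metal", model))
print(classify("The planet Mercury orbits the sun", model))
```

Because the word "mercury" appears in both training sets, it is the surrounding vocabulary that tips the decision one way or the other.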

This method can be challenging for a few reasons: it requires a lot of effort to create the training set, it relies on the availability of keyword-rich text, and when there is a problem with the classification results, it can be tricky to identify the source.

Auto-Classification & Taxonomy Management Tools

Auto-classification (or auto-indexing) is often found within larger toolsets, such as document management systems and search engines (e.g., Fast, Endeca, Exalead). However, these often only allow you to import a taxonomy and apply it to the content. You cannot manage your taxonomy in these systems, so you end up needing a second tool for vocabulary management.

Many of the popular taxonomy management suites include auto-classification modules. This allows you to manage the categories and keywords/synonyms against which you’ll be classifying together with the associated rules or training sets. With few exceptions, taxonomy tools with auto-classifiers are rules-based systems. They automatically create a set of rules based on the taxonomy stored within, creating or updating rules each time a term is added or changed in the vocabulary. This reduces the effort of managing the rules.

There are a number of tools on the market that fully merge taxonomy management and auto-classification, including:

Smartlogic’s Semaphore (rules-based)
Data Harmony’s MAIstro suite (rules-based)
SAS Enterprise Content Categorization & Ontology Management
Wordmap (statistical, mostly SDK approach)

Automated vs. Assisted

Auto-classification sounds good on paper -- why wouldn’t we want to have machines do our dirty work? The catch is accuracy, which typically ranges between 60% and 85%. It can get higher with very exhaustive training sets or rule bases, but the amount of investment required to achieve this accuracy diminishes your ROI. If higher accuracy is a requirement, you can also turn to “assisted” classification, where the system prompts human indexers with classification suggestions that they can accept, reject or enrich. This results in fewer errors (either incorrect categories or omissions) and faster indexing times, as some of the work is done by the machine and only needs to be validated.
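An assisted workflow might be sketched like this: the machine proposes its most confident categories, and the indexer's decisions produce the final tags. The confidence scores, the 0.6 cutoff and the function names are assumptions made for illustration.

```python
def suggest(scores, min_confidence=0.6, limit=3):
    """Return up to `limit` (category, confidence) pairs at or above the cutoff,
    strongest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(cat, conf) for cat, conf in ranked if conf >= min_confidence][:limit]

def review(suggestions, rejected=(), added=()):
    """Apply an indexer's decisions: drop rejected suggestions, then append
    any categories the indexer added by hand."""
    final = [cat for cat, _ in suggestions if cat not in rejected]
    for cat in added:
        if cat not in final:
            final.append(cat)
    return final

# Hypothetical classifier output for one document.
machine_scores = {"Astronomy": 0.82, "Chemistry": 0.65, "Automotive": 0.10}
suggestions = suggest(machine_scores)
print(suggestions)
print(review(suggestions, rejected={"Chemistry"}, added=["Space exploration"]))
```

The indexer only validates a short list instead of tagging from scratch, which is where the time savings come from.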

Conclusion

If you are interested in an auto-classification solution, consider addressing both your taxonomy and auto-classification needs with a single tool. Whether you choose a rules-based or statistical system, make sure you get the vendor to help you understand the level of effort required in building and tweaking the rules base or training set (which can be anywhere from one hundred to thousands of hours) and budget accordingly.

My next article will be on auto-classification options for SharePoint.
