Spurred on by an online debate about the distinction between text analytics and semantic content enrichment, I turn in this article to the pressing question: "What does semantic content enrichment mean?"

As IBM's Marie Wallace remarked, it’s great to see the term semantic content enrichment generating discussion, although, she continued, "I suspect that most people still don’t differentiate it from just text analytics."

The Distinction

Oh, but there is a difference. Let’s explore it via the definitions that follow: first of text analytics, then of content analytics and finally of content enrichment. Then we'll see where the ensemble takes us.

First definition:

Text analytics is a set of software and transformational steps that discover business value in “unstructured” text. (Analytics in general is a process, not just algorithms and software.) The aim is to improve automated text processing, whether for search, classification, data and opinion extraction, business intelligence or other purposes.

To expand on this definition a bit, to bridge from text to the wider content world:

Text analytics draws on data mining and visualization and also on natural-language processing (NLP). Supplement NLP with technologies that recognize patterns and extract information from images, audio, video and composites and you have content analytics.

The concept of content enrichment is easy to grasp: Every link in this article -- Web links are accomplished via the HTML “a” anchor tag -- is a bit of content enrichment. And semantic content enrichment? Marie Wallace puts it this way, focusing on text but with concepts that extend to the broad set of content types:

When I think about semantic enrichment, I see it as transforming a piece of content into a linked data source. In order to do this you do indeed need text analytics for entity and relationship extraction, but you need more than that…. A text analytics engine might recognize that [Marie Wallace] is a person, [Ireland] is a place, and Marie comes from Ireland and annotate the entities/relationships found. However when doing semantic enrichment, I would want to convert those annotations to openly addressable URIs that contribute to the linked data cloud.

URIs are uniform resource identifiers, the Web-standard IDs that the Semantic Web relies on to name or locate things. Web URLs (e.g., http://whitehouse.gov/) are a type of URI.
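To make the step Wallace describes concrete, here is a minimal Python sketch that takes annotations of the kind a text analytics engine might emit and converts them into URI-based triples in N-Triples syntax. The `http://example.org/` namespace, the `comesFrom` relation and the annotation format are my assumptions for illustration, not any vendor's actual output.

```python
from urllib.parse import quote

# Hypothetical namespace; a real deployment would mint or reuse published URIs.
BASE = "http://example.org/"

def to_uri(kind, label):
    """Mint an addressable URI for an entity or relation label."""
    return f"{BASE}{kind}/{quote(label.replace(' ', '_'))}"

def enrich(annotations):
    """Turn (subject, relation, object) annotations into N-Triples lines."""
    lines = []
    for subj, rel, obj in annotations:
        s = to_uri("resource", subj)
        p = to_uri("property", rel)
        o = to_uri("resource", obj)
        lines.append(f"<{s}> <{p}> <{o}> .")
    return lines

# An annotation a text analytics engine might produce (format assumed).
annotations = [("Marie Wallace", "comesFrom", "Ireland")]
triples = enrich(annotations)
for line in triples:
    print(line)
```

The point of the sketch is the shift in output: the annotation stops being a label attached to one document and becomes a statement built from identifiers that other data sets could link to.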

Rather than write my own annotation elaboration, I’ll reuse one from Ontotext, a semantic-technology developer:

Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc. to a document or to a selected part in a text. It provides additional information (metadata) about an existing piece of data.

Semantic Annotation goes one level deeper:

  • It enriches the unstructured or semi-structured data with a context that is further linked to the structured knowledge of a domain.
  • It allows results that are not explicitly related to the original search.
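The two points above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Ontotext's implementation: the gazetteer, the `LOCATED_IN` table and the `http://example.org/` URIs are all hypothetical stand-ins for a real domain knowledge base.

```python
from dataclasses import dataclass

# Toy domain knowledge: entity URI -> containing entity URI (all hypothetical).
LOCATED_IN = {
    "http://example.org/resource/Dublin": "http://example.org/resource/Ireland",
}

@dataclass
class Annotation:
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    uri: str    # link into the structured knowledge of the domain

def annotate(text, gazetteer):
    """Attach a semantic annotation to the first occurrence of each known label."""
    found = []
    for label, uri in gazetteer.items():
        pos = text.find(label)
        if pos != -1:
            found.append(Annotation(pos, pos + len(label), uri))
    return found

def matches(query_uri, annotations):
    """True if the text mentions the query entity or something located in it."""
    for ann in annotations:
        if ann.uri == query_uri or LOCATED_IN.get(ann.uri) == query_uri:
            return True
    return False

text = "The conference was held in Dublin."
gazetteer = {"Dublin": "http://example.org/resource/Dublin"}
anns = annotate(text, gazetteer)

# The text never says "Ireland", yet a search for Ireland still finds it.
print(matches("http://example.org/resource/Ireland", anns))
```

A plain annotation would stop at the offsets and a label. The link from Dublin to Ireland is what lets a search return a result that is not explicitly related to the original query.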

The earliest specific semantic content enrichment reference I’ve encountered is in an Ontotext paper, Towards Semantic Web Information Extraction, presented at the 2003 International Semantic Web Conference (ISWC).

The paper covers work based on Ontotext’s Knowledge and Information Management (KIM) platform, which in turn relies on GATE (the General Architecture for Text Engineering, an open-source text-analysis framework and toolkit), Apache Lucene and other technologies. The Ontotext folks have other, related papers posted on the company Web site.

Complementary Processes

The Ontotext materials help explain the role text/content analytics can and should -- but doesn’t often enough -- play as a Semantic Web generator. The entities, concepts, events and other features discerned, via content analytics, in text and rich media not only enable smart content; they can also be loaded into knowledge bases (which I won’t get into here, other than to say that systems such as IBM Watson and Wolfram Alpha use them) and Semantic Web triple stores.
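For readers who haven't met a triple store, the idea can be sketched as a toy in-memory version in Python. The class, its method names and the `ex:` identifiers are hypothetical; production triple stores add indexing, persistence and a query language such as SPARQL.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

# Load features that content analytics might have discerned (values assumed).
store = TripleStore()
store.add("ex:Marie_Wallace", "rdf:type", "ex:Person")
store.add("ex:Ireland", "rdf:type", "ex:Place")
store.add("ex:Marie_Wallace", "ex:comesFrom", "ex:Ireland")

# Ask for everything the store knows about one entity.
for triple in store.match(subject="ex:Marie_Wallace"):
    print(triple)
```

Once extracted features sit in this shape, the same pattern query works regardless of which document, image or audio file a statement came from.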

There are other solution providers in the content analytics meets semantic annotation/enrichment game. In addition to IBM and Ontotext, they include HP Autonomy, MarkLogic, OpenText, Temis and the nascent, open-source IKS project. Other vendors offer enterprise-strength building blocks, for instance, SAS via the various SAS Text Analytics components.

I’m sold on this stuff given the business benefits for content producers and content consumers alike. These technologies -- and the interplay between analytics and semantics -- are key in making sense of the digital universe.
