The majority of data on the web and elsewhere is unstructured, meaning it has no predictable format that software can rely on to break it down into components for processing. So what do you do when you need to process such information?

OASIS, the Organization for the Advancement of Structured Information Standards, has approved Version 1.0 of the Unstructured Information Management Architecture (UIMA), which allows some level of structure to be assigned to the documents, email, speech, images and video produced during human communication.

According to OASIS, unstructured information is the largest and fastest growing source of knowledge for governments and businesses -- an estimated 80% of the information generated in the world. It is also the most current, so being able to access it matters if you want to run your business well.

While most of our attention has been on the OASIS CMIS standard process, another critical standard has been in development to support accessing unstructured data.

UIMA 101

UIMA is not a markup language. You make no individual changes to your blogs or other unstructured content. Instead, you build applications that analyze the content looking for specific semantics. The application then marks up the content for you -- perhaps with XML tags -- adding structure to your unstructured data.

For example, say that you have thousands of history-related artifacts, which is UIMA lingo for the text documents, word processor documents and audio recordings that make up your unstructured data. The first analytic (processing application) you might write to add structure to this data could look for locations, which software can often spot as long as the analytic has a database of countries, provinces, states, cities and so on, both current and historic.
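A location analytic of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the UIMA API: the `GAZETTEER` set and the `find_locations` function are hypothetical names, and a real analytic would draw on a far larger database of current and historic places.

```python
import re

# Hypothetical gazetteer; a real analytic would load a much larger
# database of current and historic place names.
GAZETTEER = {"Paris", "New York", "Prussia", "Constantinople"}

def find_locations(text, gazetteer=GAZETTEER):
    """Return (start, end, name) for each gazetteer entry found in text."""
    # Try longer names first so a multi-word name is preferred over
    # any shorter entry that starts at the same position.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

if __name__ == "__main__":
    sample = "The envoy left Constantinople for Paris."
    for start, end, name in find_locations(sample):
        print(start, end, name)
```

Exact string matching like this misses misspellings and ambiguous names, which is one reason the quality of the place-name database matters so much.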

As each document is analyzed, the analytic outputs artifact metadata. In this example, the output might be a collection of documents that are copies of the original, but with XML tags applied to the locations. Alternatively, the annotations may follow the stand-off annotation model, in which they are stored as external pointers into the original rather than as tags embedded in it.
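The difference between the two output styles is easier to see side by side. The sketch below is a simplified illustration rather than UIMA's actual data model: `tag_inline`, `stand_off` and the record fields are hypothetical names, and the spans are assumed not to overlap.

```python
from xml.sax.saxutils import escape

def tag_inline(text, spans, tag="location"):
    """Return a copy of text with each (start, end) span wrapped in an XML tag."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(escape(text[cursor:start]))  # untouched text before the span
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))  # remainder after the last span
    return "".join(out)

def stand_off(text, spans, kind="location"):
    """Return external records that point into the unchanged text by offset."""
    return [
        {"type": kind, "begin": start, "end": end, "covered": text[start:end]}
        for start, end in sorted(spans)
    ]

if __name__ == "__main__":
    doc = "Born in Paris"
    print(tag_inline(doc, [(8, 13)]))
    print(stand_off(doc, [(8, 13)]))
```

With stand-off records the original artifact is never modified, so several analytics can annotate the same text without having to parse each other's markup.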

UIMA analysis applications tend to be strung together, each doing a specialized job. So the location analytic might then pass the results to a date analytic that looks for and tags dates, and then the results might be passed to another that looks for and tags names. Each time through, more structure is added to the originally unstructured data, allowing further processing such as the automated building of historic maps and resources.
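The chaining described here might look like the following sketch, in which each analytic receives the text plus the annotations gathered so far and adds its own. The function names, the two-entry place list and the year-only date pattern are all assumptions made for illustration; none of this is the UIMA API.

```python
import re

def location_analytic(text, annotations):
    """Add a ("location", start, end) annotation for each known place name."""
    places = ("Vienna", "Paris")  # hypothetical stand-in for a real gazetteer
    for m in re.finditer("|".join(places), text):
        annotations.append(("location", m.start(), m.end()))
    return annotations

def date_analytic(text, annotations):
    """Add a ("date", start, end) annotation for each four-digit year 1000-1999."""
    for m in re.finditer(r"\b1[0-9]{3}\b", text):
        annotations.append(("date", m.start(), m.end()))
    return annotations

def run_pipeline(text, analytics):
    """Pass the text through each analytic in order, accumulating annotations."""
    annotations = []
    for analytic in analytics:
        annotations = analytic(text, annotations)
    return annotations

if __name__ == "__main__":
    doc = "The Congress of Vienna met in 1814."
    for kind, start, end in run_pipeline(doc, [location_analytic, date_analytic]):
        print(kind, doc[start:end])
```

Because every analytic shares the same signature, a name analytic or any other specialized step could be appended to the list without changing the others.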

By now you might realize that creating small, specialized analytics allows for a great deal of reuse, which will hopefully encourage analytic tool sharing. A strong community could rapidly bring structure to the unstructured and make it easier to find and use the data created by the majority of our communications.

How Do I Learn More?

"While OASIS has been developing the UIMA standard, the Apache Software Foundation has hosted an incubator project for UIMA-based open source software," noted Laurent Liscia, executive director of OASIS. "This is an exciting example of how open standards and open source development projects can complement one another to everyone's benefit."

To learn more about UIMA, how it's being used right now and its implications, see: