OASIS Approves Open Standard for Unstructured Information Access
The majority of data on the web and elsewhere is unstructured, meaning that it does not follow a predictable format that software can easily break down into components for processing. So what do you do when you need to process such information?
OASIS, the Organization for the Advancement of Structured Information Standards, has approved Version 1.0 of the Unstructured Information Management Architecture (UIMA), which allows some level of structure to be assigned to the documents, email, speech, images and video produced during human communication.
According to OASIS, unstructured information is the largest and fastest-growing source of knowledge for governments and businesses, accounting for an estimated 80% of the information generated in the world. It is also the most current, which makes access to it essential for running a business well.
While most of our attention has been on the OASIS CMIS standard process, another critical standard has been in development to support accessing unstructured data.
UIMA 101
UIMA is not a markup language. You make no individual changes to your blogs or other unstructured content. Instead, you build applications that analyze the content looking for specific semantics. The application then marks up the content for you — perhaps with XML tags — adding structure to your unstructured data.
For example, say that you have thousands of history-related artifacts, which is UIMA lingo for the text documents, word processor documents and audio recordings that make up your unstructured data. The first analytic (processing application) you write to add structure to this data might look for locations. Software can often spot these as long as the analytic has a database of countries, provinces, states, cities and so on, both current and historic.
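To make the location analytic concrete, here is a minimal sketch in plain Python. It does not use the Apache UIMA API. The gazetteer contents and the function name `find_locations` are hypothetical, chosen only for illustration, and a real analytic would draw on a far larger database of place names.

```python
import re

# Hypothetical gazetteer: a real analytic would load a large database of
# current and historic countries, provinces, states and cities.
GAZETTEER = {"Constantinople", "Prussia", "Vienna", "New Amsterdam"}


def find_locations(text):
    """Return (start, end, name) for every gazetteer entry found in text."""
    # Try longer names first so a multi-word place wins over a shorter overlap.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]


sample = "The envoy left Vienna for Constantinople in 1683."
for start, end, name in find_locations(sample):
    print(start, end, name)
```

A plain lookup like this misses misspellings and ambiguous names, which is one reason the quality of the place-name database matters so much.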
As each document is analyzed, the analytic outputs artifact metadata. In this example, the output might be a collection of documents that are copies of the originals, but with XML tags applied to the locations. Alternatively, the analytic can use the stand-off annotation model, in which the annotations are stored separately as pointers into the original rather than as tags embedded in it.
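The difference between the two output styles can be sketched as follows. This is an illustration in plain Python, not the UIMA data model itself. The tag name `location`, the dictionary keys and both function names are assumptions made for the example.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, spans, tag="location"):
    """Return a copy of text with each (start, end) span wrapped in an XML tag."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(escape(text[cursor:start]))  # untouched text before the span
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))  # remainder after the last span
    return "".join(out)


def to_standoff(spans, tag="location"):
    """Return annotations kept apart from the text, pointing into it by offset."""
    return [{"type": tag, "begin": s, "end": e} for s, e in sorted(spans)]


text = "Trade between Venice & Alexandria grew."
# Character offsets of the two place names in the original text.
spans = [(text.index(w), text.index(w) + len(w)) for w in ("Venice", "Alexandria")]

print(to_inline_xml(text, spans))
print(to_standoff(spans))
```

The inline version changes the document, so characters such as `&` have to be escaped. The stand-off version leaves the original untouched, which also makes it usable for artifacts that cannot hold embedded tags.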
UIMA analysis applications tend to be strung together, each doing a specialized job. So the location analytic might pass its results to a date analytic that looks for and tags dates, which in turn might pass its results to another that looks for and tags names. Each time through, more structure is added to the originally unstructured data, allowing it to be processed further, for example to build historic maps and resources automatically.
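A chain of this kind might be sketched as below, again in plain Python rather than with any UIMA library. The two analytics, their deliberately tiny patterns and the `run_pipeline` helper are all hypothetical. The point is only that each stage receives the annotations collected so far and adds its own.

```python
import re


def tag_locations(text, annotations):
    """Toy location analytic: recognizes two hard-coded place names."""
    for m in re.finditer(r"\b(Rome|Carthage)\b", text):
        annotations.append(("location", m.start(), m.end()))
    return annotations


def tag_dates(text, annotations):
    """Toy date analytic: recognizes years written like '146 BC' or '79 AD'."""
    for m in re.finditer(r"\b\d{1,4} (BC|AD)\b", text):
        annotations.append(("date", m.start(), m.end()))
    return annotations


def run_pipeline(text, analytics):
    """Pass the growing annotation list through each analytic in order."""
    annotations = []
    for analytic in analytics:
        annotations = analytic(text, annotations)
    # Sort by start offset so the result reads in document order.
    return sorted(annotations, key=lambda a: a[1])


text = "Rome destroyed Carthage in 146 BC."
for kind, start, end in run_pipeline(text, [tag_locations, tag_dates]):
    print(kind, text[start:end])
```

Because every analytic shares the same signature, a name analytic could be appended to the list without changing the other stages.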
By now you might realize that creating small, specialized analytics allows for a great deal of reuse, which will hopefully encourage the sharing of analytic tools. A strong community could greatly expand how much unstructured content gets structured, and make it easier to find and use the data created by the majority of our communications.
How Do I Learn More?
"While OASIS has been developing the UIMA standard, the Apache Software Foundation has hosted an incubator project for UIMA-based open source software," noted Laurent Liscia, executive director of OASIS. "This is an exciting example of how open standards and open source development projects can complement one another to everyone's benefit."
To learn more about UIMA, how it's being used right now and its implications, see:
- The OASIS Unstructured Information Management Architecture (UIMA) Version 1.0 Standard
- The Apache Software Foundation's UIMA incubator project

