Rensselaer-leader.jpg
Researchers with the Rensselaer Polytechnic Institute (RPI) have thrown themselves head over heels into the semantic web. At this year's 8th International Semantic Web Conference, they presented a paper (download the PDF) focused on the problem of automatically generating the metadata that many semantic web functionalities rely on. 

Talking about RDF closures can cause eyelids to flutter, but the RPI folks aren't living in some theoretical world. They've been digging hard into large, public data sets and have learned some important lessons about making data more useful. Come join us for a closer look.

A Quick RDF Primer

Before I get into the tech, here's a quick primer for those who don't know their RDFs from their RFCs. The Resource Description Framework, or RDF, is a W3C standard for representing information about resources on the web, most commonly written in an XML-based syntax. In other words, it allows you to embed metadata in a web page, where that metadata can refer to the page itself or parts of its content.

Ultimately, RDF is the heart of the Semantic Web.

The most important RDF concept to understand is that of the RDF Triple. A triple has, as one might guess, three parts: the Subject, the Predicate and the Object (Note: you will also see Subject, Property and Value -- don't fret, this is just another way of saying the same thing). An expression in RDF -- also referred to as an RDF Graph -- is a collection of these triples.

rdf-triple-2009-03.jpg

Resource Description Framework (RDF) Triple -- Subject, Predicate (or Property) and Object (or Value)

The meaning (or semantic value) of an RDF expression is that some relationship -- defined by the RDF Predicate/Property -- exists between the RDF Subject and the RDF Object/Value. In the end, that is as simple as the semantic web gets.

When using RDF, the vocabulary for your triples comes from your selected RDF Schema (RDFS). An RDF Schema defines the terms that can be used when building your triples. Remember that your subject is connected to your object with a property. That property needs to be defined in the schema if meaning is to be reliably derived from each triple.

You can read more about RDF Schemas in the W3C RDF Primer.
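If you like to see things in code, here's a tiny, hypothetical schema sketch built with rdflib, a widely used Python RDF library (both the library choice and the ex: 'reviewer' vocabulary are my own illustration, not something from the RPI work). It declares a made-up 'reviewer' property, along with the classes it connects, so that any tool reading your triples knows what the property means:

# A minimal, hypothetical RDF Schema fragment built with Python's rdflib.
# The ex: namespace and 'reviewer' property are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/vocab#")   # invented namespace

schema = Graph()
schema.bind("ex", EX)

# Two classes and one property connecting them
schema.add((EX.Document, RDF.type, RDFS.Class))
schema.add((EX.Person, RDF.type, RDFS.Class))
schema.add((EX.reviewer, RDF.type, RDF.Property))
schema.add((EX.reviewer, RDFS.domain, EX.Document))   # subjects are Documents
schema.add((EX.reviewer, RDFS.range, EX.Person))      # objects are Persons
schema.add((EX.reviewer, RDFS.comment,
            Literal("Links a document to the person who reviewed it.")))

print(schema.serialize(format="xml"))   # emit the schema as RDF/XML

Any triple that uses ex:reviewer can now be interpreted consistently, because the schema pins down what kind of thing sits on each end of the property.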

RDF in Action

The idea behind RDF is to give us geeks a simple way to make statements about things on the web and have machines understand us. To bring this down from the ivory tower, let's look at an example.

Let's say I want to express that the website found at http://www.google.com was created by Larry Page. In this case, our assertion is made up of the following three things:

  1. The subject: http://www.google.com
  2. The predicate or property: creator
  3. The object or value: Larry Page

Now to put this into RDF syntax, we do the following:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.google.com">
    <dc:creator>Larry Page</dc:creator>
  </rdf:Description>
</rdf:RDF>

Note that in the above example, the property 'creator' comes from the Dublin Core schema (the dc: namespace declared at the top).
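If you'd rather generate that XML than type it by hand, an RDF library can build the triple for you. Here's a minimal sketch using Python's rdflib (my choice for illustration; the article's example doesn't depend on it) that constructs the same statement and serializes it as RDF/XML:

# Build the Google / Larry Page triple programmatically and emit RDF/XML.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DC   # Dublin Core elements namespace

g = Graph()
g.bind("dc", DC)

g.add((URIRef("http://www.google.com"),   # subject
       DC.creator,                        # predicate: dc:creator
       Literal("Larry Page")))            # object

print(g.serialize(format="xml"))          # equivalent to the snippet above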

The Rensselaer Research

The Rensselaer paper digs into the problem of automatically generating an RDF Schema's closure. A closure, in this context, is the original set of triples plus every additional triple that the rules and vocabulary of the associated RDF Schema allow you to infer from them. To take a simple example, if the schema said that 'creator' is a more specific kind of 'contributor', then for every triple naming the creator of an item, the closure would also contain a triple naming that person as a contributor. Once every such inference has been made, the closure is complete.
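To make "closure" a little more concrete, here is a deliberately naive sketch in Python with rdflib -- nothing like the parallel implementation described in the paper -- that forward-chains a single RDFS rule (rdfs:subPropertyOf) until no new triples can be added. The ex:contributor property is invented for the example:

# Naive forward chaining of one RDFS rule, for illustration only:
# if (p rdfs:subPropertyOf q) and (s p o) both hold, then infer (s q o).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDFS

EX = Namespace("http://example.org/vocab#")    # invented namespace

g = Graph()
# Schema: dc:creator is a more specific kind of ex:contributor
g.add((DC.creator, RDFS.subPropertyOf, EX.contributor))
# Data: a single 'creator' triple
g.add((URIRef("http://www.google.com"), DC.creator, Literal("Larry Page")))

changed = True
while changed:                                 # repeat until a fixed point
    changed = False
    for p, _, q in list(g.triples((None, RDFS.subPropertyOf, None))):
        for s, _, o in list(g.triples((None, p, None))):
            if (s, q, o) not in g:
                g.add((s, q, o))               # add the inferred triple
                changed = True

for triple in g:                               # the closure now includes the
    print(triple)                              # inferred 'contributor' triple

The real system applies the full set of RDFS rules and splits the work across many parallel processes, which is what makes the scale described below possible.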

Having software that can generate all the possible triples in a timely manner is vital if one wants to apply RDF metadata to large sets of information. In this project, the researchers started with a dataset of around 345 million triples plus the schema's inference rules. Using 128 parallel processes, their code managed to produce around 650 million triples in 8 minutes and 25 seconds.

Once all of the triples are generated, the semantic web magic can begin.

The Source: A Wealth of Government Data

Countries churn out a truly stupendous amount of data. Fortunately, more and more countries are realizing that one way to give the taxpayers something back for the money plucked from their wallets is to make non-sensitive data available to them.

One such country is the U.S. In particular, the RPI team is using Data.gov, which offers "high value, machine readable datasets generated by the Executive Branch of the [U.S.] Federal Government."


Through this site, you can access exciting information such as annual Federal Registers -- "the official daily publication for rules, proposed rules, and notices of Federal agencies and organizations, as well as executive orders and other presidential documents" -- and Toxics Release Inventories for various locations, a publicly available EPA database that contains information on toxic chemical releases and waste management activities, reported annually by certain industries as well as federal facilities.

You'll also find a collection of tools such as the Centers for Medicare and Medicaid Services Statistics Booklet and a Geodata catalog with information such as the Agricultural Minerals Operations map layer showing active plants and mines surveyed by the Minerals Information Team.

It's all enough to send a data junkie over the moon.

The Practical Semantic Web

Remember that each piece of data needs context before a computer knows what to do with it, and that that context is typically supplied in the form of metadata. With this metadata in place, one can build a semantic model of a dataset, where relationships are inferred by building out the closure. Once all of the triples entailed by the schema's rules are in place, the data can be queried and manipulated.

As the researchers at RPI work through their experiments, they're sharing their results on their Data-gov Wiki. This site offers a growing collection of demonstrations of not only what they've accomplished, but of the power of standards such as RDF.

Take, for example, their demo Changes of Public Library Statistics (2002 - 2008). On their wiki page you can see exactly which data sets and technologies they used -- in this case RDF, SPARQL, SPARQL Query Web Service, XSLT Web Service and the Google Visualization API.
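You don't have to write any queries to use the demos, but to give a flavor of what a SPARQL query against such a graph looks like, here is a small hypothetical sketch in Python with rdflib. The ex: vocabulary, resource names and numbers are all invented for illustration; the actual Data.gov-derived datasets use their own properties:

# Hypothetical SPARQL query over a tiny, made-up library-statistics graph.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/libstats#")   # invented vocabulary

g = Graph()
g.add((EX.ny2002, EX.state, Literal("New York")))    # made-up record
g.add((EX.ny2002, EX.year, Literal(2002)))
g.add((EX.ny2002, EX.visits, Literal(120000000)))    # made-up value

results = g.query("""
    PREFIX ex: <http://example.org/libstats#>
    SELECT ?state ?year ?visits
    WHERE {
        ?record ex:state  ?state ;
                ex:year   ?year ;
                ex:visits ?visits .
    }
    ORDER BY ?year
""")

for state, year, visits in results:
    print(state, year, visits)

A query along these lines is what sits behind a chart: the SPARQL service returns rows, and a visualization library such as the Google Visualization API turns them into something you can click on and animate.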

They also offer a link to a live demonstration of their results. Here, you can click on almost any aspect of the chart to customize what information it presents and in what colors, and change the type of chart as well.

Rensselaer.jpg

View information about libraries over a span of 7 years, thanks to semantic web technologies.

Click the Play button and you see an animation showing how things changed from 2002 to 2008. At any point you can pause and look over a specific moment more closely.

Without the schema and the closure of that schema, it wouldn't have been possible to work with this much data in such detail.

No doubt work will continue on generating such closures even faster and at even larger scale, making it feasible to work with data sets spanning perhaps more years or larger portions of the globe. They'll also have to find ways to process large amounts of data quickly without needing access to 128 processors -- not everyone has a cluster handy.

But for the moment, these experiments make for a pretty good "better than nothing," and they offer both optimism and a little preview of the semantically enabled world that is yet to come.

[Editor's Note: For more down to earth semantic web coverage, see our article RDFa, Drupal and a Practical Semantic Web.]