Researchers at Rensselaer Polytechnic Institute (RPI) have thrown themselves head over heels into the semantic web. At this year's 8th International Semantic Web Conference, they presented a paper focused on the problem of automatically generating the metadata that many semantic web functions rely on.
Talking about RDF closures can cause eyelids to flutter, but the RPI folks aren't living in some theoretical world. They've been digging hard into large, public data sets and have learned some important lessons about making data more useful. Come join us for a closer look.
A Quick RDF Primer
Before I get into the tech, here's a quick primer for those who don't know their RDFs from their RFCs. The Resource Description Framework, or RDF, is a W3C data model, most often written in an XML-based syntax, for representing information about resources on the web. In other words, it lets you attach metadata to a web page, where that metadata can describe the page itself or parts of its content.
Ultimately, RDF is the heart of the Semantic Web.
The most important RDF concept to understand is that of the RDF Triple. A triple has, as one might guess, three parts: the Subject, the Predicate and the Object (Note: you will also see Subject, Property and Value — don't fret, this is just another way of saying the same thing). An expression in RDF — also referred to as an RDF Graph — is a collection of these triples.
[Figure: an RDF triple, showing the Subject, the Predicate (or Property) and the Object (or Value)]
The meaning (or semantic value) of an RDF expression is that some relationship — defined by the RDF Predicate/Property — exists between the RDF Subject and the RDF Object/Value. In the end, that is as simple as the semantic web gets.
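To make the triple idea concrete, here is a minimal Python sketch that models a triple as a three-field tuple and a graph as a set of such tuples. The `Triple` class, the `values_of` helper and the `example.org` names are all hypothetical, invented purely for illustration; real RDF tooling uses richer types for URIs and literals.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF statement: subject, predicate (property), object (value)."""
    subject: str
    predicate: str
    obj: str


# An RDF graph is just a collection of triples. A set is a natural fit,
# because stating the same thing twice adds no new information.
graph: Set[Triple] = set()
graph.add(Triple("http://example.org/page", "http://example.org/vocab#author", "Jane Doe"))
graph.add(Triple("http://example.org/page", "http://example.org/vocab#topic", "Semantic Web"))
graph.add(Triple("http://example.org/page", "http://example.org/vocab#author", "Jane Doe"))  # duplicate, ignored


def values_of(graph: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """Return every object linked to `subject` by `predicate` (hypothetical helper)."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}


print(len(graph))
print(values_of(graph, "http://example.org/page", "http://example.org/vocab#author"))
```

The graph ends up holding two statements, not three, since the repeated triple collapses into the first one.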
When using RDF, the contents of your triples come from your selected RDF Schemas (RDFS). An RDF Schema is used to define the vocabulary that can be used when building your triples. Remember that your subject is connected to your object with a property. That property needs to be defined clearly in the schema if meaning is to be clearly derived from each triple.
You can read more about RDF Schemas in the W3C RDF Primer.
RDF in Action
The idea behind RDF is to give us geeks a simple way to make statements about things on the web and have machines understand us. To bring this down from the ivory tower, let's look at an example.
Let's say I want to express that the website found at http://www.google.com was created by Larry Page. In this case we have the following 3 things that comprise our assertion:
- The subject: http://www.google.com
- The predicate or property: creator
- The object or value: Larry Page
Now to put this into RDF syntax, we do the following:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.google.com">
    <dc:creator>Larry Page</dc:creator>
  </rdf:Description>
</rdf:RDF>
Note that in the above example, the schema used to define the property 'creator' is the Dublin Core.
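To see how a machine might read that statement back out, here is a small sketch using only Python's standard-library `xml.etree.ElementTree`. The `extract_triples` function is a hypothetical helper written for this article; it handles only the simple `rdf:Description` pattern shown above, not the full RDF/XML syntax, for which a dedicated RDF library would be the right tool.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.google.com">
    <dc:creator>Larry Page</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for simple rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree reports tags as "{namespace}local"; dropping the
            # braces yields the full property URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```

The single triple that comes back has `http://www.google.com` as its subject, the Dublin Core `creator` URI as its predicate and `Larry Page` as its object, matching the three-part breakdown above.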
The Rensselaer Research
The Rensselaer paper digs into the problem of automatically generating an RDF Schema closure. A closure in this context consists of all of the possible triples that can be derived from a data set using the rules and vocabulary within the associated RDF Schema. To take a simple example, if the only vocabulary defined was 'creator', then once every item in the data set had its creator metadata defined, the closure would be complete.
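For a feel of what "generating the closure" involves, here is a deliberately naive Python sketch that keeps applying inference rules until no new triples appear. It is not the RPI researchers' algorithm: it runs in a single process, covers only three of the standard RDFS rules (rdfs7, rdfs9 and rdfs11), and uses abbreviated names such as `ex:founder` and `ex:SearchEngine` that are made up for illustration.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def rdfs_closure(triples):
    """Naive fixed-point closure over a small subset of the RDFS rules."""
    closure = set(triples)
    while True:
        inferred = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # rdfs11: subClassOf is transitive.
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    inferred.add((s, SUBCLASS, o2))
                # rdfs9: an instance of a class is an instance of its superclasses.
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:
                    inferred.add((s, RDF_TYPE, o2))
                # rdfs7: a statement made with a property also holds for its superproperty.
                if p2 == SUBPROP and p == s2:
                    inferred.add((s, o2, o))
        new_triples = inferred - closure
        if not new_triples:
            return closure  # nothing new can be derived: the closure is complete
        closure |= new_triples


# Hypothetical input data: two instance statements plus three schema statements.
data = {
    ("http://www.google.com", "ex:founder", "Larry Page"),
    ("http://www.google.com", RDF_TYPE, "ex:SearchEngine"),
    ("ex:founder", SUBPROP, "dc:creator"),
    ("ex:SearchEngine", SUBCLASS, "ex:Website"),
    ("ex:Website", SUBCLASS, "ex:Resource"),
}

closure = rdfs_closure(data)
for triple in sorted(closure - data):
    print(triple)
```

Starting from five triples, the loop derives four more, including the statement that `http://www.google.com` has `dc:creator` Larry Page. Comparing every triple against every other triple on each pass is fine for a toy input, but it is exactly the kind of cost that becomes prohibitive at hundreds of millions of triples.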
Having software that can generate all the possible triples in a timely manner is vital if one wants to apply RDF metadata to large sets of information. With this project, the researchers started from a data set of around 345 million triples plus the schema's rules. Using 128 parallel processes, their code managed to produce around 650 million triples in 8 minutes and 25 seconds.
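To put those numbers in perspective, a quick back-of-the-envelope calculation in Python converts them into throughput. The inputs are the approximate figures quoted above, and the variable names are my own; the per-process figure assumes the work was spread evenly, which the article does not state.

```python
# Approximate figures quoted in the article.
triples_produced = 650_000_000
processes = 128
elapsed_seconds = 8 * 60 + 25  # 8 minutes and 25 seconds

# Overall output rate, and the rate per process assuming an even split.
overall_rate = triples_produced / elapsed_seconds
per_process_rate = overall_rate / processes

print(f"{elapsed_seconds} seconds elapsed")
print(f"{overall_rate:,.0f} triples per second overall")
print(f"{per_process_rate:,.0f} triples per second per process")
```

That works out to roughly 1.3 million triples per second across the cluster, or on the order of 10,000 per second for each process.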