When we talk about the Semantic Web we mean more meta-information hidden in the page code, but derived from the content itself, with the aim of letting Web services and search engines know exactly what's there without having to guess from keywords and tags. XML is one format which can structure content to carry more classification material. RDF is the preferred data model: it splits content into entities and relationships, and it is most often serialized as XML.
Paul Wlodarczyk of JustSystems thinks the Semantic Web is closer than we think. He's written a great post about it, focusing particularly on how XML can take ambiguity out of search and enable other semantic advantages. He likens the 'old' Web to one where the wisdom of crowds prevails (more backlinks equals better content), and the RDF-structured Web to one where the 'wisdom of authors' wins; '...who can let the crowds know, in no uncertain terms, what their content means.' So a post on a New York sports team involved in a trade with an L.A. sports team (using Paul's example) is unambiguously about the Knicks and the Lakers, or whatever.
More importantly, using RDF/XML, relationships between those entities can be formally identified. So a player trade between those two teams can be tagged as such, and making relationships explicit in Web content opens up a whole new world of potential Web usage, not just in search but in content mashups, Web monitoring and intelligence, social marketing, targeted advertising and so on.
There are three ingredients needed to enable and popularize the Semantic Web, so let's look at them, with a little help from Wlodarczyk, and see if we really are on the cusp of a content revolution.
The Recipe for the Semantic Web
1: Markup Technology
XML and RDF: Check. XML has been around forever; it is mature, familiar to developers, optimized for the Web and easy to learn. The rapidly evolving XBRL will hopefully bring classification and order to financial and business data. W3C's RDF framework doesn't have to be serialized as XML, but RDF/XML looks like the winning ticket.
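To make the markup concrete, here is a minimal sketch of how the player-trade example might be written as RDF/XML using only Python's standard library. The `ex:` vocabulary (`PlayerTrade`, `fromTeam`, `toTeam`) and all the `example.org` URIs are made up for illustration; they are not from Wlodarczyk's post or any published ontology. The point is only that each statement becomes an explicit subject, predicate and object rather than a keyword.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # the real RDF namespace
EX_NS = "http://example.org/sports#"  # hypothetical vocabulary, for illustration only

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)


def build_trade_rdf(trade_uri, from_team_uri, to_team_uri):
    """Describe one player trade as an RDF/XML element tree."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    trade = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": trade_uri}
    )
    # Each child element is one statement about the trade.
    ET.SubElement(
        trade, f"{{{RDF_NS}}}type", {f"{{{RDF_NS}}}resource": EX_NS + "PlayerTrade"}
    )
    ET.SubElement(
        trade, f"{{{EX_NS}}}fromTeam", {f"{{{RDF_NS}}}resource": from_team_uri}
    )
    ET.SubElement(
        trade, f"{{{EX_NS}}}toTeam", {f"{{{RDF_NS}}}resource": to_team_uri}
    )
    return root


def triples(root):
    """Read the statements back as (subject, predicate, object) tuples."""
    found = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree stores tags as "{namespace}local"; join them into one URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            found.append((subject, predicate, prop.get(f"{{{RDF_NS}}}resource")))
    return found


if __name__ == "__main__":
    doc = build_trade_rdf(
        "http://example.org/trades/1",
        "http://example.org/teams/knicks",
        "http://example.org/teams/lakers",
    )
    print(ET.tostring(doc, encoding="unicode"))
    for triple in triples(doc):
        print(triple)
```

A real deployment would more likely lean on a dedicated RDF library and a shared vocabulary; the standard library is used here only to keep the sketch self-contained.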
2: Transforming Content and Legacy Content to Rich Formats
Check. The key here is technologies like Calais which will auto-tag content and render it as XML. You can test this for yourself: go to Clear Forest's tag-generator and XML file rendering service (it has an open API) to check it out. You can paste text or import a file, see how the content has been tagged, and save the output as an XML file. Interestingly, Clear Forest was bought by Reuters last year.
The next step here may be unleashing an army of crawlers, which might trawl through your content rendering it as XML and perhaps mirroring it. Perhaps a shadow Web might emerge, with legacy content living both as HTML and in a new location as XML, the latter to be optimized for the new search technologies and services?
Flights of fancy aside, the last time we talked with Acquia (commercial Drupal distros, co-founded by Drupal founder Dries Buytaert), they were talking about bundling Open Calais with their first commercial Drupal package, and there's already a Drupal Calais module.
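For a feel of what "auto-tag content and render it as XML" means, here is a toy stand-in in Python. It is emphatically not how Calais works and does not call its API: it just looks up names in a small hand-written dictionary (the `GAZETTEER` below, along with the `document`, `text` and `entity` element names, are our own assumptions) and wraps each hit in an XML element. Real services use far more sophisticated language processing.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup table of known names and their entity types.
GAZETTEER = {
    "Knicks": "SportsTeam",
    "Lakers": "SportsTeam",
    "Reuters": "Company",
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Wrap every known name found in `text` in an <entity type="..."> element."""
    root = ET.Element("document")
    body = ET.SubElement(root, "text")
    body.text = ""
    if not gazetteer:
        body.text = text
        return root

    # Longest names first so a longer name wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    last = None  # most recently added <entity>, whose tail receives following text
    pos = 0
    for match in pattern.finditer(text):
        chunk = text[pos:match.start()]
        if last is None:
            body.text += chunk
        else:
            last.tail = chunk
        last = ET.SubElement(body, "entity", {"type": gazetteer[match.group(1)]})
        last.text = match.group(1)
        pos = match.end()

    rest = text[pos:]
    if last is None:
        body.text += rest
    else:
        last.tail = rest
    return root


if __name__ == "__main__":
    tagged = tag_entities("The Knicks traded a player to the Lakers.")
    print(ET.tostring(tagged, encoding="unicode"))
```

The original text survives untouched inside the XML, which is what would let legacy HTML content and its tagged XML twin live side by side.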
3: Market for Semantic Markup
Check -- maybe. There are a bunch of semantic search engines out there, and we looked at a fair portion of them in our appraisal of current search technologies last week. None of them are threatening to imminently usurp Google, though, for mainstream Web search.
But then there's Yahoo! SearchMonkey, which we also covered in that article and elsewhere at this fine content management resource, which is still in development and which is getting Wlodarczyk and JustSystems pretty excited. SearchMonkey is "...a new technology that rates web pages. Rather than use keywords and number of links to the page (the wisdom of crowds), SearchMonkey finds web pages using the semantic markup that is embedded in the page (the wisdom of authors). This creates a substantial motive for adding semantic markup: search engine optimization." The key point is that Yahoo! still packs some muscle, and particularly since the platform will enable Web services, SearchMonkey may well be the killer application which convinces content generators that they should get their content marked up.
We're pretty close to a richer Web, there's no doubt. But whether in five years Web content will be written in XML/RDF as a matter of course is open to debate. But hell, that's what comments boards are for. Tell us what you think.
We may be doing the company a disservice, but content seems to come out of JustSystems from Jake Sorofman, Paul, etc. in the form of email bulletins and there doesn't seem to be a corporate blog. Which is why we can't give you the permalink to Paul's post just yet. This is more than vaguely ridiculous, not least because the content from the guys there is exceptional. [nb we got it in the end -- see UPDATE below]
For God's sake JustSystems: get your blogging selves sorted out, stick up a proper blog to spread the word and include permalinks in your email bulletins. We'll post a permalink to Paul's post as and when we get it.
Or perhaps we're just blind and can't see the blog at JustSystems.com. In which case, um, sorry.
[UPDATE: We eventually found the JustSystems blog. It was buried in the 'About' page, with a link to 'Executive Summaries' at the top. It was a real 'Where's Waldo' adventure trying to find a permalink, and they should really do something about this.]
Here's the link to Paul's post (nasty URL).
About the Author
John Conroy is a PhD candidate at the National University of Ireland Galway where he hacks with Python and researches information retrieval and data mining, social graphs and interaction graphs.