The BBC’s website for the 2010 World Cup was notable for the raw amount of rich information that it contained. Every player on every team in every group had their own web page, and the ease with which you could navigate from one piece of content to the next was remarkable. Within the Semantic Web community, the website was notable for one more reason: it was made possible by the BBC’s embrace of Semantic Web technologies.
In the first two articles in this series on the Semantic Web, we first looked at an overview of the Semantic Web and then looked at Semantic Web technologies in detail. This time around, I’m pleased to share the story of the BBC’s adoption of Semantic Web technologies to help power BBC Online, as related to me by Yves Raimond (BBC Senior R&D Engineer), Michael Smethurst (BBC Senior Content Producer) and Olivier Thereaux (BBC Senior Technologist).
Lee Feigenbaum (LF): When did the BBC first start using Semantic Web technologies? When did the technologies first move to production?
Yves Raimond, Michael Smethurst and Olivier Thereaux (BBC): It's difficult to pinpoint an exact moment when the BBC first started to use Semantic Web technologies. It was more something we have evolved toward from a shared approach and shared philosophy. We have been thinking in Linked Data terms for seven or eight years without necessarily using specific technologies. A rough chronology would be:
- 2004: Work started on PIPs (programme information pages), which aimed to create a Web page for every radio programme broadcast by the BBC. This began our approach of using one page (one URL) per thing and one thing per page (URL).
- 2005: Tom Coates published "The Age of Point-at-Things," a blog post filling out some of the thinking behind giving things identifiers and making those identifiers HTTP URIs. Also in 2005, BBC Backstage was launched as an attempt to open BBC data and build a developer community around that data.
- 2006: Work began on /programmes, a replacement for PIPs covering both radio and TV. Around the same time we bought -- in bulk -- copies of Eric Evans's "Domain-Driven Design," which influenced the way we designed and built websites to expose more of the domain model to users. Building on Backstage, we added data views to /programmes (JSON, XML, YAML, etc.).
- 2007: We started work on rebuilding /music as a way to add music context to our news and programmes. Because we didn't have our own source of music metadata, we looked for people to partner with and settled on MusicBrainz because of their liberal data licensing. Previously we had silo’ed micro-sites for programmes and music. By stitching MusicBrainz artist identifiers into our playout systems we linked up these silos and allowed journeys between /programmes and /music. At the same time as we started to consume open data, we also started to publish Linked Open Data, creating the Programmes Ontology and adding RDF to both /music and /programmes. At the time, we found it much easier to develop separate but related applications in a loosely coupled fashion by dogfooding our own data: /programmes uses data views from /music and vice versa.
- 2008: We rebuilt more of bbc.co.uk (/nature and /food) according to domain-driven design and Linked Data principles, publishing a Wildlife Ontology and RDF for /nature. Again we borrowed open data to build a framework of context around our content: this was the start of us using the web as our CMS and the web community as our editors.
Up to this point we'd published ontologies and RDF and also consumed RDF, but we were still using relational databases (rather than triple stores) to serve websites.
- 2010: Published the World Cup website using a BigOWLIM triple store [LF: a triple store is a database that stores RDF data]. News articles were tagged with entities in the triple store, and inference was used to propagate those tags to all relevant entities through the graph.
- 2011: Rolled out the World Cup approach across the whole of BBC Sport.
- 2012: Rolled out the Olympics site using the same model as BBC Sport.
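To make the 2010 entry more concrete, here is a minimal sketch of what propagating tags through a graph can look like. It uses plain Python tuples in place of a real triple store, and every identifier and predicate in it (`player:alpha`, `playsFor`, `inGroup` and so on) is invented for illustration; it is not the BBC's ontology or its inference rules, only one simple way such propagation might work.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object), invented for illustration.
TRIPLES = [
    ("player:alpha", "playsFor", "team:red"),
    ("player:beta", "playsFor", "team:blue"),
    ("team:red", "inGroup", "group:one"),
    ("team:blue", "inGroup", "group:one"),
]

# Assumption: a tag on an entity also applies to whatever it plays for or belongs to.
PROPAGATING = {"playsFor", "inGroup"}


def propagate_tags(triples, article_tags, propagating=PROPAGATING):
    """Return each article's editorial tags plus every entity reachable
    from those tags by following the propagating predicates."""
    edges = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in propagating:
            edges[subject].add(obj)

    inferred = {}
    for article, tags in article_tags.items():
        seen = set(tags)
        stack = list(tags)
        while stack:
            entity = stack.pop()
            for neighbour in edges.get(entity, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        inferred[article] = seen
    return inferred


def index_pages(inferred):
    """Invert article -> entities into entity -> sorted list of articles."""
    pages = defaultdict(list)
    for article, entities in inferred.items():
        for entity in entities:
            pages[entity].append(article)
    return {entity: sorted(articles) for entity, articles in pages.items()}


# An editor tags each article with a single player only.
ARTICLE_TAGS = {
    "news:1": {"player:alpha"},
    "news:2": {"player:beta"},
}

pages = index_pages(propagate_tags(TRIPLES, ARTICLE_TAGS))
print(pages["team:red"])   # articles for the team page
print(pages["group:one"])  # articles for the group page
```

In this sketch a single editorial tag on a player is enough for the article to show up on the player, team and group pages, which is the kind of leverage the tagging-plus-inference approach is meant to give.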
LF: Could you describe the main use cases of Semantic Web technologies at the BBC? Would you characterize these use cases as “dynamic content publishing”?
BBC: Our use of Linked Data breaks down into three areas:
[LF: the term Linked Data refers to a specific set of best practices for working with Semantic Web (RDF) data; the term Linked Open Data refers to Linked Data that is freely available on the Web.]
- Publishing Linked Data: to make our content more findable (e.g. by search engines) and more linkable (e.g. via social media or by other Linked Data publishers using the same vocabularies and identifiers);
- Consuming Linked Data: to “borrow” additional context for our content where we don’t have existing data and want to cut content by specific domains (music, nature, food, sport). The Linked Open Data that we use also helps give us additional links between domains.
- Managing data internally as Linked Data: to maximize the use we get out of editorial input by propagating editorially added links across data graphs; to make more links between otherwise siloed sites.
It’s not really accurate to call our use cases “dynamic content publishing.” Our actual content (TV and radio programmes and news articles) is still fairly static. The Linked Data / Domain Driven Design approach is less about dynamic content and more about dynamic context and dynamic aggregations around that content that let us maximize exposure to our content by placing it in different contexts (wildlife, music, food, football, etc.).
Because bbc.co.uk has content in so many domains, it’s like a microcosm of the web. One of our goals with this work is to move from a set of silo’ed sites to a coherent service, which we can only do if our content is well described and interlinked. Finally, by using domain-native URL keys we can generate more inbound link density and make our content more findable on search engines.
LF: How did the BBC produce these sites before the Semantic Web approach?
BBC: By hand. There was lots of hand-rolling of microsites around specific items. There were lots of aggregations maintained by editorial hands. The Semantic Web approach meant that we could provide many more aggregations and many more routes to content at lower cost.
LF: Have you been able to measure any results from these efforts?
BBC: For the Olympics, the Dynamic Semantic Publishing (DSP) architecture allowed us to offer a single page for every country (200+), every athlete (10000+), every discipline/event (400-500) and every venue. All of these pages were complete with aggregated relevant stats and news.
[LF: A blog entry about the 2010 World Cup site indicates that it has over 700 pages for teams, groups and players. This was an amount of content that never would have even been considered without the automated Semantic Web approach. The same blog entry puts this into perspective: “The World Cup site had more index pages than the rest of the [hand-edited] BBC Sport site in its entirety.”]
LF: Were there particular things that you learned from the World Cup and were able to change for the Olympics?
BBC: The World Cup site worked. Everything that we learned at a relatively small scale from the World Cup site could be applied to the Olympics, which was an order of magnitude more complex.
In the context of adapting our architecture to the complexity of all of the different Olympic events and disciplines, we made one significant change: We added a MarkLogic XML database. In the words of Senior Technical Architect David Rogers:
"Fundamental to this approach was the use of MarkLogic to store and retrieve data. MarkLogic is an XML database which uses XQuery to store and retrieve data. Given the timescales, this project would not have been achievable using a SQL database, which would have pushed the design towards more complete modelling of the data. Using MarkLogic, we could write a complete XML document, and retrieve that document either by reference to its location, a URI, or using XQuery to define criteria about its contents."
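The two retrieval styles described in the quote, by URI and by criteria about a document's contents, can be illustrated with a small sketch. This is a toy in-memory stand-in written with Python's standard library, not MarkLogic's API and not XQuery; the `DocumentStore` class, the URIs and the XML shapes are all assumptions made for the example, and the query uses the limited XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET


class DocumentStore:
    """Toy in-memory stand-in for an XML document database (hypothetical)."""

    def __init__(self):
        self._docs = {}  # URI -> parsed XML root element

    def put(self, uri, xml_text):
        """Store a complete XML document under a URI, with no schema up front."""
        self._docs[uri] = ET.fromstring(xml_text)

    def get(self, uri):
        """Retrieve a document by reference to its location."""
        return self._docs[uri]

    def search(self, path):
        """Retrieve the URIs of documents whose contents match a path expression."""
        return sorted(
            uri for uri, root in self._docs.items() if root.find(path) is not None
        )


store = DocumentStore()
store.put(
    "/athletes/1.xml",
    "<document><athlete><name>Example One</name>"
    "<country>AAA</country></athlete></document>",
)
store.put(
    "/athletes/2.xml",
    "<document><athlete><name>Example Two</name>"
    "<country>BBB</country></athlete></document>",
)
store.put(
    "/venues/1.xml",
    "<document><venue><name>Example Stadium</name></venue></document>",
)

# Retrieval by URI.
by_uri = store.get("/athletes/1.xml").findtext("athlete/name")

# Retrieval by criteria about the contents.
by_content = store.search(".//athlete[country='AAA']")

print(by_uri)
print(by_content)
```

Note that athlete and venue documents sit side by side without a shared table definition, which loosely mirrors the point in the quote about not being pushed towards more complete modelling of the data.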
LF: Are there other uses of Semantic Web technology not related to content publishing that are being explored within BBC?
BBC: We are currently exploring various other uses of Semantic Web technologies within BBC R&D. In particular we’re looking at ways in which Linked Data can be used to help search and discovery of archive content. We have been working on automatically identifying the topics and the contributors for BBC programmes from their content, using a combination of Linked Data, signal processing, speech-to-text and Named Entity Recognition technologies, which we have been talking about in various places, such as the Linked Data on the Web workshop and at WWW’2012. The automatically generated links from programmes to entities described in the Linked Data cloud might be incorrect in places, so we are also exploring how users can validate or correct those links, and how this feedback can be taken into account within our automated interlinking workflow. We are planning to write in more detail about our experiments in that space on the BBC R&D blog in the next couple of weeks.
LF: What are your plans going forwards?
BBC: We are currently annotating quite a lot of our content with Linked Data URIs to drive a number of aggregations on our site, but we are making little use of the connections between all these URIs. So far, we have only been using those in our automated tagging tools, to disambiguate between candidate identifiers. There is a big opportunity in using those connections for storytelling purposes -- using paths in that graph of data to help tell stories around our content. It becomes even more of an opportunity if we start describing the content of individual programmes in more detail, such as describing the narrative structure of dramas, for example. We started some investigation in that area in our Mythology Engine project, but there is much more that could be done.
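As a rough illustration of using connections between URIs to disambiguate candidate identifiers, the sketch below picks the candidate with the most links to entities already associated with a piece of content. The scoring rule, the `ex:` identifiers and the link data are all assumptions made for the example; the interview does not describe how the BBC's tagging tools actually score candidates.

```python
# Hypothetical undirected links between entity URIs, invented for illustration.
LINKS = {
    ("ex:Mercury_(planet)", "ex:Solar_System"),
    ("ex:Mercury_(planet)", "ex:Venus"),
    ("ex:Mercury_(element)", "ex:Chemistry"),
    ("ex:Venus", "ex:Solar_System"),
}


def neighbours(entity, links):
    """Return every entity directly linked to `entity`, in either direction."""
    linked = set()
    for left, right in links:
        if left == entity:
            linked.add(right)
        elif right == entity:
            linked.add(left)
    return linked


def disambiguate(candidates, context, links):
    """Pick the candidate sharing the most direct links with the context.

    Candidates are sorted first so that ties resolve deterministically.
    """
    def score(candidate):
        return len(neighbours(candidate, links) & context)

    return max(sorted(candidates), key=score)


# The word "Mercury" appears in content already tagged with two other entities.
context = {"ex:Venus", "ex:Solar_System"}
best = disambiguate(
    ["ex:Mercury_(element)", "ex:Mercury_(planet)"], context, LINKS
)
print(best)
```

The same link data could in principle also be walked as paths between entities, which is closer to the storytelling use mentioned above.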
I think there are several lessons to learn from the BBC’s experience with Semantic Web technologies:
- Embracing these technologies was an evolutionary process; it started with a general philosophy, rolled out incrementally, and ended up providing a significant strategic advantage.
- The BBC invested a great deal of energy in being able to clearly articulate the vision and the value of the Semantic Web approach on their various blogs, and in doing so sought to engage a much larger community beyond the BBC.
- Semantic Web technologies are not an end in themselves. While they play a crucial role in what the BBC has accomplished with dynamic site publishing, there are many other technologies (such as XML, Silverlight and standard HTTP) that need to come together for this application.
My thanks to Yves, Michael, and Olivier for taking the time to contribute their experiences for us all.
Editor's Note: To read the beginning of Lee's series on the Semantic Web, see The Semantic Web and the Modern Enterprise.