The Internet and social media have drastically affected traditional news and journalism business models. But smart news-gathering organizations are striking back, and open source tools like the Apache Lucene search engine are proving invaluable.

At the recent Lucene Revolution conference in San Francisco, CMSWire interviewed Stephen Dunn, Head of Technology Strategy at British newspaper The Guardian. Dunn shared how The Guardian has reinvented content and news openness with its Apache Lucene-based Open Platform.

A New Business Model, Backed by Open Source

The newspaper industry isn’t the first market that comes to mind when most people think of sophisticated technology solutions.

However, the second-oldest newspaper in the UK is proving that the press can evolve and innovate in both technical and business realms. Using content, search, cloud computing and open source, the company has introduced a new business model: a news content platform.

The Open Platform consists of 3 tiers:

  1. Keyless: Partner takes Guardian headlines and keeps all associated revenue.
  2. Approved: Partner takes full article content with advertisements. Guardian keeps ad revenue, partner keeps rest-of-page revenue.
  3. Bespoke/Custom: Partner takes full content, reformats, augments, etc. Revenue model is negotiated.

The Secret to Success


First things first: Stephen Dunn pointed out that the key to the Open Platform project's success was that, from the beginning, it was run and perceived as a company-wide project, not a technology project hatched in the IT department. The project's vision was part and parcel of The Guardian's culture.

Dunn discussed The Guardian’s use of Apache Solr/Lucene in its Open Platform in the keynote address at the recent Lucene Revolution conference in San Francisco, and CMSWire was there to speak to him face-to-face.

Life Before Open Platform

The story of many a dated IT platform can be summed up in a word: spaghetti. Prior to implementing the search-based Open Platform, The Guardian had no consistent way to provide content to partners. The newspaper used a variety of mechanisms, ranging from RSS feeds and batch jobs to email, to provide articles and multimedia to partners.

The organization needed a common API for providing information to the website and syndication sources. The platform would need to evolve over time with new content types and metadata, and the team wanted to be able to make changes to the schema without re-indexing the content or incurring server outages. Additionally, although the website and API would consume content from the same repository, The Guardian needed to ensure API traffic would not affect website performance.
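One common way to meet that last requirement is to give each class of traffic its own pool of read-only search servers, so a surge of API requests cannot starve the website. The sketch below illustrates the idea only; the article does not say how The Guardian implemented it, and the pool names, hostnames and `pick_replica` helper are all hypothetical.

```python
from itertools import cycle

# Hypothetical replica hostnames, invented for illustration.
# Each traffic class gets its own pool, so the two never share servers.
POOLS = {
    "website": cycle(["web-replica-1", "web-replica-2"]),
    "api": cycle(["api-replica-1", "api-replica-2", "api-replica-3"]),
}


def pick_replica(traffic_class: str) -> str:
    """Return the next replica for the given traffic class, round-robin."""
    try:
        pool = POOLS[traffic_class]
    except KeyError:
        raise ValueError(f"unknown traffic class: {traffic_class!r}") from None
    return next(pool)
```

Because both pools are copies of the same repository, the website and the API see the same content while their load stays separate.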

Open Platform: NoSQL = Search?

Enticed by fast, low-cost, scalable on-demand access to a large and diverse dataset, The Guardian moved from a relational database to a document-oriented Apache Solr/Lucene search-based architecture, deployed on Amazon’s EC2. The solution boasts:

  • Full text search support
  • Support for schema changes without downtime
  • Support for multiple media types
  • Data replication, making it easier to scale horizontally across the cloud
  • Data filtering with facets, which are like SQL 'where' clauses – a facet can be anything
  • Scalability to millions of documents
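To make the full text search and facet items concrete, here is a minimal sketch of how a client might build a Solr query that combines both. The parameters `q`, `fq`, `facet`, `facet.field`, `rows` and `wt` are standard Solr request parameters; the endpoint URL and the `section` and `type` field names are assumptions made for illustration and are not The Guardian's actual schema.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(text, section=None, facet_fields=("section", "type")):
    """Build a Solr select URL with full text search, facets and a filter."""
    params = [
        ("q", text),        # full text query
        ("rows", "10"),     # number of documents to return
        ("wt", "json"),     # response format
        ("facet", "true"),  # ask Solr to compute facet counts
    ]
    # One facet.field per field: Solr returns a count for each distinct value.
    for field in facet_fields:
        params.append(("facet.field", field))
    if section is not None:
        # A filter query narrows the result set, roughly like
        # SQL's WHERE section = 'politics'. Values containing spaces
        # would need quoting, which this sketch does not handle.
        params.append(("fq", f"section:{section}"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url("election results", section="politics")
```

The facet counts tell a user which filters are available, and the `fq` parameter applies the filter once one is chosen.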

The Guardian has created an innovative news syndication platform that provides a new distribution channel for reaching new clients and markets. The Open Platform project gives the public and partners access to all articles since 1999 -- more than 1 million articles -- via a common API using open source.

But The Guardian doesn’t just provide content -- it also delivers something called a Politics API (UK-only) and can even consume content from contributors using a lightweight REST API. They go further with an app injection model, whereby partners can inject functionality into The Guardian website itself. Chalking up points with global bloggers, the newspaper has also provided a WordPress plugin, allowing WordPress users to directly integrate Guardian content feeds.


The Guardian's Open Platform -- Leveraging Apache Lucene and Solr

Started in 2009, The Guardian's Open Platform project is not new, but new features and APIs are popping up all the time. Other traditional news providers would do well to pay heed, and consider how they too can use technology -- open source or not -- to deliver content in new ways to engage a new generation of information foragers.

As for me, I think I'll coin a new buzzword and call this "News-as-a-Service" (NaaS).