Having content that no one can find, not even your own users, is useless. Every Web CMS project and company has to face this issue at some point. What to do about search?

These days, many come to exactly the same answer.

Strategic Choices

When a project team sits down to design what they're going to build or implement, they have to look at each area and decide where to reinvent the wheel and where to stand on the shoulders of giants. Which decision is right for that project depends on many different variables, such as how much time they have, what kind of budget if any, how many people they have to do the work, and their target market.

Search is one of those big areas where time, budget, and work hours have to be balanced.

Lucene Not Just for Web CMS

Projects like Alfresco knew for a fact that full-text search would be a cornerstone of their offering. Paul Holmes-Higgins, Alfresco's VP of Engineering, says that five years ago they also knew they had to get something out as quickly as possible. So they started looking into which existing technologies would fit well into their planned top-to-bottom Java stack.

Important use cases were fleshed out. Both proprietary and open source options were tested, with the thought that they could always acquire a closed source engine and then open source it. Ultimately they felt they really only had one option, and that was Lucene, especially since its license was so flexible.

Of course, Lucene wasn't perfect. It didn't yet offer some options they needed. So, rather than spending work hours building an entire search solution from scratch, they extended Lucene with features such as letting the search server be relaxed about transactions when indexing a site, rather than having to remain in lockstep. Such a feature was important for scalability.
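The article doesn't say how Alfresco built this, so the following is only a minimal sketch of one way to read "relaxed about transactions": content commits succeed immediately, and a separate indexer catches up afterward. The ContentStore and LaggingIndexer classes and their methods are hypothetical names invented for illustration, not Alfresco or Lucene APIs.

```python
from collections import deque


class ContentStore:
    """Hypothetical repository: commits succeed without waiting on the index."""

    def __init__(self):
        self.documents = {}
        self.pending = deque()  # doc ids committed but not yet indexed

    def commit(self, doc_id, text):
        self.documents[doc_id] = text
        self.pending.append(doc_id)  # record the change; do not index inline


class LaggingIndexer:
    """Hypothetical indexer that catches up on pending changes later."""

    def __init__(self, store):
        self.store = store
        self.terms = {}  # term -> set of doc ids

    def catch_up(self):
        # Drain the queue of committed-but-unindexed documents.
        processed = 0
        while self.store.pending:
            doc_id = self.store.pending.popleft()
            for term in self.store.documents[doc_id].lower().split():
                self.terms.setdefault(term, set()).add(doc_id)
            processed += 1
        return processed

    def search(self, term):
        return self.terms.get(term.lower(), set())


store = ContentStore()
indexer = LaggingIndexer(store)
store.commit("a", "Press release draft")
store.commit("b", "Release notes")

print(indexer.search("release"))          # set(): committed, not yet indexed
indexer.catch_up()
print(sorted(indexer.search("release")))  # ['a', 'b']
```

The trade-off in a design like this is that search results can briefly trail the content, in exchange for writes that never block on the search server.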

When asked what he would choose today, Holmes-Higgins says that the answer would be the same. Lucene is so dominant that no one has created a viable alternative. Only specialized academic search projects are happening in that space. He also confesses that in their five years of existence, they've never considered switching. They just haven't needed to.

Many Vendors Reaping Apache Benefits

Grant Ingersoll, a member of Lucid Imagination's founding technical team and a Lucene committer, says that Drupal and TYPO3 also use Lucene, among many more, some in the form of Apache Jackrabbit.

In his own (unbiased, of course) opinion, Lucene has become so dominant for these reasons:

  • It's very stable
  • It has a proven search model
  • You can roll it out quickly
  • The APIs are easy to use, and made even easier with Solr
  • It has a strong and active community

Free is of course good as well, but Ingersoll says he's found that flexibility and being "white box" have turned out to be more important to those choosing Lucene instead of rolling their own or finding something else.

Pick Your Favorite Lucene

Another point in Lucene's favor is the flexibility of having a few flavors. For example, the Apache Software Foundation's Lucene.Net project is a port of the Java search engine to C# and .NET. PyLucene gives Python developers access to Lucene functionality via a Python wrapper around Java Lucene. And the Lucy project aims to deliver a "loose" port of Lucene to C, with both Perl and Ruby bindings.

Solr packages Java Lucene as a ready-to-integrate enterprise search platform. Its features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling and much more.
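To make two of those features concrete, here is a toy sketch of full-text search and faceted search over an in-memory inverted index. It is not how Lucene or Solr is implemented and uses none of their APIs. The TinyIndex class, its methods, and the sample documents are all made up for illustration.

```python
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy in-memory inverted index; an illustration, not Lucene's design."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.metadata = {}                # doc id -> metadata fields

    @staticmethod
    def tokenize(text):
        # Lowercase, then keep runs of letters and digits as terms.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text, **fields):
        # Adding a document only touches the postings of its own terms.
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)
        self.metadata[doc_id] = fields

    def search(self, query):
        # AND semantics: a hit must contain every query term.
        terms = self.tokenize(query)
        if not terms:
            return set()
        hits = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self.postings.get(term, set())
        return hits

    def facets(self, hits, field):
        # Count how many hits carry each value of a metadata field.
        return Counter(
            self.metadata[d][field] for d in hits if field in self.metadata[d]
        )


index = TinyIndex()
index.add(1, "Quarterly sales report", format="pdf")
index.add(2, "Sales meeting notes", format="word")
index.add(3, "Annual report on sales", format="pdf")

hits = index.search("sales report")
print(sorted(hits))                   # [1, 3]
print(index.facets(hits, "format"))   # Counter({'pdf': 2})
```

A real engine adds relevance ranking, analyzers, persistent index segments and much more. The point here is only that searching document text and counting hits by metadata value both fall out of the same index structure.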

Content's always changing. I mean, besides the obvious, where you're editing what content you have and adding more, the formats change. New ones are introduced, whether to the world or to your particular content management system. If your search engine can only index the file names, you're at risk of missing out on the power of searching across your content all at once.

Essentially, Lucene has made core search technology a commodity. Why roll your own when you've got great search speed, incremental indexing, metadata faceting and more for free on a solid platform?

Some do, but increasingly these days, many don't.