Apache Lucene 2.9.0 Offers Performance Optimization

Dee-Ann LeBlanc

Apache Lucene, the popular open source search engine, is now available in version 2.9.0. This version is the last minor release before the big step to Lucene 3.0, but the developers didn't hold back all of their improvements for later.

The Quick Breakdown

The last public release of Lucene was 2.4.1. Version 2.9.0 improves upon its predecessor with:

  • Per segment searching and caching
  • Near real-time search with IndexWriter
  • New Query types
  • Smarter and more scalable multi-term queries
  • An optimized Collector/Scorer API
  • Improved Unicode support
  • The addition of Collation contrib
  • A new Attribute-based TokenStream API
  • A new QueryParser framework in contrib
  • A core QueryParser replacement implementation
  • Optional scoring when sorting by Field or using a custom Collector
  • New analyzers
  • New fast-vector-highlighter for large documents
  • High-performance handling of numeric fields

The Lucene team warns that backward compatibility changes should be expected. Review the change in policy on this topic to see how your code might be affected. To make sure such issues don't catch you by surprise, the Lucene team suggests recompiling your application against 2.9 rather than just dropping the new version in, so that any resulting errors show up at compile time.

For a full list of what's different in 2.9.0, see the changes document.

An Example: What These Changes Mean

According to Grant Ingersoll, an Apache Lucene/Solr committer and co-founder of Lucid Imagination, in general "most applications in most situations" should be faster in 2.9.0 than they were in 2.4.1. Still, he recommends testing rather than just taking his word for it in case you're out on the tails of the bell curve.

One example of a 2.9.0 change that can greatly speed performance is the ability to now cache and search per segment. A segment in Lucene stores information about terms, positions, what fields and documents are stored and so on. Each segment is a sub-index, with an index consisting of a collection of segments (typically the segments are files within an index directory).
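To make the segment idea concrete, here is a minimal Python sketch of an index modeled as a list of sub-indexes. It is a toy model only, not Lucene's API: the `Segment` class, the `search_index` function and the sample data are all hypothetical names invented for illustration. It shows how a search can visit each segment in turn and shift segment-local document IDs into index-wide ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A toy sub-index: maps each term to the local IDs of documents containing it."""
    name: str
    doc_count: int
    postings: dict  # term -> list of segment-local document IDs

    def search(self, term):
        return self.postings.get(term, [])


def search_index(segments, term):
    """Search every segment in turn and translate local IDs to index-wide IDs."""
    hits = []
    doc_base = 0  # index-wide ID of the current segment's first document
    for segment in segments:
        hits.extend(doc_base + local_id for local_id in segment.search(term))
        doc_base += segment.doc_count
    return hits


# Hypothetical sample data: two segments holding three and two documents.
segments = [
    Segment("seg_0", 3, {"lucene": [0, 2], "search": [1]}),
    Segment("seg_1", 2, {"lucene": [1], "index": [0]}),
]
print(search_index(segments, "lucene"))  # [0, 2, 4]
```

Because each segment answers queries on its own, work done for one segment does not need to be redone when another segment is added.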

Segments are written incrementally, and in most applications existing segments rarely change once they are written. Lucene 2.9.0 takes advantage of this fact by doing more of its work per segment rather than per index. One place this speeds things up is the FieldCache, which holds field information such as the terms the fields contain and is typically used for tasks such as sorting.

The new version of Lucene handles the FieldCache on a per-segment basis. Because of this change, Lucene 2.9.0 is smarter about when it looks for changes to the FieldCache contents: anything that changed in your index is most likely a new segment rather than a modification to an existing one, so cached data for unchanged segments can be reused.
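The effect of per-segment caching can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Lucene's FieldCache implementation: `SegmentFieldCache`, `sorted_values`, the `STORED` data and the segment names are hypothetical. The cache is keyed by segment and field, so when the index grows by one segment only that segment's values are loaded.

```python
class SegmentFieldCache:
    """Toy cache keyed by (segment name, field); counts how often values are loaded."""

    def __init__(self, load_field_values):
        self._load = load_field_values  # hypothetical loader: (segment, field) -> values
        self._cache = {}
        self.loads = 0

    def get(self, segment, field):
        key = (segment, field)
        if key not in self._cache:
            self._cache[key] = self._load(segment, field)
            self.loads += 1
        return self._cache[key]


# Hypothetical stored data: segment name -> field -> one value per document.
STORED = {
    "seg_0": {"price": [30, 10]},
    "seg_1": {"price": [20]},
    "seg_2": {"price": [5, 40]},
}

cache = SegmentFieldCache(lambda segment, field: STORED[segment][field])


def sorted_values(segment_names, field):
    """Gather a field's values from every segment via the cache, then sort them."""
    values = []
    for name in segment_names:
        values.extend(cache.get(name, field))
    return sorted(values)


# First search: both existing segments are loaded into the cache.
first = sorted_values(["seg_0", "seg_1"], "price")
loads_after_first = cache.loads

# New documents arrive as seg_2; the older segments are unchanged.
second = sorted_values(["seg_0", "seg_1", "seg_2"], "price")
loads_after_second = cache.loads

print(first, loads_after_first)    # [10, 20, 30] 2
print(second, loads_after_second)  # [5, 10, 20, 30, 40] 3
```

A cache keyed by the whole index would have reloaded all three segments on the second search; keyed per segment, only `seg_2` costs a load.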

For more details on what these changes mean to you, read the overview document and/or watch the Apache Lucene 2.9 webinar.