Solr 1.4 Offers Richer Document Indexing, Faster Search

Any day now (if not already), the newest version of Apache's Solr project hits the streets. A year in the making since the previous release, Solr 1.4, the Java-based, open source enterprise search server, offers some exciting feature and performance improvements.

Faster, Better, Stronger

It's nearly impossible to talk about Solr without talking about the Lucene search library it relies on. With the release of Lucene 2.9, Solr gets the boost of many under-the-hood enhancements through its partner in crime.

But Solr received plenty of attention for itself as well. Given how critical search is to helping users find exactly what they're looking for within a site, and to keeping them there by responding quickly as well as accurately, it's little surprise that improvements to Solr itself break down into performance and features.

Performance-based improvements to Solr 1.4 include:

  • Streamlined caching through a change to the Java class ConcurrentLRUCache, which minimizes the overhead of synchronization.
  • Scalable concurrent file access through use of the Java platform's New I/O (NIO) API to speed index file access.
  • Smarter handling of index changes through re-use of unchanged index segments.
  • Faster faceting through a new implementation of UnInvertedField for multi-value fields, in some cases performing 50 times faster.
  • Streaming updates for SolrJ through an optimized implementation of StreamingUpdateSolrServer, which is useful for indexing many documents at one time and in some cases speeds up document indexing dramatically.
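
To make the faceting item in the list above more concrete: faceting boils down to counting, for each value of a field, how many matching documents carry that value. The sketch below shows that counting step for a multi-value field in plain Java. It is only an illustration of the idea and not Solr's UnInvertedField implementation; the class name FacetCountSketch and the sample tag values are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class FacetCountSketch {

    // For a multi-value field, count how many documents carry each value.
    // Each Set<String> stands for the field values of one matching document.
    static Map<String, Integer> facetCounts(List<Set<String>> matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by value
        for (Set<String> values : matchingDocs) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical "tags" field of three documents that matched a query.
        List<Set<String>> matching = List.of(
                Set.of("search", "java"),
                Set.of("search"),
                Set.of("java", "cms"));
        System.out.println(facetCounts(matching)); // prints {cms=1, java=2, search=2}
    }
}
```

A naive loop like this touches every value of every matching document, which is the kind of cost a specialized faceting implementation aims to reduce.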

New Features

Perhaps the most exciting feature improvement to Solr 1.4 is the ability to index non-XML documents through the addition of Solr Cell, which uses the Apache Tika project to convert various documents to XHTML. Supported formats include PDF, OpenDocument (OpenOffice), Microsoft OLE 2 Compound Document (Microsoft Office), HTML, RTF, gzip, ZIP, and Java Archive (JAR) files. Solr can now also detect duplicate documents by using unique signatures, with a configurable response for how these duplicates are handled.
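
The signature-based duplicate detection mentioned above can be pictured as hashing selected fields of a document and checking whether that hash has been seen before. The following is a minimal sketch of that idea in plain Java, not Solr's own implementation. The class name SignatureDedupSketch, the use of MD5, and the choice of title and body as signature fields are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class SignatureDedupSketch {

    // Signatures of every document accepted so far.
    private final Set<String> seen = new HashSet<>();

    // Hash the chosen field values into a hex signature (MD5 is an assumption).
    static String signature(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                md.update(field.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator, so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the document is new, false if its signature was already seen.
    boolean addIfNew(String... fields) {
        return seen.add(signature(fields));
    }

    public static void main(String[] args) {
        SignatureDedupSketch index = new SignatureDedupSketch();
        System.out.println(index.addIfNew("Solr 1.4", "Release notes")); // prints true
        System.out.println(index.addIfNew("Solr 1.4", "Release notes")); // prints false
    }
}
```

What happens when addIfNew returns false is left to the caller here, for example skipping the document or overwriting the earlier copy, which mirrors the configurable handling the article describes.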

An addition that will thrill Windows administrators and those who didn't enjoy filing a ticket with IT is a much smoother index replication process. Rather than needing administrator access and rsync on a Unix box, Solr 1.4 now offers replication in its Java platform layer, so you can perform backups the same way on any Solr instance on any operating system without having to go to IT.

Another feature was inspired at Lucid Imagination, a company dedicated to commercial-grade support, training, development, and consulting based on Apache Lucene and Solr, as well as a major contributor to the project. In using Solr 1.4 on their own site for months to test and spot other areas for improvement, they noticed that their combination of Drupal for the CMS portion of the site and WordPress for the blog posed some challenges.

Solr uses a technique called faceting to group search results by fields. With a site combination such as this, multi-select faceting, where you count and group search results according to their fields, became an obvious new addition for Solr 1.4. There is a wide range of use cases where this feature adds great power to Solr search. For example, you can:

 
