Which search platform is the best for you and your business?
This question pops up repeatedly on LinkedIn and other discussion boards as well as with clients. Tasked with making content available for internal users and external customers, IT and business staff in companies large and small don’t want to make a mistake.
It’s often an expensive one, and the wrong choice could be a career-limiting decision.
A Look at the Search Market
In the world of commercial products, Google, Microsoft and Coveo come to mind. In the open source-based universe, Apache Lucene and Apache Solr lead the pack, and in fact power the top open-source-based commercial products, among them Elasticsearch, Lucidworks’ Fusion and Attivio’s Active Intelligence Engine.
Let’s take a look at the options.
Google Search Appliance
The big yellow Google Search Appliance boxes became popular in corporations because they came from Google, and the magic, mystery and popularity of the public Google search site made the GSA the clear choice for searching intranet and public-facing web content.
Unfortunately, the enormous context and activity that make Google.com so successful were not available to the GSA, but it provided reasonable search.
But in February, Google announced the "end of life" for the GSA. Since the GSA depends on a Google-provided software license file, when your license expires, your GSA becomes a nice bright yellow Dell server.
Another popular option is Microsoft SharePoint.
Microsoft acquired FAST Search in 2008 and shifted the technology, the intellectual property and other aspects of the FAST product into SharePoint search.
SharePoint search may not be the best bet for heavy-duty search applications, but within the SharePoint universe, it’s a pretty decent search platform and is tightly integrated with SharePoint.
Coveo is in the unique position of providing a powerful Windows-based search platform that combines enterprise-wide search and predictive analytics to bring together all corporate content in what it calls the “Intelligent Workplace.”
With connectors to a large number of enterprise repositories and full document-level security, Coveo is positioning itself for the enterprise as a whole, where many other platforms are targeted more at diverse repositories.
Attivio has wrapped a number of search-powered capabilities around a Lucene/Solr base and markets its product as an "active intelligence engine." It recently positioned the technology as a Data Unification product.
Attivio, which has a number of former FAST employees, appears to be successful, but I don’t see the company often in the enterprise search area.
In the open source world, the Apache Software Foundation is home to the Lucene search project, which in turn is parent to the Apache Solr project.
The Lucene project is essentially a low-level search API that can be integrated into applications that need search capability. This includes programs like Content Management Systems or other document repositories where content is easier to find using strong search capabilities.
Because Lucene is an API, applications that use it are responsible for indexing content as well as for searching and displaying results. If a new document is added to, or removed from, a repository, you are responsible for updating the index. And since your application is responsible for maintaining the indices, if your application crashes, the index doesn’t get updated.
If you need LDAP (Lightweight Directory Access Protocol), Active Directory or other document level security, your application is responsible for filtering results: by default, Lucene does not handle security filtering on its own.
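To make the filtering responsibility concrete, here is a minimal sketch of post-query security trimming. It does not use Lucene itself: the `Hit` class, the `allowed_groups` field and the `filter_by_acl` function are hypothetical names invented for illustration, and the user's groups are assumed to have been looked up already in LDAP or Active Directory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hit:
    """One search result plus the groups allowed to read it (hypothetical shape)."""
    doc_id: str
    score: float
    allowed_groups: frozenset = field(default_factory=frozenset)


def filter_by_acl(hits, user_groups):
    """Keep only the hits that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [hit for hit in hits if hit.allowed_groups & user_groups]


# Results as a search API might return them, before any security filtering.
hits = [
    Hit("hr-policy", 2.4, frozenset({"hr", "managers"})),
    Hit("lunch-menu", 1.9, frozenset({"everyone"})),
    Hit("salary-bands", 1.1, frozenset({"hr"})),
]

# Group memberships would normally come from the directory service.
visible = filter_by_acl(hits, {"everyone", "managers"})
print([hit.doc_id for hit in visible])
```

Filtering after the query is the simplest approach, but it can leave short result pages when many hits are hidden; applications often push the group restriction into the query itself instead.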
Many solutions are out there for these and other shortcomings of stand-alone Lucene applications. And many organizations appear willing to assume responsibility for integrating the required capabilities to save money, at least on licensing costs if not on staffing.
Lucene, as an Apache project, is open source and free. Like most open source projects, anyone can download and modify the source code, and submit suggested changes back to the project to be evaluated for inclusion in future releases.
As with other Apache (and other open source) projects, a team of developers known as ‘committers’ considers submissions, maintains the code, fixes known bugs and adds new features. Anyone who shows interest and competence in writing and fixing code can potentially be invited to become a committer.
Elasticsearch, a product from Elastic.co, is another open source product available as a free download under the Apache License. Elastic accepts code fixes and enhancement submissions from others, but it appears that only employees of Elastic are permitted to be committers.
Some supporting capabilities such as the browser-based analytics and search dashboard Kibana and ingestion tools Logstash and Beats are also open source, and part of the Elastic Stack. However, many of its most powerful supporting features — security (Shield), monitoring (Marvel), alerting (Watcher), graph and reporting — are proprietary and available only for licensing in X-Pack.
Solr, a spin-off from Lucene and a sub-project under the Apache Lucene Project, is a search server. As a server, Solr is always running, includes the Lucene API, and performs tasks for applications that need search capabilities.
Solr has grown to be quite complete and complex. It offers massive scalability, failover and redundancy through SolrCloud, which integrates with another Apache project called ZooKeeper. Search managers can add, remove or update content using simple HTTP commands sent to the server's port.
And Solr comes with an easy to understand and manage console app which affords a high-level view into Solr operations. Solr can be configured to continuously update the search index as content is added, updated or removed, and it is not dependent on your application for scalability or updates. It also supports frameworks to provide document-level security.
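As a rough illustration of those commands, the sketch below builds add and delete requests for Solr's JSON update handler using only the Python standard library. The host, port and collection name (`localhost:8983` and `docs`) are assumptions, as are the helper names; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Assumed local Solr instance with a collection named "docs".
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_update_request(payload, url=SOLR_UPDATE_URL):
    """Wrap a JSON payload in a POST request for the update handler."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def add_documents(docs):
    """Build a request that adds (or replaces) a list of documents."""
    return build_update_request(docs)


def delete_by_id(doc_id):
    """Build a request that deletes a single document by its id."""
    return build_update_request({"delete": {"id": doc_id}})


add_req = add_documents([{"id": "doc-1", "title": "Quarterly report"}])
del_req = delete_by_id("doc-1")
print(add_req.get_method(), add_req.full_url)
print(del_req.data.decode("utf-8"))
```

Against a running Solr server, `urllib.request.urlopen(add_req)` would submit the document and commit it in one step; in production a commit on every request is usually avoided in favor of batched or automatic commits.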
A number of open source projects work with Solr to crawl websites and other repositories, and to support ‘binary’ document formats such as Microsoft Office and PDF documents.
Solr and its supporting tools do require maintenance and attention, but since the core applications are open source and free, many companies opt to invest in labor to install and maintain the search engine along with the search application that users work with to find content.
Lucidworks, which employs a large number of the Lucene/Solr committers, also markets a commercial search platform called Fusion.
Fusion is what I consider an “enterprise search” solution. While the underlying index is based on Apache Solr, Fusion adds:
- a powerful framework for ingesting content including documented indexing and query pipelines
- document level security with LDAP and Active Directory as well as a Kerberos authentication capability
- a powerful and easy-to-use crawler (ANDA) for indexing content
- document filters to ingest Word and other document formats and
- a connector framework for accessing content in dozens of the most popular enterprise repositories.
These tools securely index and search extremely large content repositories, and provide flexibility to maximize user capabilities. Finally, Fusion comes with Spark to provide the capability to define and utilize "signals," such as purchases, document views or downloads, to tune relevancy automatically. (In the interest of full disclosure, I was an employee of Lucidworks for two years prior to the introduction of Fusion.)
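To show the general idea behind signals, here is a small sketch that aggregates user events per query and document and uses them to boost search scores. It is not how Fusion implements signals: the weights, the logarithmic boost formula and the function names are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

# Assumed relative importance of each signal type.
SIGNAL_WEIGHTS = {"view": 1.0, "download": 2.0, "purchase": 5.0}


def aggregate_signals(events):
    """Sum weighted signals for each (query, doc_id) pair."""
    totals = defaultdict(float)
    for query, doc_id, signal_type in events:
        totals[(query, doc_id)] += SIGNAL_WEIGHTS.get(signal_type, 0.0)
    return totals


def rerank(query, results, totals):
    """Multiply each base score by a damped boost, then sort best first."""
    boosted = [
        (doc_id, score * (1.0 + math.log1p(totals.get((query, doc_id), 0.0))))
        for doc_id, score in results
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


# Each event is (query text, document id, signal type).
events = [
    ("laptop", "doc-b", "purchase"),
    ("laptop", "doc-b", "view"),
]
totals = aggregate_signals(events)

# Base results as (doc_id, score) pairs from the search engine.
results = [("doc-a", 1.0), ("doc-b", 0.8)]
print([doc_id for doc_id, _ in rerank("laptop", results, totals)])
```

The logarithm keeps a heavily clicked document from swamping the engine's own relevance score; without any signals for a query, the original order is preserved.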
Which Search Platform Should You Choose?
The question evokes discussions, heated debates and even Hatfield-McCoy-like feuds among search experts.
But there is a clear answer, though it might not be the one you want to hear.
As with many other software decisions, search, particularly enterprise-wide search, is not “one size fits all.”
You need to consider your requirements including:
- the types and formats of the documents you need to search
- whether (and how) documents are secured
- which document formats you use and how often they change
- how long or big your average documents are
- how many documents you have
- where your content lives: SharePoint, file systems, a DBMS?
- how skilled your users are: library scientists or average web visitors?
- what kind of content you have: Office documents, log files?
- what reporting capabilities you need
Next month I’ll describe how you can go about answering these and other questions, and how you can figure out which platform is best for you. If you have specific questions, let me know.