aqua colored door surrounded by haphazard stacks of books
PHOTO: Eugenio Mazzone

Years ago, corporations used open source software only on rare occasions. Open source carried a certain stigma. The software was complex and poorly documented, and it was covered by what a friend of mine called “Bob’s warranty”: if you had the code, you were a big boy and often on your own. While many open source projects provided great capabilities, really taking advantage of them meant manually integrating a variety of different projects to do the heavy lifting that made the full product functional and reliable.

Back then, commercial software was considered the safe way to go, because you could count on a predictable product with full support for a known, if pricey, cost. And no one got fired for selecting a commercial product. There were, and still are, a number of vendors that deliver fully proprietary search platforms with excellent support, including Microsoft SharePoint; Micro Focus, which offers Autonomy and owns the rights to the older Verity and Ultraseek platforms; IBM Watson, based on the Vivisimo platform; Recommind from OpenText; and others.

Many open source technologies are available on the web, but the Apache Software Foundation (ASF) stands out as the group behind more than 300 free, open source tools. The ASF projects of most interest to search nerds include the search tools Lucene and Solr; the web crawler Nutch; and the machine learning (ML) tools Spark (with its MLlib library) and, to a lesser degree, Mahout.

Open Source Makes Its Way Into the Enterprise

We are now seeing a new trend in the industry: product vendors integrating selected open source projects to deliver specific capabilities, alongside a licensed proprietary commercial platform that delivers the user experience. This lets software developers use the best open source technology under the hood for specific functions, along with a smaller set of their own code that delivers the user experience and related capabilities.

Consider one of the leaders in the search field, Coveo. Its products have traditionally been proprietary from deep in the kernel all the way up to the user interface and experience. Not too long ago, Coveo announced the option of running its user interface on top of open source Elasticsearch, which is based on Apache Lucene. With full support from Coveo, Elasticsearch delivers the core functionality, while Coveo adds the value of a powerful crawler, rich management capabilities, an administrative console and a polished user experience.

Likewise, the Lucidworks Fusion product is built on several open source projects, including the Solr search platform and Spark. As with Coveo, Lucidworks has included a powerful, flexible and proprietary crawler, document-level security and a high-level console for managing everything.

These hybrid platforms offer not only the technology of the open source world, but also the support, training and active communities that can make it easier to find help beyond the vendor community.

Dipping Your Toe in the Open Source Search Water

Forward-thinking technology companies are taking advantage of open source tools to provide better search technologies — whether through an integration already completed by a commercial vendor, as in the case of Coveo and Lucidworks, or through a direct download and integration in house.

Solr and, to a lesser extent, Elasticsearch are free for anyone to download and use. Solr is fully open source, but not all versions of Elasticsearch are available under an open source license.

Getting Apache Lucene and Solr running can be a challenge, but help is easy to find. Downloads and documentation for both platforms are available on the Apache Software Foundation site, YouTube has thousands of video tutorials on installing Solr, and other sites offer tutorials that walk you through the process. One caveat: Lucene and Solr have been around for a while, so be sure the material you are looking at is specific to the version of Lucene or Solr you are attempting to install.

But installation is only the first step: a number of related tools and products integrate with search — and these days they are the ones bringing machine learning into the equation.

MLlib and Mahout Bring Machine Learning Into Search

Beyond Lucene and Solr, a number of Apache tools integrate with search and expand its capabilities. Two mentioned earlier can be used to add machine learning: Apache Spark, which includes the MLlib machine learning component, and Mahout, an older but quite capable tool that also provides machine learning components.

Apache Spark is a collection of several different capabilities, including SQL queries, streaming analytics and a graph processing framework. A number of commercial products embed Spark specifically because of MLlib, which provides very fast in-memory ML. And while search platforms that integrate Spark make the connection with minimal effort and no custom programming, the fact that Spark and MLlib can be accessed from a number of popular languages, including Scala, Python and R, is often a plus.

Spark and MLlib are newer and generally faster than older tools, which is one reason companies choose Spark over other solutions. However, Mahout can also deliver ML capabilities in Lucene and Solr. Mahout is used primarily for collaborative filtering, clustering and classification. It's often a bit slower than MLlib when it comes to classification, but its additional capabilities can be useful.

If you've used internet search platforms like Google, Bing or DuckDuckGo, you’ve experienced machine learning at work in search, so you're already familiar with some of the capabilities ML can add to a public search platform. Now let's see how it would apply in the enterprise.

Signals and Pattern Detection

When you search for a term and don’t click any of the results, the top search platforms observe the next queries you use, track which items you click on and then associate the terms. The assumption is that if the first results were not of interest, but you click on a result from a subsequent search, the two query terms are related. This also applies to synonyms and similar terms: machine learning doesn't make big changes, but over time it can "learn" related query terms through observation and refine the results it gives.
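
To make the idea concrete, here is a minimal sketch in plain Python of how such a signal could be mined from query logs. It is an illustration only, not how any particular search platform does it: the session log format, the function name related_queries and the min_support threshold are all assumptions made up for this example.

```python
from collections import Counter, defaultdict


def related_queries(session_log, min_support=2):
    """Link a query that earned no click to the follow-up query that did.

    session_log: list of sessions; each session is an ordered list of
    (query, clicked_doc_or_None) tuples. This log format is hypothetical.
    """
    pair_counts = Counter()
    for session in session_log:
        # Walk consecutive pairs of searches within one session.
        for (query, clicked), (next_query, next_clicked) in zip(session, session[1:]):
            # Signal: no click on the first query, a click on the follow-up.
            if clicked is None and next_clicked is not None and query != next_query:
                pair_counts[(query, next_query)] += 1

    related = defaultdict(list)
    for (query, next_query), count in pair_counts.most_common():
        # Require repeated evidence so one odd session does not create a link.
        if count >= min_support:
            related[query].append((next_query, count))
    return dict(related)


# Made-up sessions for illustration.
sessions = [
    [("pto policy", None), ("vacation policy", "doc-17")],
    [("pto policy", None), ("vacation policy", "doc-17")],
    [("pto policy", None), ("holiday calendar", "doc-42")],
    [("expense report", "doc-3")],
]

print(related_queries(sessions))
# {'pto policy': [('vacation policy', 2)]}
```

Requiring at least two supporting sessions mirrors the point that machine learning doesn't make big changes: a single unusual session is ignored, and an association only appears once the same pattern repeats.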

Open Source Gets a Foot in the Enterprise Door

We've come a long way from the days when open source was a pariah in the enterprise. As the lines between proprietary commercial platforms and open source software become increasingly blurred across the board, but especially in the search realm, open source software will increasingly be found powering the back end of search. 

And as businesses become increasingly enthralled with the promise of machine learning as the ultimate solution for search that sucks, this trend will only continue. But a word of caution on machine learning: it’s not magic. Machine learning, whether implemented with Mahout, Spark, or some other technology, takes time to "learn" the best answer for each person executing the query. 

Fortunately, just as ML has the capability to learn which query results bring in the most views after a search, it also has the ability to evaluate and use data about the user, information that every enterprise has readily available. Do people in marketing search for different content than the engineering team? Are employees in France looking at different content than employees in New York? So in the same way that ML can learn query terms and results over time, it can also build "user profiles" that let the user become just one more data point in generating a quality result list. It’s not perfect, but remember, it’s new.
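
As a rough sketch of what "the user as one more data point" could look like, the plain Python below builds a click profile per department and uses it to nudge result scores. Everything here is an assumption for illustration: the click log, the category labels, the build_profiles and rerank names and the weight value are hypothetical, and a real deployment would use the signals and boosting features of its own search platform.

```python
from collections import Counter, defaultdict


def build_profiles(click_log):
    """Count which content categories each user group clicks on.

    click_log: iterable of (group, category) pairs (hypothetical format).
    """
    profiles = defaultdict(Counter)
    for group, category in click_log:
        profiles[group][category] += 1
    return profiles


def rerank(results, group, profiles, weight=0.5):
    """Boost each base score by the group's share of clicks in that category.

    results: list of (doc_id, category, base_score) tuples from the engine.
    """
    clicks = profiles.get(group, Counter())
    total = sum(clicks.values())

    def boosted(result):
        _doc_id, category, base_score = result
        # Groups with no history get no boost, so the base order is kept.
        share = clicks[category] / total if total else 0.0
        return base_score * (1.0 + weight * share)

    return sorted(results, key=boosted, reverse=True)


# Made-up click history and result list for illustration.
click_log = [
    ("marketing", "brand"), ("marketing", "brand"), ("marketing", "pricing"),
    ("engineering", "api"), ("engineering", "api"),
]
results = [
    ("api-guide", "api", 1.0),
    ("brand-kit", "brand", 0.9),
    ("price-list", "pricing", 0.8),
]

profiles = build_profiles(click_log)
print([doc for doc, _, _ in rerank(results, "marketing", profiles)])
# ['brand-kit', 'api-guide', 'price-list']
print([doc for doc, _, _ in rerank(results, "engineering", profiles)])
# ['api-guide', 'brand-kit', 'price-list']
```

The same query returns a different order for marketing than for engineering, while a group with no click history simply gets the engine's original ranking.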

Any good — or bad — stories on using open source or specifically open source machine learning in search? I'd love to hear them, so leave them in the comments below.