It wasn't too long ago that open source software had a bad rep among Fortune 500 companies, and that reputation trickled down to smaller “Fortune 5000” companies and even smaller firms. They were not willing to risk betting the company on technology the big guys wouldn’t touch.
Times have changed. Now, I’d wager virtually every Fortune 500 and Fortune 5000 firm has open source technologies in active use. Open source, and companies based on open source technologies, are enjoying the trend. I thought I’d walk you through the technologies we see in our search practice, but first I have to mention a shift in terminology we’ve seen.
What's in a Name? From Enterprise Search to Insight Engines
Enterprise search emerged as a term to distinguish commercial and open source search technologies used inside an organization from web search platforms. Internet-wide search platforms in the early days included companies like Infoseek, Lycos, AltaVista and many more, leading up to the arrival of Google in the late 1990s.
Vendor products intended for search inside the firewall started to appear a while later, in the late '90s and early aughts, and included Ultraseek, Fulcrum Technologies, Verity, Vivisimo and more.
The use of the enterprise search term continued until quite recently, even in reference to ecommerce search products. In fact, it’s so ubiquitous it’s hard to determine which company first applied the term. My guess is it was someone like Gartner, which would have adopted it for its annual Magic Quadrant report in order to differentiate it from web search. But over the last few years, enterprise search has morphed into insight engines.
What is the difference between "enterprise search" and "insight engines"? My inner skeptic believes the former name suffered partly because the technologies were difficult to implement well, but also because vendors began to bundle additional technologies to integrate with and extend the search capabilities. Whether a vendor licensed commercial technologies to provide a feature or provided integrated open source technologies didn’t really matter, because support was provided by the vendor.
These additions include open source technologies such as Apache Spark, which provide machine learning aimed at enhancing the quality of the results presented to users.
Gartner has changed the name of its Enterprise Search Magic Quadrant to “Insight Engines,” which include many sophisticated features that integrate nicely with search. What do these capabilities provide, and what additional responsibilities fall on the insight engine team? As with enterprise search, a fair amount of effort is required if you want it to work well.
What capabilities are the vendors adding to these insight engines? Most will include the more advanced capabilities of today’s popular platforms: good search, ideally tunable; an excellent spider/crawler; manual and automated ‘best bets’; activity reporting; and security. But the new capability users expect is a search experience like Google’s, and for that, machine learning is the technology that delivers the magic.
A Round Up of the Usual Open Source Suspects
The Apache Project
It’s fair to say that without the Apache Software Foundation (ASF), our world would be significantly different.
ASF is the organization that oversees a broad range of open source technologies, from Ant to ZooKeeper. I suspect ASF projects are in use around the world, from software start-ups and banks to government agencies and law firms.
ASF boasts over 300 active projects, and easily a dozen of them are important to enterprise search today. Let’s take a look at a few of the more critical ones.
Lucene and Solr
Technically, these are distinct projects, but several committers are involved with both search technologies.
Think of Lucene as a low-level API that provides quick, relevant results. Solr, implemented on Lucene, is a search server: your application submits a query to Solr, which, through the underlying Lucene API, performs the query and returns the results.
Lucene is a critical component of Solr. A number of commercial products including Elasticsearch and Lucidworks Fusion use Lucene and Solr respectively.
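To make that request flow concrete, here is a minimal Python sketch of how an application might build a query for Solr's HTTP search handler. The host, port and the collection name `intranet` are assumptions chosen for illustration; `q`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed values for illustration only: a local Solr instance and a
# hypothetical collection named "intranet".
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "intranet"


def build_select_url(query, rows=10):
    """Build the URL for a query against Solr's /select request handler."""
    params = {
        "q": query,    # the user's query string
        "rows": rows,  # maximum number of results to return
        "wt": "json",  # ask Solr for a JSON response
    }
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url("vacation schedule"))
```

The application would then fetch that URL with any HTTP client and read the matching documents from the JSON response. Solr hands the actual index lookup to Lucene.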
Spark
Lucene and Solr are the tools we work with most frequently, but increasingly Apache Spark is used to apply machine learning that creates more relevant results based on user/visitor behavior. Essentially, Spark “learns” which documents are most often viewed for given queries, “watching” user behavior to enhance result quality for subsequent queries.
When combined with search, this means that queries for “vacation schedule” will, over time, display the page that includes the holiday schedule, because that is the page people open most often after that query.
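The idea of learning from clicks can be sketched in a few lines of plain Python. This is a toy illustration and not Spark's API: the `ClickModel` class, its methods and the page paths are all hypothetical, and a production system would run this kind of aggregation over far larger click logs.

```python
from collections import Counter, defaultdict


class ClickModel:
    """Toy model: remember which page users open after each query."""

    def __init__(self):
        # query -> Counter of {page: number of clicks}
        self.clicks = defaultdict(Counter)

    def record_click(self, query, page):
        self.clicks[query.lower()][page] += 1

    def rerank(self, query, results):
        """Move frequently clicked pages toward the top of the results.

        Pages with equal click counts keep their original order because
        Python's sort is stable.
        """
        counts = self.clicks.get(query.lower(), Counter())
        return sorted(results, key=lambda page: -counts[page])


if __name__ == "__main__":
    model = ClickModel()
    for _ in range(3):
        model.record_click("vacation schedule", "/hr/holiday-schedule")
    model.record_click("vacation schedule", "/hr/time-off-policy")

    results = ["/hr/time-off-policy", "/news/summer-party", "/hr/holiday-schedule"]
    print(model.rerank("vacation schedule", results))
```

In this sketch the holiday schedule page moves to the top for “vacation schedule” because it has the most recorded clicks, while a query with no click history is returned in its original order.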
What happens when that page changes? Consider product capabilities that change over time: how does the change get integrated into your search platform?
Fortunately, Spark includes what is essentially a rules engine that allows an administrator to alter the Spark-learned content. It requires some human interaction, but is worth the effort.
Hadoop
A key element of big data applications, Hadoop distributes large data sets, such as search indices, across multiple servers. Hadoop has four primary components:
- Hadoop Distributed File System (HDFS): A distributed file system that provides better data throughput than conventional file systems along with high fault tolerance.
- MapReduce: A framework that maps input data into key-value pairs, then processes (‘reduces’) those pairs to aggregate them and provide the desired result.
- Yet Another Resource Negotiator (YARN): A tool that manages and monitors the cluster's nodes and resources, and schedules jobs and tasks.
- Hadoop Common: A set of low-level Java libraries that are shared across the other three modules.
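The map and reduce steps described in the list above can be illustrated with a word count, the customary first MapReduce example. This is a single-machine sketch in plain Python, not Hadoop's Java API: the function names are hypothetical, and on a real cluster the framework would run the map and reduce phases in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["open source search", "open source big data"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across servers, which is what makes the model a good fit for large data sets.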
Tomcat and HTTP Server
Apache offers two web server tools, Tomcat and the older HTTP Server. As with other technologies, enterprise search often relies on web servers for administrative and user interfaces. A number of search and big data tools, not to mention web content, rely on these servers to present content to users.
ZooKeeper
This Apache tool coordinates communication among many of the other Apache tools. At some point in its history, the collection of Apache projects became known, presumably among the committers and other savvy users, as the Apache Zoo.
It was only logical, and somewhat whimsical, to call the tool that coordinates communications among the various Apache projects “ZooKeeper,” and the name has stuck to this day.
Apache Fuels the Search World
These key elements of the Apache Software Foundation enable the powerful search capabilities many of us use daily. Without these technologies, and the capabilities they deliver, many commercial and open source search and big data apps would not be as powerful and useful as they are today.