"Someone told me it’s all happening at the zoo ..."
If you're old enough to remember Simon and Garfunkel, you may appreciate that 48 years ago this month, their song “At the Zoo” entered the Billboard Top 10. For the rest of you — the majority I'd expect — you probably think of the Apache Software Foundation (ASF). I'd suggest that for all of us, ASF has improved our lives.
A Trip to the Zoo
In 1995, the Apache Group, the developers who would go on to form ASF in 1999, launched one of the earliest web servers for the new World Wide Web -- the Apache HTTP Server. The license they selected was open and developer friendly and, together with the Mosaic web browser from the University of Illinois, was probably one reason the World Wide Web caught on so quickly. It's a shame the HAL 9000 didn't enjoy the same level of success.
Nonetheless, technical or not, nearly everyone in the world relies on an Apache project every day. In fact, ASF-based projects are so prevalent, and generally work so well together, that sometimes people refer to the entire suite of projects, when brought together to solve a problem, as the "Apache Zoo."
How so? Let me name a few of the projects in the zoo that touch our lives every day:
Perhaps the best-known web server in the world is the Apache Web Server, which probably drives more WWW sites than any other single web server. It is certainly the most popular on non-Microsoft platforms.
If you need search with massive scalability, chances are good you need one of the Apache search and search-related platforms: Lucene, Solr and Nutch.
Lucene was the original open-source search project started by Doug Cutting, who would go on to work on Nutch, Hadoop and Avro, and who currently acts as chief architect at Cloudera. Lucene provided an API that let developers add search to an application in a straightforward manner. It was quite functional and enjoyed wide adoption.
Soon, however, people realized that having a search API tightly integrated with your code meant that if your code crashed, search stopped working, and if Lucene ran slowly or crashed, your app was out of commission.
The clear answer was to make a search server. Before long, Solr came on the scene, thanks to Yonik Seeley, who recently joined Cutting at Cloudera. Solr was eventually moved in as part of the Lucene project, and committers from both participate in ongoing development.
Solr is a service that runs in its own memory space, ironically, under a Jetty web server from the Eclipse Foundation (Heliosearch have released a version of Solr that utilizes the Tomcat server).
In either environment, you interact with Solr through REST-like HTTP calls that exchange XML or JSON. This means that Solr can add, update and remove content while your app simply calls Solr for queries. Because your code and Solr are independent processes, a crash in one does not take down the other -- directly, at least.
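To make that separation concrete, here is a minimal sketch of what calling Solr over HTTP can look like from the application side. The host and port, the collection name "articles", the field name "title" and the canned response are all assumptions made for illustration; only the general shape of a select query returning JSON is taken from Solr itself.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


def extract_ids(response_text):
    """Pull the document ids out of a Solr-style JSON response body."""
    body = json.loads(response_text)
    return [doc["id"] for doc in body["response"]["docs"]]


# Hypothetical canned response, standing in for what a running server
# would return, so the sketch works without a network connection.
SAMPLE_RESPONSE = (
    '{"response": {"numFound": 2, "start": 0, '
    '"docs": [{"id": "a1"}, {"id": "b2"}]}}'
)

if __name__ == "__main__":
    # Assumed local server address and collection name.
    print(build_select_url("http://localhost:8983", "articles", "title:zoo"))
    print(extract_ids(SAMPLE_RESPONSE))
```

The application only ever builds a URL and parses text, which is why a problem inside the search server stays on the other side of the HTTP boundary.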
The final search element in ASF is Nutch, the command-line crawler released by Doug Cutting in 2003. Nutch acquires Web content by crawling sites and following links, and can scale to run on clusters of servers.
Apache: Not Just Search!
The limits that existed for enterprise search engines even a few years ago are today laughable. Verity, an early strong "enterprise search" engine, supported something like a million documents per index. At the time, that seemed huge.
Then the Web caught on outside of the nerd community. Now some companies generate hundreds of millions of records ("documents") every day. How can we possibly push all of that data into something that will let us do really big, interesting analysis?
Inspired by Google, Cutting and others jumped in and started working on what became Hadoop, to support the storage and retrieval of "big data" like logs and transactions. Hadoop is said to have been named after a stuffed elephant that belonged to Cutting's son -- which explains Hadoop's elephant logo.
Pretty soon, the supporting technologies in and around Hadoop started to appear: HDFS, the Hadoop Distributed File System; MapReduce, which processes massive data sets efficiently by spreading the work in parallel across a cluster; and now Spark, which offers a massive speed upgrade over MapReduce. And this is only the start of the big data tools, which seem to be changing faster than anyone can really manage!
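For readers who have not met MapReduce before, the idea can be sketched in a few lines with the classic word-count example. This is a single-process toy, not Hadoop's API: the function names are hypothetical, and a real cluster would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the zoo", "The elephant at the zoo"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be split across as many servers as the data requires.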
These are just a few of the growing number of projects under ASF. These free, open source tools drive perhaps millions of servers worldwide to slice data in ways that have never been viable until now. Researchers use the tools to understand molecular interactions and how genetics impacts drug efficacy. With the same tools, retailers are trying to understand what product you're most likely to buy on any web visit. And I'm sure governments are trying to understand how to prevent terrorist attacks.
The Elephant Handler and the Zookeeper
Apache Mahout, a project started by Ted Dunning, Grant Ingersoll and others, is a tool that enables recommendations and similarity discovery. It's useful in e-commerce, medical research and many other fields. In many of these environments, Hadoop is the repository, so the name Mahout is quite appropriate: a mahout is a person who looks after elephants.
The collected Apache projects are often referred to as the Apache zoo, and so the software used to coordinate clusters of many ASF tools running together is called ZooKeeper. It's all happening at the zoo!
Sure, there are commercial products that do some of this. And there are a number of different open source software groups. But the Apache Software Foundation is behind many of the technologies that touch us every day.