In spite of claims by NoSQL proponents, InformationWeek's 2010 State of Database Technology survey found that 44% of technology leaders have never heard of NoSQL databases like Hadoop.
Technology leaders don’t need to rush to understand Hadoop, but because the problems that NoSQL solves matter to many organizations, they should begin to acquire the mindset and skills to understand the potential of these tools.
Why You Should Care
In many organizations, data is growing very big, very fast. Many companies are generating more data in a year than they previously generated in a decade. Much of this data is too raw, unstructured or complex (e.g. documents) to analyze easily using traditional relational database techniques.
Enterprises are also finding it difficult to scale their existing approaches for working on analytics tasks across terabytes or even exabytes of data. As they try to squeeze more insights from what is collected, new tools are required, and Hadoop has become a popular choice.
Technical cognoscenti are singing the praises of Apache Hadoop (news, site) -- an open source framework for managing rapidly growing stores of complex, unstructured data. Thanks in part to US$ 36 million in funding for Cloudera (news, site), a provider of Hadoop-based software and services, many non-techies are betting on Hadoop as well.
Companies like Disney, TransUnion and Google (news, site) are exploring Hadoop for high-volume matching and categorization in very large, semi-structured or unstructured data sets. It is being used for functions ranging from fraud detection to understanding search behavior.
Companies are reporting that analyses that once took over an hour and a half to process now complete in a minute and a half using Hadoop.
What Is Hadoop
Hadoop is a NoSQL tool based on the MapReduce programming model from Google. MapReduce allows very large, semi-structured data sets to be processed in parallel on multiple computers, making processing much faster at lower cost.
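To make the model concrete, the classic MapReduce example is counting words. The sketch below simulates the map, shuffle and reduce phases in plain Python on a single machine; it is illustrative only and does not use Hadoop itself, but a real job has the same shape, with the shuffle and the per-key reducers spread across a cluster.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for one key.
    return word, sum(counts)

def run_job(lines):
    # The framework's shuffle step: group mapper output by key.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    # Each key's reducer runs independently, which is what lets a
    # real cluster process keys on many machines in parallel.
    return dict(reducer(w, c) for w, c in groups.items())

print(run_job(["big data is big", "data grows fast"]))
# → {'big': 2, 'data': 2, 'is': 1, 'grows': 1, 'fast': 1}
```

Because mappers see one record at a time and reducers see one key at a time, neither needs the whole data set in memory, which is why the model scales to very large inputs.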
What It Is Not
Hadoop is meant to complement existing data technology, not replace it.
- Hadoop works best for read-intensive processing as opposed to data that requires frequent updates.
- Hadoop is essentially batch oriented. Once it starts processing a data set, the data can’t easily be updated, so Hadoop is not appropriate for real-time uses. This may soon be a non-issue, since Yahoo recently open sourced its real-time MapReduce implementation, S4.
- Hadoop stores data unindexed in files. Finding something requires processing all the data. This means that Hadoop is not a substitute for a traditional database.
- Hadoop is not the tool to answer questions that require extensive recursion. The programming model simply does not support it well.
In relational databases, technologists must define columns and column types before loading the data. This reduces the speed with which a data model can evolve, which can limit how quickly organizations introduce new solutions or products.
Hadoop reverses this process. Data is loaded first, and query tools parse it according to a schema when it is read. This strategy allows the columns that map to the structure to be extracted without concern for the rest of the data. The approach increases flexibility and eliminates the need to normalize data first, which can be time consuming.
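This "schema on read" idea can be shown in a few lines. The sketch below is a hypothetical illustration in plain Python (the sample log lines and column names are invented): raw records are stored as-is, and a schema is applied only when a query runs, pulling out just the columns that query needs.

```python
import csv
import io

# Hypothetical raw log lines, loaded with no schema defined up front.
raw = """2010-11-02,alice,login,ok
2010-11-02,bob,search,hadoop tutorial
2010-11-03,alice,logout,ok"""

def read_with_schema(data, columns, wanted):
    # Apply the schema at read time: name the fields, then extract
    # only the columns this particular query cares about.
    for row in csv.reader(io.StringIO(data)):
        record = dict(zip(columns, row))
        yield tuple(record[c] for c in wanted)

# This query needs only the user column; the other fields are ignored,
# untouched, and a later query is free to interpret them differently.
users = [u for (u,) in
         read_with_schema(raw, ["date", "user", "action", "detail"], ["user"])]
print(users)  # → ['alice', 'bob', 'alice']
```

Note that nothing had to be normalized or indexed before the first query could run; the cost of interpreting the data is paid per query instead of up front.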
Hadoop is designed for distribution, which allows organizations to assemble a high-end computing cluster at a low-end price. Unlike some competing commercial database clustering technologies, Hadoop functions well on low-cost hardware. It also deals well with node failure by making redundant copies of the data across multiple nodes, as illustrated below.
Potential adopters should be aware that in some Hadoop implementations the computer that holds the metadata for managing the cluster (the NameNode in the diagram) is a potential single point of failure, and they should implement an appropriate failover strategy.
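The redundancy described above is controlled by HDFS's replication factor. As a sketch, assuming a standard HDFS deployment, the `dfs.replication` property in `hdfs-site.xml` sets how many copies of each data block the cluster keeps:

```xml
<!-- hdfs-site.xml: keep three copies of every block (the default),
     so losing a single data node does not lose data. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Replication protects the data blocks themselves; it does not by itself protect the NameNode metadata, which is why a separate failover strategy is still needed.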
The other cost consideration is product acquisition: most Hadoop implementations are open source and can be downloaded freely, so obtaining the tool and purchasing support are significantly cheaper than with commercial products.
Complex Data Analysis
Hadoop is fundamentally different from traditional data warehousing approaches. In most business intelligence solutions, transactional data is scrubbed for accuracy and consistency, indexed, and then the system is programmed to run queries against it. This approach often requires a significant up-front investment of time and money.
With Hadoop, instead of bringing the data back to a warehouse, you first determine the question. The data is then MapReduced, which allows it to be kept in raw, unindexed form except for what is required to answer the question. This approach has lower initial cost and risk, and it permits analysis of data sets that are less structured and more complex than typical transactional data.
Supportability is a challenge with Hadoop, and technology leaders must consider this issue before moving forward. The average technical resource will not easily be productive with the tool, and Cloudera is the only vendor that offers a Hadoop distribution with management tools and commercial support.
Only about 80 percent of Hadoop's planned features are implemented. Organizations attempting to use Hadoop in an enterprise-class application may find it more fragile and less manageable than traditional databases.
Data security was not a core architectural consideration in the design of Hadoop; it was bolted on later and is far from perfect. Organizations with regulatory requirements, or simply with a lot to lose from a data breach, should analyze extensively whether the tool meets their needs before making an investment.
The NoSQL market is not mature and neither are the products. There is always a risk in blazing a technology trail. While multiple large companies are making forays into Hadoop and other NoSQL tools, there are still many lessons to be learned and mistakes to be made as they evolve. Technology leaders must understand this risk and decide if their organization can afford experimentation.
There are far fewer technology professionals proficient in Hadoop than in relational database tools. Early adopters concur that the tools can be fairly challenging to use and manage. Organizations will likely have to invest in training or new resources to make productive use of Hadoop.
As data continues to grow and become more complicated, technology leaders will be challenged to capture even more of it and to deliver it to the business in a manner that helps uncover new opportunities. Forward-looking leaders will strive to make new business opportunities possible through technology, and Hadoop may be one of those tools.
What do you think about Hadoop? Let us know in the comments.