Enthusiasm for big data has grown from a trendy catchphrase into a serious force shaping business decisions.
Recently, big data has been at the center of a rash of acquisition activity in the enterprise search and knowledge management industries. Among the most prominent examples are Oracle acquiring Endeca and IBM obtaining Vivisimo.
While these companies don’t deal with big data in ways that most people historically associate with the term, they add value and meaning to big data. What is that value and why is enterprise search technology being increasingly associated with the big data world?
Unstructured Data is Different
The current explosion of interest in big data is driven by the inclusion of unstructured data in the analysis. According to analyst consensus, 80 percent of data falls into the “unstructured” category, which is fundamentally different from “structured” data.
Computers produce almost all structured data, so it is consistent and predictably formatted. Any inaccuracies tend to be programmatic, which makes them easy to spot and fix.
However, unstructured content is largely created by humans: inconsistent, emotional, careless, opinionated, lazy, driven, over-worked, always unique, humans. Appreciating this difference in the origins of the data that we seek to analyze is the first step to producing actionable insight and business advantage.
It is All About the Insight
Deriving insights from large data sets is not a new idea. Large B2C companies, such as food retailers, have been doing this for years. They have been gaining insight by moving cans of beans to different locations in the store or stocking cereal at different shelf heights and measuring what happens.
Online retailers use clickstream data to cross-sell more efficiently and to customize email marketing. This is a key component of Amazon’s highly successful business model and why you own a CSI: Miami box set in addition to your Law & Order collection.
Insights from these well-established practices are based entirely on structured data created by automated, transactional processes. The new challenge is how to derive actionable insight from unstructured data that does not easily organize itself for evaluation or calculation.
Making Insights Actionable
Adding structure to unstructured data is the foundation of gaining insight. Like turning a lump of clay into a finished sculpture, this does not happen by accident. It takes strategically designed technology and targeted knowledge to overcome entropy and create order out of chaos.
This is one reason why big software companies are acquiring search engine companies. Vivisimo, Endeca and others have mature and highly capable “indexing pipelines” that add structure to big data content prior to indexing. These “indexing pipelines” are crucial for ensuring that the insights gained from your big data are accurate and reliable.
If the steps taken to add structure fall short (e.g., dates are not normalized or entity extraction is incomplete), then the accuracy of the data behind the insights becomes questionable. In politics, as we recently saw, if your survey is flawed, you are not going to gain accurate poll numbers.
In business, if your data structure is flawed, you are not going to gain actionable business insights. Worse yet, you might not realize your data structure is flawed and make misinformed decisions that hurt your business.
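To make the idea of adding structure before indexing concrete, here is a minimal Python sketch of one enrichment step covering the two tasks mentioned above: date normalization and entity extraction. It is an illustration under stated assumptions, not how Vivisimo, Endeca, or any other product works. The names `enrich`, `normalize_date`, `DATE_FORMATS`, and `KNOWN_ORGS` are hypothetical, the list of date formats is deliberately tiny, and the "entity extraction" is a plain lookup against a hand-made list rather than a real extractor.

```python
import re
from datetime import datetime

# Hypothetical list of input formats; real content uses many more.
# Note: %B (month name) assumes an English locale.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")

# Hypothetical lookup list standing in for real entity extraction.
KNOWN_ORGS = ("Oracle", "Endeca", "IBM", "Vivisimo")

# Finds strings that look like dates in one of the formats above.
DATE_PATTERN = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}/\d{1,2}/\d{4}"
    r"|\d{1,2} [A-Z][a-z]+ \d{4}"
    r"|[A-Z][a-z]+ \d{1,2}, \d{4})\b"
)


def normalize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def enrich(text):
    """Attach structured fields to a piece of free text before indexing."""
    dates, unparsed = [], []
    for candidate in DATE_PATTERN.findall(text):
        iso = normalize_date(candidate)
        if iso is None:
            # Keep failures visible instead of silently dropping them.
            unparsed.append(candidate)
        else:
            dates.append(iso)
    orgs = [
        org for org in KNOWN_ORGS
        if re.search(r"\b" + re.escape(org) + r"\b", text)
    ]
    return {
        "text": text,
        "dates": dates,
        "unparsed_dates": unparsed,
        "organizations": orgs,
    }
```

Calling `enrich("Memo from IBM, dated 3/7/2012: meet the Vivisimo team on 12 March 2012.")` yields two dates in a single format plus the two organization names. The `unparsed_dates` field is the design choice worth noting: a value that looks like a date but cannot be normalized is reported rather than discarded, so a flaw in the structure is something you can see and measure.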
Technology + Process
While technology plays a prominent role in the process, adding structure to the unstructured is not just about the software. This is where the crossover from enterprise search to big data matters. Technology is leading the big data trend. But given the human nature of unstructured data, humans must be part of the solution.
It is the hands-on application of processes, pragmatism and checksums that produces the most value from unstructured data. A focus on transparency of process creates confidence in data provenance and enables actionable intelligence from unstructured data. That combination of technology and process is what is driving recent acquisitions and what can drive your business to make better, more accurate decisions based on your unstructured big data.
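Taking "checksums" literally, one simple way to support provenance is to store a fingerprint of the original content next to the structure derived from it. The Python sketch below is only one possible reading of that idea, assuming a SHA-256 hash is acceptable; the names `fingerprint`, `with_provenance`, and `is_unchanged`, and the record layout, are hypothetical.

```python
import hashlib


def fingerprint(text):
    """SHA-256 hex digest of the raw text, used as a content checksum."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def with_provenance(source_id, raw_text, fields):
    """Bundle derived fields with a checksum of the content they came from."""
    return {
        "source_id": source_id,
        "raw_sha256": fingerprint(raw_text),
        "fields": fields,
    }


def is_unchanged(record, raw_text):
    """True if the source still matches what the fields were derived from."""
    return record["raw_sha256"] == fingerprint(raw_text)
```

If the source document changes after processing, `is_unchanged` returns `False`, which signals that the derived fields should be rebuilt before anyone relies on them.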
Editor's Note: Care to read more about the challenges facing unstructured data? Check out Barry Schaeffer's The Battle for Data Supremacy: The Cost of Ignoring XML
About the Author
Kamran Khan is the co-founder and current President and CEO of Search Technologies, an enterprise search implementation, consulting, and managed services company. Kamran has been developing, supporting, selling, and managing in the computer services/software industry for 25 years with a focus on search engine technology.