Since it's big and ingests lots of new kinds of things, Big Data makes many of its own rules. A new report from the Kimball Group attempts to present the best practices for dealing with this frontier.
The report, Newly Emerging Best Practices for Big Data (registration required), points out that Big Data delivers three "shocks to the system" of normal data processing: the need to go outside standard relational databases and SQL, the move away from simple filters and aggregations toward analytics, and the sharply increased value of faster performance.
Overall, the Kimball Group recommends structuring big data environments around analytics, not around ad hoc querying or standard reporting. This doesn't mean throwing out your existing environment, but extending it to allow for complex analytics that are defined by users, or to enable a metadata-driven development environment that can be adapted for the task at hand.
A key best practice cited in the white paper is understanding that what you're building now will probably not last very long, given how fast Big Data is evolving. "Plan for disruptive changes coming from every direction," the report advises, pointing to "new data types, competitive challenges, programming approaches, hardware, networking technology, and services offered by literally hundreds of new big data providers."
The best approach in such a rapidly changing landscape, according to Kimball, is maintaining a balance among a variety of implementation approaches that include the open source Hadoop, traditional grid computing, push-down optimization in a relational database management system, on-premises and cloud computing, and even -- gasp! -- your mainframe.
To acquire the tools you'll need, the report suggests Platform-as-a-Service as a viable option. Apache Hadoop, open source software that handles distributed processing of large data sets, has become a key technology for Big Data.
Also, be sure to adopt a new attitude that embraces change, with the assumption that you will be reprogramming and rehosting all of your Big Data applications within a couple of years. The report also advises that a metadata-driven, codeless development environment can increase productivity and possibly provide some insulation from the changes.
Given this period of change, managers of Big Data are also advised to "embrace sandbox silos and build a practice of productionizing sandbox results." In other words, proofs of concept might be developed in a wide variety of languages and architectures as required by the data scientists figuring out new approaches -- and the IT department will just need to port the results to its environment when they're done.
To get started, the report advises that IT departments "put your toe in the water with a simple big data application" of backup and archiving, using Hadoop as an inexpensive, flexible backup and archiving tool.
As environments change, architectures change as well, and the Kimball Group has a number of architecture-related best practices to recommend. One is planning for what it describes as a "logical data highway" with various caches of increasing latency, but implementing only those caches that work for your environment.
For example, raw source applications like credit card fraud detection have little or no latency, while real-time applications -- such as web page ad selection or some kinds of predictive monitoring -- have latencies measured in seconds. Business activity applications, like trouble-ticket tracking or customer service portals, might have a latency measured in minutes. At the top end, with latency measured in days or longer, are all forms of reporting, ad hoc querying, historical analysis and other enterprise data warehouse applications.
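The latency tiers above can be expressed as a simple lookup. The following is a minimal sketch, assuming hypothetical tier names and latency bounds; these are illustrative, not the report's own figures:

```python
# Hypothetical sketch of the "logical data highway" latency tiers.
# Tier names and latency bounds are illustrative assumptions.
from datetime import timedelta

LATENCY_TIERS = [
    ("raw source",        timedelta(0)),           # e.g. fraud detection
    ("real-time",         timedelta(seconds=5)),   # e.g. ad selection
    ("business activity", timedelta(minutes=15)),  # e.g. trouble tickets
    ("data warehouse",    timedelta(days=1)),      # reporting, ad hoc queries
]

def tier_for(max_acceptable_delay: timedelta) -> str:
    """Return the slowest (and typically cheapest) tier that still
    meets the caller's delay budget."""
    eligible = [name for name, latency in LATENCY_TIERS
                if latency <= max_acceptable_delay]
    return eligible[-1]  # last eligible tier has the highest latency
```

For instance, a dashboard that can tolerate a one-minute delay would land on the "real-time" tier, while an overnight report would be served from the "data warehouse" tier -- the point being that you only build the caches your workloads actually demand.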
Dimensionalizing the Data
Big Data analytics can be used to glean facts and insights that are then moved to the next cache -- for example, analyzing unstructured tweets to produce sentiment measures over time for audience engagement, conversation reach, active advocates and their influence, and other parameters.
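A rollup like the sentiment-over-time measure described above can be sketched as follows. This is a deliberately simplified assumption -- real sentiment scoring uses trained models, not keyword lists, and the word sets here are invented for illustration:

```python
# Hypothetical sketch: rolling per-tweet sentiment into daily measures.
# Keyword-list scoring is a stand-in for a real sentiment model.
from collections import defaultdict

POSITIVE = {"awesome", "great", "love"}
NEGATIVE = {"terrible", "hate", "broken"}

def score(text: str) -> int:
    """Crude per-tweet sentiment: positive hits minus negative hits."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def daily_sentiment(tweets: list[tuple[str, str]]) -> dict[str, float]:
    """tweets: (date, text) pairs -> average sentiment score per date."""
    totals: dict[str, int] = defaultdict(int)
    counts: dict[str, int] = defaultdict(int)
    for day, text in tweets:
        totals[day] += score(text)
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in totals}
```

The result is exactly the kind of derived fact the report suggests promoting from a low-latency cache into the warehouse for historical analysis.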
Similarly, the white paper presents best practices for the logical and physical structure of the data. Dimensionalizing the data by assigning measurable parameters is the first step. As an example, the report notes that a single tweet of "This is awesome!" can be characterized, or dimensionalized, by customer location, product or service, marketplace condition, provider, weather, demographic cluster, triggering prior event and so on. Other best practices are specified for governance (meaning that Big Data governance is part of your overall data governance) and privacy.
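To make the dimensionalizing idea concrete, a single raw tweet might be wrapped with whatever dimensional attributes are known at load time. The field names below are hypothetical assumptions, not the report's schema:

```python
# Hypothetical sketch: attaching dimensions to a raw tweet so it can be
# analyzed alongside conventional warehouse data. All field names are
# illustrative assumptions, not the report's schema.

def dimensionalize(tweet_text: str, context: dict) -> dict:
    """Wrap a raw tweet with the dimensional attributes available at load time."""
    return {
        "fact": {"text": tweet_text},
        "dimensions": {
            "customer_location": context.get("location"),
            "product":           context.get("product"),
            "marketplace":       context.get("marketplace"),
            "weather":           context.get("weather"),
            "demographic":       context.get("demographic"),
            "trigger_event":     context.get("trigger_event"),
        },
    }

record = dimensionalize("This is awesome!", {
    "location": "Austin, TX",
    "product": "Model X-1",
    "trigger_event": "product launch",
})
```

Unknown dimensions simply come back as `None`, reflecting that a tweet rarely arrives with full context; the structure still lets it be grouped and filtered by whatever dimensions are present.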
While it acknowledges its debt to the world of enterprise data warehouses and business intelligence, this report makes clear that Big Data involves new data types and an explosion of channels providing data -- and, therefore, new opportunities to turn all that information into intelligence. Given the rapid pace of Big Data adoption, utilization and tools, this report makes an admirable first effort at laying down some guidelines for such an environment.
Image courtesy of Dirk Ercken (Shutterstock)