We are generating a massive amount of data. This isn’t hyperbole. Social networking activity streams, sensor data, personal information collected to personalize a user experience: it’s all data. Almost every interaction that we participate in, whether online or in the "real world," results in some detail, such as a log entry or sales transaction, being stored. Innovative solutions, such as NoSQL, have emerged to help manage and analyze these increasingly large and complex pools of data, but in many cases organizations need answers now, in near real time. Unfortunately, this is often not a core strength of analytical big data technologies. DataStax is looking to change that.
Big Data, Right Now
Dear Virginia, yes, big data is real. Not only is the amount of data being generated staggering, but the pace at which it’s being generated is astounding, and it’s getting more complex. You don’t have to be a data scientist to understand that there is a lot of opportunity hidden in that data and its relationships. Extracting that value efficiently is why industry leaders are converging at conferences such as this week’s Structure Big Data and September’s O’Reilly Strata and gravitating toward open source solutions targeted at big data manipulation.
There are differing perspectives on how to maximize the value of big data, but most agree that high-latency analytical access and low-latency real-time access are equally necessary. As the big data market continues to evolve, analytical solutions have matured faster than real-time techniques. A number of traditional data warehousing tools, along with newer technologies such as Apache’s MapReduce framework, Hadoop, and its data warehouse infrastructure platform, Hive, are capable of performing deep analysis across massive datasets.
Now, real-time utilization of big data is gaining momentum as traditional relational technologies begin to buckle under the strain of providing sub-second access to growing volumes of data. A new class of data stores, termed NoSQL (not only SQL), has emerged to handle these challenges. While the advancements are positive, unfortunately the analytical and low-latency approaches are developing independently, forcing organizations to replicate data or settle for complex solutions with slower access.
DataStax saw an opportunity to simplify. The company, which provides products and services for Apache Cassandra (a highly scalable, low-latency, distributed column-based NoSQL repository popularized at Facebook), has announced Brisk, an open source Hadoop and Hive distribution that integrates Cassandra.
A Big Happy NoSQL Family: Hadoop, Hive, Cassandra
The release of Brisk might lead to comparisons with the recent merger of Membase and CouchOne to form Couchbase. However, in our opinion, Brisk is more about technical convergence than it is about continued consolidation of the NoSQL market. Brisk will allow customers to reduce the time between when data is created and when it is analyzed.
Brisk provides users with the benefits of Cassandra for serving up high volumes of data while Hadoop and Hive work against the same data at the same time. No copies. No waiting.
Brisk's low-latency big data analysis
Organizations no longer have to be satisfied with nightly ETL processes and batch analysis to get answers from their data. Activities such as allowing real-time data access to millions of users simultaneously, analyzing the data using Hive and feeding the resulting analysis back into the application become feasible. Brisk is an improvement over using these tools independently, and because DataStax has integrated the technologies, it should result in a simpler deployment effort.
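To make that loop concrete, here is a minimal sketch of the data flow in plain Python. It does not use the Cassandra, Hadoop, or Hive APIs; the `SharedStore` class, the `run_batch_analysis` function, and the sample events are all hypothetical stand-ins, chosen only to show one store serving live reads and writes while a batch job scans the same rows and writes its results back.

```python
from collections import Counter


class SharedStore:
    """Hypothetical in-memory stand-in for a single shared data store."""

    def __init__(self):
        self.events = []            # raw rows written by the live application
        self.recommendations = {}   # analysis results written back by the batch job

    def write_event(self, user, item):
        """Low-latency write path used by the live application."""
        self.events.append((user, item))

    def read_recommendation(self, user):
        """Low-latency read path used by the live application."""
        return self.recommendations.get(user)


def run_batch_analysis(store):
    """Batch job that scans the same rows in place, with no ETL copy."""
    popularity = Counter(item for _, item in store.events)
    seen = {}
    for user, item in store.events:
        seen.setdefault(user, set()).add(item)
    for user, items in seen.items():
        # Recommend the most popular item this user has not interacted with.
        for item, _count in popularity.most_common():
            if item not in items:
                store.recommendations[user] = item
                break


store = SharedStore()
for user, item in [("ann", "a"), ("bob", "a"), ("bob", "b"), ("cat", "a")]:
    store.write_event(user, item)

run_batch_analysis(store)                  # analysis runs against the live data
print(store.read_recommendation("ann"))    # the application reads the result back
```

In a real deployment the batch step would presumably be a Hive query and the store a Cassandra cluster; the point of the sketch is only that both sides work against one copy of the data.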
How You Can Get It
Brisk is expected to be available under the Apache open-source license in about a month and a half.
Do you plan on trying your hand at Brisk? We’d love to hear about your experience.