You would think we'd have big data all figured out by now.

Between tools like Hadoop, Spark and Kafka, concepts like machine learning and AI, and the efforts of highly paid data scientists, that's a reasonable assumption.

But as the volume, variety, velocity and variability of data grow, so does the need for tools that can capture it, cleanse it, plow through it and eventually deliver the insights we need to make game-changing decisions. Bottomless data lakes, infinite drives and data-crushing engines don't necessarily make us smarter.

We literally need to think outside of the box and beyond the cloud, as a small but growing number of vendors are doing now. Take a look.

Microsoft Buys Synthetic DNA

San Francisco-based genetics startup Twist Bioscience just sold 10 million long oligonucleotides (synthetic DNA) to Microsoft. Microsoft researchers will work with the DNA strands to explore new methods of data storage.

This might seem like a Google-style moonshot, but it's a lot more practical. You see, digital data stored on today's media has a relatively brief shelf life of two to 10 years and periodically needs to be re-encoded. Data stored on DNA strands could survive as long as 1,000 years, at least theoretically.

In an initial test with Twist, Microsoft researchers found they could encode and recover 100 percent of the digital data from synthetic DNA. They were pretty impressed.

"We're still years away from a commercially viable product, but our early tests with Twist demonstrate that in the future we'll be able to substantially increase the density and durability of data storage," said Doug Carmean, an architect in Microsoft's technology and research organization.
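To make the encode-and-recover round trip concrete, here is a minimal sketch in Python. It is not the method Microsoft or Twist used, which the article does not describe. It assumes the simplest possible mapping, two bits per nucleotide (A=00, C=01, G=10, T=11), and the function names `encode` and `decode` are hypothetical. Practical schemes also add error correction and addressing and avoid long runs of the same base, all of which this sketch ignores.

```python
# Assumed illustrative mapping: each base carries 2 bits (A=00, C=01, G=10, T=11).
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, four bases per byte, high bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Recover the original bytes from a strand produced by encode()."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4 bases")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # str.index raises ValueError on anything other than A, C, G or T.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)          # CAGACGGC
    print(decode(strand))  # b'Hi'
```

At two bits per base, every byte costs four nucleotides, so the density gains the researchers describe come from how tightly the molecule itself packs, not from the encoding.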

SnappyData Delivers Smarter Answers

Query a big data set and you'll have to wait a very long time if you want a highly accurate answer. And the more data you have, the longer you'll have to wait. Research conducted by CSC projects a 4,300 percent increase in annual data generation by 2020. So crunching every bit is not only impractical but in some cases pointless, because the moment of opportunity could pass before the insight arrives.

Portland, Ore.-based startup SnappyData thinks it has an answer. It has created something called "Approximate Query Processing" (AQP) that gleans answers from a subset of the data and assigns a confidence level to the results.
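The general idea behind approximate query processing can be sketched in a few lines of Python. This is not SnappyData's implementation, which the article does not describe. It assumes the textbook approach: draw a uniform random sample, compute the aggregate on the sample, and report a normal-approximation confidence interval. The function name `approximate_mean` and its parameters are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=None):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin). With z = 1.96, estimate +/- margin is an
    approximate 95 percent confidence interval under the normal approximation.
    The finite-population correction is ignored, so the margin is conservative
    when the sample is a large share of the data.
    """
    if not 2 <= sample_size <= len(values):
        raise ValueError("sample_size must be between 2 and len(values)")
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


if __name__ == "__main__":
    data = list(range(1_000_000))
    estimate, margin = approximate_mean(data, sample_size=10_000, seed=42)
    print(f"mean is roughly {estimate:.1f} +/- {margin:.1f}")
```

The trade-off is visible in the formula: the margin shrinks with the square root of the sample size, so a sample 100 times larger buys only about 10 times more precision.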

Incubated at Pivotal by three former Pivotal employees (Sudhir Menon, Richard Lamb and CTO Jags Ramnarayan), SnappyData just closed a $3.65 million Series A round of funding from Pivotal, GE Ventures and GTD Capital.

Apache Kafka is Going Mainstream

Confluent, the commercial entity behind Apache Kafka, sponsored the first global gathering of Kafka enthusiasts at the sold-out Kafka Summit in San Francisco last week. Confluent co-founder Jay Kreps told the crowd something that CMSWire readers already know: Enterprises are becoming digital entities.

Kreps and his co-founder Neha Narkhede are betting that Kafka will be a component of modern enterprise infrastructure. Kafka acts like a central nervous system for enterprise data: it not only collects data at scale but also makes it available in real time. It also has a processing engine so that you can derive value quickly.

Kafka is used at LinkedIn, for example, to let you know that someone has "liked" something you've posted. Uber uses it to predict when your ride will be arriving. Kafka could be key to the Internet of Things in the Enterprise as well.

Constellation Research analyst Holger Mueller told CMSWire that "Kafka has established itself in message digestion." What we're still waiting to see, he added, is "who wins the processing battle as the next logical step for next gen Apps is IoT, social media processing, interment crunching, revolutionizing human machine interface (bots!)"

At the summit, Confluent unveiled a survey conducted by Researchscape that provides insights into Kafka adoption. It revealed that 29 percent of Kafka users hail from companies with $1 billion or more in annual sales and that 88 percent of respondents expect Kafka to be a mission-critical part of their data and application infrastructure by 2017.

Does Google Cloud Dataflow Outpace Apache Spark?

Anytime a technology as hot as Apache Spark emerges, you know that something that claims to be hotter will be close behind. This is what happened with Spark and Hadoop (and it turns out that the two actually work together quite well), though Spark is clearly in the spotlight right now.

Fans of Google Cloud Dataflow say it is even faster than Spark.

It's not Google pushing this agenda per se. Rather, Durham, N.C.-based Mammoth Data seems to be doing it for Google. The consultancy released the results of a benchmark study that claims Google Cloud Dataflow offers greater performance than Spark via work rebalancing and intelligent auto-scaling, with no added operational complexity.

Mammoth Data researchers also claim that Google Cloud Dataflow is more developer-friendly, simpler to operate and easier to integrate.

Perhaps what's most notable about this is that Google Cloud Dataflow's API was recently promoted to an Apache Software Foundation incubation project called Apache Beam. Developers will make or break Google Cloud Dataflow and Google knows that, hence its decision to go the Apache Way.

Qubole Open Sources StreamX and Quark

San Francisco-based Big-Data-as-a-Service (BDaaS) provider Qubole has open sourced its proprietary StreamX, a service that ingests data logs from Kafka and persists them to cloud object stores such as Amazon S3. Earlier this month it open sourced Quark, its SQL optimization project, to help simplify and optimize access to data.

What's the motivation? To share its goodness with the world, of course. But more practically speaking, to bring more users to its service. As Pivotal learned with its proprietary Hadoop offering, if you build it and it's not open source, they won't come. Even if it's really wonderful.

Free, On-Demand Stream Processing Training

Proprietary big data software and Hadoop vendor MapR seems to be defying the odds on the open source edict. How are they accomplishing that? Maybe by providing no-cost training in open source technologies such as Hadoop and Spark to would-be big data engineers. Last week they announced they would provide some free training around Kafka as well.

Title image by Josh Felise