Sorry folks, but this shouldn’t come as too big of a surprise. Anytime a new technology or field emerges, so does a group of posers. They’re typically software vendors, consulting firms and “experts” who claim to be able to help you cross the chasm between where you are and where you need to go to remain viable in the future.
These aren’t, for the most part, evil companies, snake oil salesmen or undereducated individuals. Vendors iterate products as quickly as they can and push them out too early. They take shortcuts and rationalize them, and sometimes they simply don’t know that they don’t know what they’re doing.
Big data is still an emerging field.
Talking the Talk
When it comes to Hadoop engineers, MongoDB, Couchbase and DataStax developers, data scientists and such, some genuinely believe that their experience and innate abilities, paired with reading a “for dummies” book and a little luck, are enough. If you ask us, that’s enough to learn to “talk the talk”, to be dangerous, and to potentially put a project at risk.
There are vendors, solution providers and individuals in these scenarios, and in the real world, who are “faking it.”
And though “faking it” is by no means an intended theme at the O’Reilly Strata and Hadoop World conference being held in New York City this week, it’s a common thread we keep finding during interviews and presentations.
We heard it first from Platfora’s vice president of marketing, Viviana Faga, as we were ending a briefing call concerning news her company would be making at the conference. “What do you think about Salesforce’s Analytics Cloud, ‘Wave’?” we asked.
“It’s fake big data,” was her answer.
We were taken aback. We warned her that the comment was on the record. She wasn’t the least bit concerned. Platfora’s CEO, Ben Werther, went on to explain that some vendors who claim to provide platforms from which powerful, big data-informed insights could be gleaned weren’t big data-based at all. “It’s like making decisions based on what you see in the tip of the iceberg, instead of the whole iceberg,” he said. “Why do that when it’s possible (and affordable) to see the whole iceberg?”
From Platfora’s point of view, analyzing a part of the data is not big data. The whole point of big data is storing everything, enriching data when it’s called for, blending data, analyzing data, and gleaning rich, actionable insights from it. Otherwise you’re selling yourself, and the whole promise of big data, short.
And what’s bad about that, aside from wasting precious time and resources, is that in this day and age, the enterprises that truly leverage big data will gain a significant competitive advantage over those that don’t. So “fake big data” could literally put not only your company and customers at risk, but your job as well. We all know project owners who have gotten canned or had their careers dead-ended over failed projects.
Now that big data is being embraced by the masses, we will no doubt start hearing about how big data doesn’t live up to its promises. But it may not be big data that’s to blame. Instead, “fake big data” could be the culprit.
Faking it with Statistics
John Rauser, a data scientist at Pinterest, made a compelling presentation, “Statistics Without the Pain,” on the Strata stage. He opened with a rather alarming sentence.
“I wrote this talk,” he said, “because I suspect that many people in this audience (data scientists with engineering backgrounds) are faking it when it comes to statistics.”
He was received with laughter, which probably means that at least a few of the folks in the audience identified. Rauser went on to describe how a data scientist with an engineering background might react to a conversation with a data scientist/statistician who was going on and on about tools like power analysis, generalized linear models, two-tailed tests …
“You nod your head, you play along, but you really have no idea what they’re talking about,” he confessed.
It doesn’t have to be this way, he said, and went on to give the data scientists in the room an out.
“Engineers should look closely at what they are studying, and translate the questions being asked into a series of simple computational methods. If you can program a computer, you have direct access to the deepest and most fundamental ideas in statistics,” Rauser said.
In other words, data scientist/engineers don’t need to fake it; they can get the job done using tools they actually know.
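To make that idea concrete, here is a minimal sketch of one such simple computational method: a permutation test for whether two groups differ in their means. It is our own illustration, not code from Rauser’s talk; the function name `permutation_test`, its parameters and the sample numbers are all hypothetical. Instead of reaching for a formula, you shuffle the group labels many times and count how often chance alone produces a gap at least as large as the one you observed.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    Hypothetical helper for illustration only: it pools both groups,
    reshuffles them repeatedly, and reports the share of shuffles whose
    gap in means is at least as large as the observed gap.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # pretend the group labels were assigned at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples


if __name__ == "__main__":
    # Made-up measurements for two groups, purely for demonstration.
    control = [12, 15, 11, 14, 13, 12, 16]
    variant = [14, 17, 15, 18, 16, 15, 19]
    print("estimated p-value:", permutation_test(control, variant))
```

A small estimated p-value suggests the observed gap would rarely arise from random labeling alone. The result is an approximation that depends on the number of reshuffles and the seed.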
Can You Really Handle Full SQL in Hadoop?
If you’re not an Actian customer, you’re probably not doing full SQL in Hadoop, or so said John Santaferraro, vice president of marketing at Actian Analytics. He suggests that when you shop for SQL-on-Hadoop solutions, you ask questions like: “Do you support analytics functions like grouping sets, cube, rollup and windowing functions? Do you support real time individual updates, inserts and deletes? Does your security include authentication, user and role-based security? Are you running 100 percent in Hadoop via YARN? Does your offering include visual data blending and enriching capabilities?”
He offered five other questions as well, but the point is made. Just because a vendor says it does full SQL on Hadoop doesn’t mean that it’s true. You have to look closely to see if they’re faking it, because some of them are.
Do All Data Scientists Code?
Do data scientists need to know how to code? This was the subject of a spirited debate yesterday. Hilary Mason, a data scientist with rock star status, said that in this day and age the answer is yes. If we were still writing on tablets her answer might have been different …
Joseph Adler, director of product management at Interana, begged to differ. He argued that there are tools that non-coding data scientists can use to get the job done. He made a good case. But in the end, the crowd sided with Mason and her team.
We weren’t sold either way until Mason made one point in particular: if you’re smart enough and educated enough to earn data scientist creds, you can learn to code.
So here’s a message for the pseudo data scientists who don’t code: if you learn to code, you won’t have to fake it.