With the proliferation of data and automated analytic tools, spurious “insights” and meaningless correlations become more common every day.
One cause of this is the blind application of unsupervised learning techniques. Many of the successes that approach seems to yield can probably be attributed to chance alone: search enough variables and an analysis will almost always discover something that looks interesting in the data.
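To see how easily chance produces "interesting" results, here is a minimal sketch that fills a table with pure random noise and then reports the strongest correlation it can find between any two columns. The function names, column counts and seed are illustrative assumptions, not part of any particular tool.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def strongest_chance_correlation(n_columns=40, n_rows=20, seed=0):
    """Largest absolute correlation among columns of unrelated random noise."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]
    # Every pair of columns is unrelated by construction, so any
    # "relationship" found here is a coincidence.
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))

print(round(strongest_chance_correlation(), 2))
```

With 40 columns there are 780 pairs to compare, so a few of them will look strongly related even though none of them are. The more columns you search, the stronger the best coincidence becomes.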
Another cause is dogmatic attention to pet and faith-based theories or, at the opposite end of the spectrum, a lack of base knowledge about what drives a business. Data can be massaged to support any perspective. Remember the aphorism Mark Twain popularized: “There are three kinds of lies: lies, damn lies and statistics.”
The challenge we face is to separate what we learn via our biases and interesting coincidences from the true insights to be found in data. Here are some quick, and some not so quick, tips for getting closer to the truth.
Apply Common Sense
In some cases, the data built up over a lifetime tells you everything you need to know. For example, I may have started to make more money at around the same time I started to eat bananas, but it would be ridiculous to conclude that I make more money because I eat bananas. When it comes to some suggested correlations, our finely tuned common-sense meters can often quickly identify and rule out any phony relationships.
Of course, it also pays to keep an open mind. Many discoveries have come from refuting what everyone knew (or thought they knew) to be true.
Split Your Data, or Look at More Data
A common way to split your data is to set aside a “holdout” data set before running any analysis. This is typically done by randomly splitting your data, say on an 80-20 basis. Of course, the danger here is that analysts start to optimize their efforts around what the post-holdout analysis says, rather than thinking deeply about the interactions in the core model.
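A random 80-20 split can be sketched in a few lines. The function name, the fixed seed and the use of plain Python lists are assumptions made for illustration; most analytic libraries offer an equivalent helper.

```python
import random

def split_holdout(rows, holdout_fraction=0.2, seed=42):
    """Randomly set aside a holdout set; returns (working_rows, holdout_rows)."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(rows)           # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

working, holdout = split_holdout(range(100))
print(len(working), len(holdout))   # 80 20
```

The holdout rows should stay out of sight until the analysis on the working rows is finished. Checking them repeatedly while tuning is exactly the optimization trap described above.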
Another effective approach, depending on domain, can be to look at different time slices. Maybe I run an amusement park and only eat bananas in the summer, when the amusement park brings in more revenue. By looking at the data on a seasonal or monthly basis, we will see whether the banana effect is really just a summer effect rather than a true driver.
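The sketch below shows the idea on made-up numbers chosen purely to illustrate the point: revenue depends only on the season, but bananas are eaten mostly in summer. The record layout and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative records: (season, ate_banana, revenue).
# Revenue depends only on the season, never on the banana.
records = (
    [("summer", True, 100)] * 8 + [("summer", False, 100)] * 2
    + [("winter", True, 50)] * 2 + [("winter", False, 50)] * 8
)

def banana_gap(rows):
    """Average revenue on banana days minus average revenue on other days."""
    with_banana = [rev for _, ate, rev in rows if ate]
    without = [rev for _, ate, rev in rows if not ate]
    return mean(with_banana) - mean(without)

def gap_by_slice(rows):
    """Compute the same gap separately within each season."""
    slices = defaultdict(list)
    for row in rows:
        slices[row[0]].append(row)
    return {name: banana_gap(group) for name, group in slices.items()}

print(banana_gap(records))    # 30
print(gap_by_slice(records))  # {'summer': 0, 'winter': 0}
```

Across the whole year, banana days look 30 units better. Within each season the gap vanishes, which is the signature of a summer effect rather than a banana effect.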
Take the Time to Look at Your Data
We can use computers to do more than just crunch numbers. For example, we can use them to create charts that provide a visual representation of data. Generating a large number of charts that let you look at your data visually can be critical. The human perspective is very different from the machine’s perspective. Items that an algorithm will miss can be glaringly obvious to a person. For example, a computer would not care that a U.S. political ad was purchased in rubles. A person, however, should instantly recognize that as an indication that something is amiss.
Do Not Torture Your Data
A frequent pitfall to be avoided is torturing the data until it agrees with you. It is often easy to explain away inconvenient truths that invalidate our hypotheses: The bananas were too green one week when revenue was down, they were bruised another week, and they were good that other week, except we bought them on Tuesday instead of Wednesday. Before we know it, we’ve trimmed the data set to only the results that match our desired outcome — the banana effect.
Two effective ways to counter our inclination to rely on our biases are, first, to establish a standard set of criteria for removing certain types of data before any analysis occurs and, second, to reframe the removal question to examine the decision from alternative viewpoints. You must decide whether you are cutting data from the analysis for convenience or because there was an actual problem, such as a server crash.
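One way to make removal criteria explicit is to write them down as named rules before looking at any results, and to record why each row was dropped. The following is a rough sketch; the rule names and the row fields (`server_crashed`, `revenue`) are hypothetical.

```python
# Hypothetical exclusion rules, agreed on BEFORE any analysis is run.
EXCLUSION_RULES = {
    "server_crash": lambda row: row.get("server_crashed", False),
    "missing_revenue": lambda row: row.get("revenue") is None,
}

def apply_exclusions(rows, rules=EXCLUSION_RULES):
    """Split rows into kept rows and (row, reasons) pairs for removed rows."""
    kept, removed = [], []
    for row in rows:
        reasons = [name for name, rule in rules.items() if rule(row)]
        if reasons:
            removed.append((row, reasons))  # keep an audit trail of removals
        else:
            kept.append(row)
    return kept, removed

rows = [
    {"week": 1, "revenue": 120},
    {"week": 2, "revenue": 80, "server_crashed": True},
    {"week": 3, "revenue": None},
    {"week": 4, "revenue": 60},  # inconvenient, but no rule removes it
]
kept, removed = apply_exclusions(rows)
print([row["week"] for row in kept])                      # [1, 4]
print([(row["week"], why) for row, why in removed])
```

Note that week 4 stays in the data set. A low number that breaks no predefined rule is an inconvenient truth, not a data problem.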
Create Experiments to Test the Results
Can you replicate the outcome of your analysis? Is there a way to put the outcome to the test, or should you simply eat more bananas?
As humans, we can often be blinded to the truth by emotional attachment to our pet theories. A corollary to this is, “Don’t create experiments that are certain to reinforce your position.” For example, if I set up my experiment so that I only eat bananas on payday, the result will reinforce my earlier intuition that I make more money when I eat bananas.
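One common safeguard, offered here as an illustration rather than as the only option, is to let a random draw decide which days get the treatment, so that banana days cannot line up with payday by design. The function name, the 28-day window and the payday dates are assumptions.

```python
import random

def assign_banana_days(days, seed=7):
    """Randomly pick half of the days as banana days."""
    rng = random.Random(seed)   # fixed seed so the schedule can be audited
    shuffled = list(days)
    rng.shuffle(shuffled)
    return set(shuffled[: len(shuffled) // 2])

days = range(1, 29)             # a hypothetical four-week test window
paydays = {14, 28}              # assumed payday schedule
banana_days = assign_banana_days(days)

print(sorted(banana_days))
print("paydays that are also banana days:", sorted(banana_days & paydays))
```

Because the schedule is fixed by the draw rather than by the experimenter, any revenue difference between banana days and other days is harder to attribute to a rigged setup.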
There are many articles and books that describe ways to validate or replicate results through experimentation. The key is to take that first step.
Enjoy the Banana
As we rely on data processing to identify alternative and unique trends and insights, we will come across more and more surprising “relationships.” Some of these will be real, others coincidental. But almost all of them will be nonobvious, simply because we are already aware of the obvious relationships.
Splitting the data before analyzing, looking at the relationships between fields and using common sense all serve as a first line of defense. After that comes the interesting work of testing to replicate the results and explore their robustness.