The promise of big data is that you will uncover unexpected opportunities, information and issues. But too often, it depends on faulty correlations.

Correlation does not prove causation — we've all heard this a million times. But we still sometimes forget it, especially those of us who buy into the promise of big data.

Try this one: if you want to predict the growth rate of email spam, track the market for genetically modified soybeans.

This is just one of many entertaining correlations highlighted in Tyler Vigen's aptly-named book, "Spurious Correlations." The book uncovers absurd illustrations of the "correlation does not prove causation" aphorism, such as the correlation between "cheese consumption and fatal bedsheet tangling accidents (94.7 percent)," or my personal favorite, the high degree of correlation between "the marriage rate in Kentucky and the number of people who drowned after falling out of a fishing boat (95.2 percent)."

A Flaw in the Big Data 'Revolution'

A central tenet of big data involves searching for correlations between operational data. 

The idea is simple and straightforward. With cheap cloud storage, we can now collect a dizzying amount of data related to all sorts of business processes, from the number of trucks arriving at each corporate loading dock, to the number of orders processed per minute on any given day and hour, to the number of customer complaints received on the Monday following a holiday weekend.

New powerful processors and scalable databases enable skilled operators to mine these data, looking for patterns within the numbers: specifically, correlations between operational variables. By uncovering these patterns, big data promised to expose complex relationships to unlock bottlenecks and surface operational problems, thereby enabling a whole new era of data-driven productivity.

The problem is that these correlations are often just as spurious as the one linking "the estimated revenue from bingo games to the number of homes with indoor houseplants (89.3 percent)." It’s simply a matter of math.

Given enough sets of numbers, some unrelated sets will show a strong correlation purely by chance; this is called a coincidence. So while my college literature course may explain why "the number of literature books published every year correlates to the number of suicides by hanging (96.4 percent)," it is most likely a simple coincidence … (apologies to Dr. Gilman).
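To see why this is a matter of math, here is a minimal sketch in Python. It generates purely random, unrelated series and reports the strongest correlation found among all pairs. The function names, the seed and the choice of 10 points per series (roughly a decade of yearly figures) are illustrative assumptions, not anything taken from Vigen's book.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(num_series, num_points, seed=0):
    """Generate unrelated random series and return the correlation
    with the largest absolute value found among all pairs."""
    rng = random.Random(seed)
    series = [
        [rng.gauss(0, 1) for _ in range(num_points)]
        for _ in range(num_series)
    ]
    best = 0.0
    for i in range(num_series):
        for j in range(i + 1, num_series):
            r = pearson(series[i], series[j])
            if abs(r) > abs(best):
                best = r
    return best


# Hypothetical comparison: a handful of series versus a few hundred.
few = strongest_chance_correlation(num_series=5, num_points=10)
many = strongest_chance_correlation(num_series=200, num_points=10)
print(f"strongest |r| among 5 series:   {abs(few):.2f}")
print(f"strongest |r| among 200 series: {abs(many):.2f}")
```

With 200 series there are 19,900 pairs to compare, so a very strong "relationship" between two of them is close to guaranteed even though none of the series has anything to do with any other.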

This leaves information management professionals waving their magic wands to determine which big data relationships are meaningful and which are merely coincidences. But that’s changing.

Machine Learning to the Rescue?

New recommendation engines use machine learning to identify meaningful correlations, and to provide suggestions about "what to do next." One example is Microsoft's Delve, which professes to "discover and organize the information that's likely to be most interesting to you right now … from within Office 365."

But these claims are a bit naïve, because despite the promise of new artificial intelligence capabilities, there is an inherent problem in discovering information that’s likely to be most interesting to you right now.

A simple example illustrates the problem:

Rita, a colleague, and I work together on the Omega Project, and we also work independently on other corporate projects. On top of our work responsibilities, Rita and I both enjoy basketball. Rita is the captain of a company intramural team, while my interest is more ‘couch-based,’ focused primarily on the company’s March Madness pool.

When evaluating what is interesting to me now, a recommendation engine like Delve looks at who is working on specific documents, and who is participating in social discussions. Based on these interactions, the engine would likely deduce that Rita and I work together on the Omega Project, and as such might suggest some of Rita’s documents to me as being interesting for my work on that project.

On the other hand, recommendation engines would likely misconstrue our common interest in sports and recommend basketball-related documents, emails or discussions as well.

Distracting With Irrelevant Recommendations

The problem isn’t as serious as the spurious correlations listed above, because in this case, we are indeed interested in a similar topic. The problem here is that these recommendations lack relevance. Our interests in basketball are different, plus they are totally irrelevant to our current work foci. So rather than encouraging productivity, these recommendations are counterproductive, because they are a source of distraction.

While this may be a simple and contrived example, it is easy to see how a lack of relevancy can cause recommendation engines to surface irrelevant and distracting suggestions.

One way to suppress uninteresting relationships is to use human feedback (i.e. supervised learning) to train the tools on which relationships are interesting, and which are not. These tools may take many iterations to ‘get it right,’ and in a highly dynamic environment, it might be impossible to reach a state where recommendations hit the mark.

Context Means Never Having to Say I’m Sorry

A much simpler method to surface relevant correlations already exists: ‘small data’ in the form of context.

Context provides the situational awareness that can make an apparently complex situation crystal clear. Perhaps the simplest example of context is location. Google Now uses your present location to provide highly relevant search recommendations, such as businesses located within walking distance.

Other forms of context include people (e.g. who I work with) and time (e.g. overlapping calendar appointments).

One of the most promising context types for enterprise recommendation engines is topics.

In the corporate example above, assume I just received an email from Rita describing a delay in the Omega Project caused by faulty components shipped by one of our suppliers. The topics listed in the email message provide the context for what is most interesting to me right now, namely, the Omega Project, the component supplier and the specific customer account. 

As such, a recommendation engine should suggest emails, documents and business transactions related to one or more of these topics. And it’s clear that using email as a context anchor eliminates the possibility of suggesting irrelevant content, like basketball-related emails, since they are completely unconnected to my current focus.

Of course, email is not the only context anchor. A document, a chat session or a CRM record all work. In essence, any artifact that specifies my current focus on the Omega Project, the component supplier and the customer can serve as an anchor.
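As a rough illustration of how a context anchor narrows the field, the sketch below keeps only candidate items that share at least one topic with the anchor, and ranks them by how many topics they share. Everything here is hypothetical: the Item class, the recommend function, the item titles and the topic labels are made up for this example, and it assumes the topics have already been extracted from each artifact by some other means. It is not a description of how Delve or any other product works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A candidate email, document or record with pre-extracted topics."""
    title: str
    topics: frozenset


def recommend(anchor_topics, candidates, min_shared=1):
    """Return candidates sharing at least `min_shared` topics with the
    anchor, ordered from most shared topics to fewest."""
    scored = []
    for item in candidates:
        shared = anchor_topics & item.topics
        if len(shared) >= min_shared:
            scored.append((len(shared), item))
    # sort() is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]


# Topics found in Rita's email about the delay (the context anchor).
anchor = {"Omega Project", "component supplier", "customer account"}

candidates = [
    Item("Intramural basketball roster", frozenset({"basketball"})),
    Item("Omega schedule update", frozenset({"Omega Project"})),
    Item("Supplier quality report",
         frozenset({"component supplier", "Omega Project"})),
    Item("March Madness pool", frozenset({"basketball"})),
]

for item in recommend(anchor, candidates):
    print(item.title)
```

The basketball items never appear, not because the engine knows they are a distraction, but because they share no topic with the artifact I am looking at right now.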

Why Wait for AI When the Future Is Already Here?

Many articles have been written about how artificial intelligence will impact the way business gets done, from self-driving vehicles to fully automated factories. 

While these technologies are still years away, the ability of recommendation engines to influence our daily schedule is already here. The previously mentioned Microsoft Delve is one example. However, it focuses solely on Office 365 and it primarily uses the connections between people to score relevancy.

Other products from startups focus on using artificial intelligence to make sense of email. Yet other companies focus on extracting and matching topics from multiple cloud services to surface the most relevant content recommendations.

So, while some companies will continue to manually make sense of their own version of "the connection between the number of Facebook users and the total US wind power generation capacity (99.3 percent)," others will be focusing on how AI can really impact their business, by leveraging meaningful topics to help them focus on what really matters at work.