Every CIO has heard the siren call of big data. 

With it, we’re promised, we will glean new insights from customers, markets and competitors, speed time to market, detect anomalies and security outliers, and lay the foundation for the machine learning algorithms that in turn underpin artificial intelligence.

The challenge is getting from promise to reality: the potential is there, but roadblocks stand in the way.

‘Big Data is Like Teenage Sex …’

“... everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ...” — Dan Ariely, author of “Predictably Irrational”

Saying no one is doing it might be an exaggeration: organizations are getting value from big data. Those that succeed understand their objectives, know the questions they want to ask, and have architected their data and processes to produce and act on insights.

This begins with identifying what data you have and understanding its value. The challenge is that most organizations are unaware of the full scope of their data sources and don’t understand how to apply that data to produce value for the enterprise.  

Consider the data “exhaust” thrown off by marketing applications. Marketing integration suites capture a huge variety of behaviors and interactions. If that information about the user’s “electronic body language” is not instrumented into content processes and engagement strategies, it goes to waste.    

Double Loop Learning – Pulling Levers vs. Changing the Levers to Pull

Don’t let the information go to waste. To use it well, answer these questions:

  • What are you trying to measure? 
  • Would you know it if you saw it? 
  • What would you do if you had it? 

You'll find the answers in the data indicators related to the customer lifecycle, and the learning and adaptation that result when acting on the data. 

“Double loop learning” — looping insights back to the organization for action — will help you accomplish this.  

The first learning loop is the cycle of data collection, observation and intervention, followed by a return to data collection to see whether the intervention had the desired impact.

The second loop is a larger review of the process and macro objectives, where lessons learned and interventions can change the hypothesis and the data collection itself. The objective is the same; the lessons, however, are learned at two levels: macro and micro.

At the micro level, you pull levers: making small adjustments to content, organizing principles, product groupings, associations, cross-sell rules and upsell configurations based on the data. At the macro level, you change your hypothesis and the types of interventions that take place at the micro level (changing the levers you are able to pull).
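The two loops can be sketched in code. This is a minimal, hypothetical illustration — the lever names, metrics and thresholds are invented for the example, not taken from any real system:

```python
# Double loop learning in miniature. The micro loop adjusts existing
# levers; the macro loop reviews the program and changes the set of
# levers themselves. All names and thresholds here are illustrative.

def micro_loop(levers, observe):
    """Single-loop learning: adjust existing levers based on observed data."""
    adjusted = {}
    for name, value in levers.items():
        metric = observe(name, value)   # collect data on this lever's impact
        if metric < 0.5:                # hypothetical intervention rule
            value *= 1.1                # small adjustment, e.g. boost a cross-sell rule
        adjusted[name] = value
    return adjusted

def macro_loop(levers, program_metric):
    """Double-loop learning: review the program and change the levers."""
    if program_metric < 0.3:                # the whole program underperforms
        levers.pop("banner_promo", None)    # retire an ineffective lever
        levers["personalized_offer"] = 1.0  # introduce a new kind of intervention
    return levers

levers = {"banner_promo": 1.0, "cross_sell_rule": 1.0}
levers = micro_loop(levers, observe=lambda name, value: 0.4)  # stubbed observation
levers = macro_loop(levers, program_metric=0.2)
```

After the macro review, the set of available levers is different — the organization has changed not just its settings but its repertoire of interventions.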

Clues from the User Vapor Trail

The “vapor trail” of user behavior data can tell you whether a piece of content, a relationship or a merchandising attribute is ineffective — it fails to lead to clicks, conversions, downloads or whatever behavior you're trying to achieve.

While small adjustments to content can sometimes be made with measurable impact, if an entire program is ineffective, your metrics might point to a significant revamp that requires development, testing, quality review and deployment — a process that can take months from start to finish. The clock speed and intervention actions of these loops are vastly different.

These scenarios depend on having a hypothesis about the correlation of observed data and the action. But what if you don’t have a clear understanding of the outcome? The variables and volume of interactions can be too large to make such changes manually.   

Adjusting content, product data and promotions for large enterprises at scale requires more automated approaches. But automated approaches still require metrics, objectives and hypotheses as part of the feedback loops that machine learning algorithms can leverage.  

Back to Big (and Large) Data

What does big data really mean when it comes to extracting value?   

There are large data sources and big data sources. Large data might be transaction-processing output from a large retailer: well-structured, good-quality data with a defined schema, where all of the elements are understood, consistent and normalized so that apples-to-apples comparisons can be made.

In contrast, big data that streams in from a variety of sources might contain different definitions of a customer — one defined as a family and the other as an individual. These results are harder to compare. 
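Reconciling those definitions is a prerequisite for comparison. Here is a small, hypothetical sketch — the feed shapes and field names are invented — showing two feeds normalized to a common per-individual view before their spend figures are combined:

```python
# Hypothetical sketch: two feeds define "customer" differently — one as
# a household, one as an individual — so their counts and averages
# cannot be compared directly. Normalize both to per-person records.

household_feed = [
    {"household_id": "H1", "members": 3, "spend": 300.0},
    {"household_id": "H2", "members": 1, "spend": 120.0},
]
individual_feed = [
    {"person_id": "P1", "spend": 90.0},
    {"person_id": "P2", "spend": 110.0},
]

def per_person(records):
    """Expand each record to one row per individual, splitting spend evenly."""
    rows = []
    for r in records:
        members = r.get("members", 1)  # individual feeds lack a members field
        for _ in range(members):
            rows.append(r["spend"] / members)
    return rows

combined = per_person(household_feed) + per_person(individual_feed)
avg_spend = sum(combined) / len(combined)
```

Only after both feeds describe the same unit — an individual — does an average across them mean anything.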

Moreover, in a big data scenario, the volume could be so great that conventional hardware and software takes too long to perform analyses or becomes too expensive to scale, even if the data is clean and well-structured.

Big data technologies help address this by running the analyses on lower cost commodity hardware and splitting processing up into smaller jobs. But in general, the big data that companies want is not well-formed, defined, clean or consistent. New discoveries come from analyzing and combining disparate sources and processing these in a way that looks for patterns and correlations across data sets.  
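The split-and-combine idea behind those technologies is the familiar map/reduce pattern. A toy sketch, with invented event data, shows the shape of it — each chunk could run on a separate commodity machine:

```python
# Toy illustration of splitting one large analysis into small jobs
# (the MapReduce pattern): count events per chunk, then merge the
# partial counts. The event data is invented for the example.

def map_chunk(records):
    """Count event occurrences in one chunk (would run on one worker)."""
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

def reduce_counts(partials):
    """Merge partial counts from all workers into one result."""
    total = {}
    for part in partials:
        for key, n in part.items():
            total[key] = total.get(key, 0) + n
    return total

data = ["click", "view", "click", "buy", "view", "click"]
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]  # split the job
totals = reduce_counts(map_chunk(c) for c in chunks)
```

Because each `map_chunk` call is independent, the work scales out across cheap hardware instead of up on one expensive machine.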

Time for a Swim in the Data Lake

For example, weather and traffic conditions impact a retailer’s sales. Those data feeds come from different systems in different formats, or potentially from different sources within a particular system. The systems may have different designs and conventions for naming the data elements. 

So-called “data lakes” allow these disparate types and structures to be stored in a repository without the predefined structures that traditional systems such as data warehouses require. You can keep adding data sources that consist primarily of sensor data or text information — Tweet streams or remote traffic monitors — and use algorithms to process and analyze the information and look for patterns. 
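This "schema on read" approach can be sketched simply. The record shapes below are invented for illustration — the point is that raw records land in the lake as-is, and structure is imposed only when an analysis asks for it:

```python
import json

# Sketch of "schema on read": a data lake stores raw, heterogeneous
# records without a predefined structure, and a schema is applied only
# at analysis time. The record shapes here are illustrative.

raw_lake = [
    json.dumps({"type": "tweet", "text": "loving the new store layout", "ts": 1}),
    json.dumps({"type": "traffic_sensor", "vehicles_per_min": 42, "ts": 2}),
    json.dumps({"type": "tweet", "text": "checkout line too long", "ts": 3}),
]

def read_with_schema(lake, record_type, fields):
    """Apply a schema at read time: keep only one record type,
    projecting the fields the analysis cares about."""
    out = []
    for line in lake:
        rec = json.loads(line)
        if rec.get("type") == record_type:
            out.append({f: rec.get(f) for f in fields})
    return out

tweets = read_with_schema(raw_lake, "tweet", ["text", "ts"])
```

A data warehouse would have rejected the sensor record for not matching the tweet schema; the lake keeps both and lets each analysis choose its own projection.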

Keep in mind how fast data changes. Sensor data might stream all day long, versus sales results that are batch processed at the end of the day. If you want to correlate sales with sensor data in real time, you need to sample the sales data continuously, which increases its velocity.

Sales data for a single product category changes more slowly than the cumulative transactions across all categories. When data is sampled in real time, both the amount of data (its volume) and how quickly it changes (its velocity) increase. Consider that sensors can throw off continuous streams of data (perhaps monitoring not just vehicle traffic but pedestrian flows) and that there can be hundreds or thousands of them in a target region, and the velocity climbs further.

Enter Omnichannel

An omnichannel campaign objective might be to understand the impact of traffic and weather on in-store and mobile promotions and customer segments. The data might include sensor data from stores measuring pedestrian traffic, clickstream data on the retailer’s web site, and mobile data from third parties correlated with anonymized demographic data.   

All of this data streams in very quickly, with new data points generated every second. You now have physical and behavioral data combined with vehicle traffic, weather, pedestrian traffic and mobile phone data, across dozens of demographic segments (the granularity of this data is scary — e.g., down to “trend-setting soccer moms without college degrees interested in crafts”). Now you add in your promotional campaigns for spring.

What do you make of such a data stew? This is the point at which the data has to be processed. 

Developing a set of hypotheses will narrow down the information you need for analysis. If you are looking for conversations about your products, you need a definition of the product and the variants in how people will describe the product — including misspellings. If you want to know positive or negative sentiment, the system has to determine if there are variations specific to product characteristics that people will call out in a positive or negative way.  
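Catching those variants, including misspellings, is a fuzzy-matching problem. A minimal sketch using Python's standard-library `difflib` — the product name, variants and threshold are invented, and a production system would use far richer NLP:

```python
from difflib import SequenceMatcher

# Hypothetical sketch: spotting product mentions despite spelling
# variants using a simple similarity ratio. The product name, variants
# and threshold are illustrative; real systems use richer NLP.

PRODUCT = "espresso machine"
VARIANTS = ["expresso machine", "espresso maker", "Espresso Machine", "lawn mower"]

def mentions_product(text, product=PRODUCT, threshold=0.75):
    """True if the text is a close-enough variant of the product name."""
    return SequenceMatcher(None, text.lower(), product.lower()).ratio() >= threshold

matches = [v for v in VARIANTS if mentions_product(v)]
```

The misspelled and reworded variants clear the threshold while an unrelated product does not — the definition of "the product" has been widened just enough to capture how people actually talk.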

Since definitions of sensor data parameters will differ according to the system, those also need to be reconciled. The ways we process the data are part of a family of algorithms called machine learning. Machine learning looks for patterns, classifies data and content, and predicts patterns based on new data sets. But none of this is possible without knowing what to look for. 

In this way, big data is no different from BI and traditional analytics. Anomalies, outliers and patterns are there to be found, but which ones? The conditions of low foot traffic on sunny days? The makeup of pedestrians on cloudy days when you run clearance sales? The characteristics of people who use promotions sent on iPhones who dislike the competitor products? The frequency of products purchased as a part of high margin solutions correlated with customer segment and promotion strategy? 

The questions and possibilities are endless. 

Don't Let Your Lake Turn Into a Swamp

The danger with data lakes is they can quickly turn into data swamps. Data needs to be organized and extracted. 

Just as with BI and data warehousing, you'll eventually need reference data — the standard names of products, markets, customer types, promotion types, demographics, classes of outliers, classifications of data types and conditions, security and privacy constraints and types of learning algorithms and processing models.  Data and metadata need to be catalogued and defined with the correct history, lineage, ownership, usage rights, source information, quality and accuracy assessments. 
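A catalogue entry can be as simple as a structured record carrying that metadata. A minimal sketch — the field names mirror the list above but are illustrative, not any particular catalogue product's schema:

```python
from dataclasses import dataclass, field

# Sketch of a minimal data catalogue entry. The fields mirror the
# metadata the article lists (lineage, ownership, usage rights,
# quality); the names and values are illustrative.

@dataclass
class CatalogEntry:
    name: str                    # standard business term
    source: str                  # originating system
    owner: str                   # accountable data owner
    lineage: list = field(default_factory=list)  # upstream datasets
    usage_rights: str = "internal"
    quality_score: float = 0.0   # result of a quality/accuracy assessment

catalog = {}

def register(entry):
    """Add an entry to the catalogue, keyed by its business term."""
    catalog[entry.name] = entry

register(CatalogEntry(
    name="pedestrian_traffic",
    source="store_sensors",
    owner="retail-analytics",
    lineage=["raw_sensor_feed"],
    quality_score=0.9,
))
```

Even this skeleton answers the governance questions that turn a swamp back into a lake: where did this data come from, who owns it, and how much can it be trusted?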

Organize data in catalogues with meaningful business terminology. To harness the power of big data and machine learning, we need organizing principles that act as the foundation of all analysis.   

This is not a trivial task. Some leave this to technical teams and others to business teams, but these teams may not collaborate effectively or understand the larger picture. A contextual enterprise architecture is the scaffolding and framework for knowledge in the organization. Increasingly, that knowledge is gleaned from diverse sources and hidden in streams that continually flow through the organization.  

Big data needs to be smarter, and contextualizing it adds the intelligence. 

Business users are looking for insights. Insights come from understanding patterns in data and seeing causal links and connections. By finding the cause and effect relationships between the things we have control of and the things we want to influence, we can start to predict the behaviors of customers, employees and products in the field.  

The starting point is consistency of language, concepts and terminology — the knowledge architecture — supported and maintained by governance and ownership.

Title image "M.C. Escher" (CC BY 2.0) by N. Feans