Developing a decision tree project can feel a lot like logging into your Netflix or Hulu account. There are so many programming choices that you can stay glued to your laptop for a long, long time.
But one thing that can leave you feeling chained to your screen is data drift. Data drift is the accumulation of changes in data (think mobile interactions, sensor logs and web clickstreams) that started life as well-meaning business tweaks or system updates, as CMSWire contributor Girish Pancha explains in greater detail here.
Analytics Versus Machine Learning
Data drift can be particularly troublesome for decision tree analysis. The key to understanding why is rooted in understanding how machine learning differs from analytics.
If you have used web analytics, then you have been exposed to descriptive analysis such as dashboard reports that rely on charts, alerts, medians and metrics.
Machine learning takes a different approach to reaching data conclusions. It relies on both historical and live data to produce decision analysis. This approach uses induction: identifying statistical relationships across a wide variety of data sources. Analysts can then select the best-performing model and deploy it on live data as the production model going forward.
Consistency is Vital to Decision Analysis
The value from decision tree modeling lies in deriving consistency from the decision analysis: will a given relationship hold up for most datasets? Any patterns detected can then be used to answer data-related business questions with greater accuracy than the results reported in analytics solutions.
But errors in the data become errors in the model. Data drift in the historical data can introduce errors into the prototype model, and those errors can then carry over into the production model.
Don’t Scale Your Errors
Live data errors that the prototype model did not account for can also slip into the production model. A common example is an ‘overfit’ decision tree model, which performs well on training data but fails once it is placed into production.
Overfitting generally happens when the model creates too many branches to account for data idiosyncrasies such as outliers and irregularities. In the process, the model learns noise rather than a general pattern, which increases the test set error and decreases the accuracy of its predictions. And whether the data is historical or live, when the model is used extensively, errors can scale their impact quickly.
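To see overfitting in miniature, here is a minimal sketch in plain Python. It assumes a made-up dataset with a single numeric feature and two deliberately mislabeled training rows. The tiny tree builder, the data and every function name are illustrative only and do not come from any particular library.

```python
from collections import Counter

def gini(rows):
    """Gini impurity of the labels in a list of (x, label) rows."""
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())

def build_tree(rows, max_depth=None):
    """Grow a one-feature tree; max_depth=None grows until leaves are pure."""
    labels = [label for _, label in rows]
    xs = sorted({x for x, _ in rows})
    if len(set(labels)) == 1 or len(xs) < 2 or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighboring values
        left = [r for r in rows if r[0] < t]
        right = [r for r in rows if r[0] >= t]
        cost = len(left) * gini(left) + len(right) * gini(right)
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    deeper = None if max_depth is None else max_depth - 1
    return (t, build_tree(left, deeper), build_tree(right, deeper))

def predict(tree, x):
    """Walk from the root to a leaf; internal nodes are (threshold, left, right)."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == label for x, label in rows) / len(rows)

def true_label(x):
    return int(x > 9.5)  # the real pattern in this hypothetical data

# Training rows include two mislabeled points (the "idiosyncrasies").
train = [(x, true_label(x)) for x in range(20)]
train[3], train[15] = (3, 1), (15, 0)
# Test rows are fresh points with correct labels.
test = [(x + 0.25, true_label(x + 0.25)) for x in range(20)]

shallow = build_tree(train, max_depth=1)  # one split only
deep = build_tree(train)                  # as many branches as it takes

for name, tree in [("shallow", shallow), ("deep", deep)]:
    print(name, "train:", accuracy(tree, train), "test:", accuracy(tree, test))
```

The deep tree adds extra branches to fit the two mislabeled rows, so it scores perfectly on the training rows but does worse than the one-split tree on the fresh test rows.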
Keep in mind that test models never assume data drift. Yet real-world data drift can have measurable long-term impact on advanced machine learning processes such as the kind of deep learning algorithms that attempt to model high-level abstractions in data or bots that are made smarter through brute-force training.
A Decision Analysis Cheat Sheet for Marketers
So what should marketers who aren’t data scientists watch out for?
1. Prototype your models before scaling them
Consider using simple prototype models to flush out categorical problems or reveal bad values before the model is scaled. A good prototype should give some indication of what basic threshold conditions should exist. You can then recreate those conditions to allow for early evaluation and nip decision tree overfitting in the bud before it starts.
2. Identify decision steps where your model will face multiple outcomes
When planning a decision tree process, whiteboard and highlight potential decision steps where your model will face multiple outcomes. At these points, it may be possible for your model to learn something that holds true in general — or only discover patterns that hold within one certain dataset. It is at these points where tidy data is especially critical to avoiding errors.
3. Evaluate both the objective of the model and its assumptions
Review any categorical assumptions applied to the data, particularly for techniques such as clustering, which assigns data to groups based on a similarity measure chosen by analysts. Assigning too many categories to data can introduce unnecessary variables, so marketers must qualify the categories against the model’s objective to avoid introducing selection bias.
4. Cross-validate your data to optimize your solution
Establish a post-tree procedure using cross-validation data to check the effect of pruning. Cross-validation data tests whether improvements will come from expanding a labeled decision tree point, known as a node. If the results show an improvement, then the model continues expanding that node. But if they show a reduction in accuracy, then the node should be converted to a leaf node, which denotes the end of a decision tree branch.
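One common way to carry out this kind of check is reduced-error pruning, which is swapped in here as an illustration because the tip does not name a specific method. The sketch below is plain Python on a made-up dataset with one numeric feature. It grows a full tree, then works from the bottom up and converts a node to a leaf whenever the leaf does no worse on held-out validation rows. All names and data are hypothetical.

```python
from collections import Counter

def majority(rows):
    """Most common label among (x, label) rows."""
    return Counter(label for _, label in rows).most_common(1)[0][0]

def gini(rows):
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())

def build_tree(rows):
    """Grow a one-feature tree until every leaf is pure."""
    node = {"label": majority(rows)}
    xs = sorted({x for x, _ in rows})
    if len({label for _, label in rows}) == 1 or len(xs) < 2:
        return node  # leaf
    def split_cost(t):
        left = [r for r in rows if r[0] < t]
        right = [r for r in rows if r[0] >= t]
        return len(left) * gini(left) + len(right) * gini(right)
    t = min(((lo + hi) / 2 for lo, hi in zip(xs, xs[1:])), key=split_cost)
    node["threshold"] = t
    node["left"] = build_tree([r for r in rows if r[0] < t])
    node["right"] = build_tree([r for r in rows if r[0] >= t])
    return node

def predict(node, x):
    while "threshold" in node:
        node = node["left"] if x < node["threshold"] else node["right"]
    return node["label"]

def errors(node, rows):
    return sum(predict(node, x) != label for x, label in rows)

def count_leaves(node):
    if "threshold" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

def prune(node, valid_rows):
    """Bottom-up: keep a split only if it beats a plain leaf on validation rows."""
    if "threshold" not in node:
        return node
    t = node["threshold"]
    node["left"] = prune(node["left"], [r for r in valid_rows if r[0] < t])
    node["right"] = prune(node["right"], [r for r in valid_rows if r[0] >= t])
    leaf = {"label": node["label"]}
    return leaf if errors(leaf, valid_rows) <= errors(node, valid_rows) else node

def true_label(x):
    return int(x > 9.5)  # the real pattern in this hypothetical data

train = [(x, true_label(x)) for x in range(20)]
train[3], train[15] = (3, 1), (15, 0)  # two mislabeled training rows
valid = [(x + 0.25, true_label(x + 0.25)) for x in range(20)]
test = [(x + 0.75, true_label(x + 0.75)) for x in range(20)]

tree = build_tree(train)
full_leaves, full_test_errors = count_leaves(tree), errors(tree, test)
pruned = prune(tree, valid)  # note: prunes the tree in place

print("leaves before/after:", full_leaves, count_leaves(pruned))
print("test errors before/after:", full_test_errors, errors(pruned, test))
```

The validation rows decide which branches survive, and the separate test rows are kept aside so the final comparison is not judged on the same data that guided the pruning.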
5. Ensure that all data come from the same time period
Training data and testing data must be drawn from the same time period. Training data from one time period and testing data from another time period can lead to misleading results, since the data may well change from period to period.
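A quick way to spot a mismatch between periods is to compare how a single feature is distributed in each one. The sketch below is a rough check under stated assumptions: the feature (session length in minutes), the sample values and the 0.5 cutoff are all made up for illustration, and a standardized mean shift is only one of many possible drift measures.

```python
from statistics import mean, stdev

def standardized_shift(reference, candidate):
    """Distance between the two means, measured in reference standard deviations."""
    return abs(mean(candidate) - mean(reference)) / stdev(reference)

# Hypothetical feature: session length in minutes.
training_period = [10, 12, 11, 13, 9, 10, 12, 11]
same_period_test = [11, 10, 12, 12, 10, 11, 13, 9]
later_period_test = [15, 17, 16, 18, 14, 15, 17, 16]

DRIFT_THRESHOLD = 0.5  # assumed cutoff; tune it for your own data

for name, sample in [("same period", same_period_test),
                     ("later period", later_period_test)]:
    shift = standardized_shift(training_period, sample)
    verdict = "possible drift" if shift > DRIFT_THRESHOLD else "looks consistent"
    print(f"{name}: shift={shift:.2f} ({verdict})")
```

A large shift does not prove the model will fail, but it is a signal that test results drawn from that period may not reflect how the model was trained.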
Tune In to Your Data and Stay Curious
Ultimately, marketers using decision trees must keep the same perspective that good analytics practitioners maintain: a high level of curiosity when reviewing data details and a proactive habit of checking results against objectives. Doing so won’t solve every problem, but it will make conducting an analysis as rewarding as binge-watching your favorite show.