How Marketers Can Get Started Selecting the Right Data for Machine Learning Models

Remember feeling nervous before a school exam? For marketers today, a quiz about what keeps a machine learning model in balance would feel just as scary.

If you are a marketer charged with establishing a data model and feel you need to cram for your quiz on working with machine learning, understanding how overfit and underfit work in data modeling is a great place to start. When marketers learn how machine learning concerns affect their workflow decisions, they can work better with technical teams when selecting data for machine learning models.

Machine Learning Basics: Where Bias and Variance Fit in Overfit–Underfit

Overfit is a condition in which a model treats noise in its training data as a reliable indicator rather than an anomaly. Instead of dismissing the uncharacteristic data as the noise it is, the model accounts for it. When the training is influenced by noise in this way, the model makes poor predictions on any new data set that lacks the same or similar noise, namely the production data.

Underfit is the opposite condition, and it causes different performance issues in machine learning models. Underfit means that the model or algorithm does not capture the data well enough to represent the statistical relationships within it.

Both overfit and underfit can be described with two statistical error measures called bias and variance. Bias is the degree of learning error a model makes by simplifying its inferences from a given set of training data. These simplifications make it easier to estimate the parameters needed to establish the model.

Variance, on the other hand, is the degree of model sensitivity: how much the model's estimate of the target function changes when the training data changes. It expresses how varied the output from a model will be when it is given different training data.

Taken together, bias and variance describe overfit and underfit in a model. Overfitting occurs if the model or algorithm shows low bias but high variance, and underfitting occurs if it shows low variance but high bias.
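To make the two conditions concrete, here is a minimal sketch in Python using NumPy. Everything in it is an assumption made for illustration: the data is synthetic (a smooth curve plus random noise), and the three polynomial degrees simply stand in for a too-simple model, a reasonable one, and a too-flexible one. The test set plays the role of production data the model has never seen.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Synthetic data (an assumption): a smooth curve plus random noise."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(200)    # stands in for unseen production data

def mse(model, x, y):
    """Mean squared error of the model's predictions on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# Map each polynomial degree to (training error, test error).
errors = {}
for degree in (1, 4, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:>2}: train MSE {errors[degree][0]:.3f}, "
          f"test MSE {errors[degree][1]:.3f}")
```

Training error can only fall as the model becomes more flexible, which is why training error alone cannot reveal overfit. The number to watch is the gap between training error and test error: a model that looks excellent on its training data but noticeably worse on new data is showing the low-bias, high-variance pattern described above.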

Marketers Can Enhance Their Knowledge to Improve ML Decisions

Understanding that the bias-variance tradeoff is an ongoing problem in machine learning is key for marketers who want to work well with data analysts. An ideal model accurately captures the patterns in its training data, yet generalizes well to unseen data. In other words, both the learning error (bias) and the model sensitivity (variance) are minimal. Unfortunately, minimizing bias and variance at the same time is typically impossible, because reducing one tends to increase the other.

A model with high variance does a good job of fitting the training dataset, but it is overfitted to noisy or unrepresentative training data. A model with high bias, on the other hand, is typically overly simplistic and overlooks important relationships in the data.
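The contrast between a high-bias model and a high-variance model can be estimated directly, at least on synthetic data where the true relationship is known. The sketch below is one way to do it, under stated assumptions: the "true" curve, the noise level, and the two polynomial degrees are all hypothetical choices for illustration. It retrains each model many times on fresh noisy copies of the same inputs, then measures how far the average prediction sits from the truth (squared bias) and how much the predictions move between retrainings (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_signal(x):
    """The 'real' relationship, known only because the data is synthetic."""
    return np.sin(3 * x)

def bias_variance(degree, n_repeats=200, n_train=30, noise=0.3, seed=0):
    """Estimate (squared bias, variance) of a polynomial model of given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1, 1, n_train)   # same inputs every time
    x_eval = np.linspace(-0.8, 0.8, 50)     # where predictions are compared
    preds = np.empty((n_repeats, x_eval.size))
    for i in range(n_repeats):
        # A fresh training set: same inputs, newly drawn noise.
        y_train = true_signal(x_train) + rng.normal(0, noise, n_train)
        preds[i] = Polynomial.fit(x_train, y_train, degree)(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_signal(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

simple = bias_variance(degree=1)     # straight line: the overly simple model
flexible = bias_variance(degree=9)   # wiggly curve: the overly flexible model
print(f"simple model:   bias^2 {simple[0]:.4f}, variance {simple[1]:.4f}")
print(f"flexible model: bias^2 {flexible[0]:.4f}, variance {flexible[1]:.4f}")
```

With real marketing data the true relationship is unknown, so these two quantities cannot be computed this way. Analysts instead rely on held-out data or cross-validation to see the same effect indirectly.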

You're probably getting a sense of how complicated decision making around data and data models is for managers and analysts. Marketers can play a part here. Marketers have critical insights into how key data assumptions can influence the balance of these models, and those insights help reveal bias or variance concerns when the initial data variables and data sources are being established. For example, in my post on dimension reduction, I explained why selecting too many variables makes a model untrainable. In my post on machine learning pipelines, I also described the options available to help marketers organize data-related tasks with teams without needing to understand every model programming detail. When marketers keep the general qualities of bias and variance in mind, they can make better data model decisions and recommend the right data resources.

As any good teacher likely taught you, planning your resources is how you will ace your test.