Decluttering fever is sweeping the country thanks to Marie Kondo. But clutter doesn't only pile up in the physical world — it can be found in our digital worlds as well.
Marketers should take the decluttering lessons to heart when it comes to selecting data for their machine learning initiatives. Too many unnecessary data categories can raise issues that will bedevil the effectiveness and accuracy of machine learning models from the start.
The Curse of Dimensionality
Determining the data variables for a model is an important first step. These variables represent features of whatever the model describes, for example a product, a service or an operating condition.
Selecting the right number of data variables can be difficult — just how many variables are needed? It can be hard knowing where to start when faced with multiple sources of data, be it device sensors, location data associated with GPS, Point-of-Sale data, or third-party data from a data lake. This kind of impasse brings us to the curse of dimensionality.
The curse of dimensionality holds that as the number of dimensions grows, the amount of data needed to maintain accuracy increases exponentially. Here, the dimensions are the variables (features) fed into a model.
In other words, when we indiscriminately add features to a model, we will reach a limit. Go beyond that limit, and the amount of data you'll need to train will increase to unreachable levels. Your data represents real-world products, services or operations. This puts a cap on the amount of data you can realistically expect, given the business resources available.
Marketers should therefore add only the dimensions that are most relevant to predicting the output variable.
Otherwise, pushing past these limits comes at the cost of the accuracy of the analysis: dimensions that contribute little information to the training data can degrade model accuracy.
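The exponential growth described above can be made concrete with a simple counting argument. The sketch below assumes, purely for illustration, that each feature is split into a fixed number of ranges and that you want a handful of training rows for every combination of ranges. The function names and the default of 10 rows per combination are hypothetical, not a rule for sizing real datasets.

```python
def combinations_to_cover(ranges_per_feature, num_features):
    """Number of distinct feature-range combinations in the data space."""
    return ranges_per_feature ** num_features


def rows_needed(ranges_per_feature, num_features, rows_per_combination=10):
    """Rows required to see every combination a fixed number of times."""
    return rows_per_combination * combinations_to_cover(
        ranges_per_feature, num_features
    )


# Each feature is split into 5 ranges (an arbitrary illustrative choice).
for num_features in (1, 2, 5, 10):
    print(num_features, "features ->", rows_needed(5, num_features), "rows")
```

Each added feature multiplies the requirement by the number of ranges, so the row count quickly exceeds what most businesses can collect.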
Principal Component Analysis: A Remedy to Dimensionality
Principal component analysis (PCA) is one tactic marketers can use to avoid the curse of dimensionality. It is a dimension-reduction method that transforms a large set of variables into a smaller set of uncorrelated variables called principal components. Examining how much of the data's variance each component captures helps determine how many dimensions a statistical or machine learning analysis actually needs. For machine learning, this reduces the likelihood of overfitting, a condition in which a model has more parameters than the data can support.
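To illustrate how variance is distributed across principal components, here is a minimal pure-Python sketch. It standardizes the columns, builds their covariance matrix, and extracts components one at a time with power iteration and deflation. The feature names (ad_spend, impressions, store_visits) and the synthetic data are assumptions made up for this example, and a real project would more likely rely on an established library than on hand-written linear algebra.

```python
import math
import random


def standardize(rows):
    """Center each column at 0 and scale it to unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def covariance_matrix(centered_rows):
    """Sample covariance matrix of rows whose columns are already centered."""
    n, d = len(centered_rows), len(centered_rows[0])
    return [
        [sum(r[i] * r[j] for r in centered_rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, rng, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its vector."""
    d = len(matrix)
    vec = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(
        vec[i] * matrix[i][j] * vec[j] for i in range(d) for j in range(d)
    )
    return value, vec


def explained_variance_ratios(rows):
    """Share of total variance captured by each principal component."""
    rng = random.Random(0)
    cov = covariance_matrix(standardize(rows))
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    ratios = []
    for _ in range(d):
        value, vec = top_eigenpair(cov, rng)
        ratios.append(value / total)
        # Deflation: subtract the component just found before searching again.
        cov = [
            [cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]
    return ratios


# Synthetic, made-up data: impressions is almost a rescaled copy of ad_spend,
# while store_visits is unrelated to both.
data_rng = random.Random(42)
rows = []
for _ in range(200):
    ad_spend = data_rng.gauss(100, 20)
    impressions = 50 * ad_spend + data_rng.gauss(0, 100)
    store_visits = data_rng.gauss(30, 5)
    rows.append([ad_spend, impressions, store_visits])

ratios = explained_variance_ratios(rows)
for index, ratio in enumerate(ratios, start=1):
    print(f"PC{index}: {ratio:.1%} of variance")
```

Because two of the three synthetic features carry nearly the same information, the last component captures almost no variance, which suggests that two dimensions would serve a model about as well as three.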
PCA does involve math, but programming solutions that run on Python or R, the popular languages for data science and machine learning, are readily available to assist. In R, a package called DataExplorer includes a PCA function that can take data, apply normalization and graph the results.
Data scientist Boxuan Cui provided a great example online (the image below recreates his example). A dataset (nycflights) is first imported. Its tables are collected in a list and then merged by column; in a business setting, the merge represents a first educated pass at which features to include as columns. The “na.omit” function is then used to create a data frame with rows containing missing values eliminated. The resulting graph shows which principal components cover the most variance and, as a result, which reduced set of dimensions to use to minimize the likelihood of overfitting.
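The preparation steps in that workflow, merging tables on a shared column and then discarding incomplete rows, can be sketched in plain Python as follows. This is a rough analogue rather than Cui's R code: the helper names, the table layout and every value are made up for illustration and do not come from the nycflights dataset.

```python
def merge_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


def drop_incomplete(rows):
    """Keep rows with no missing (None) values, in the spirit of na.omit."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical tables with invented values.
flights = [
    {"carrier": "XX", "dep_delay": 5.0, "distance": 1000},
    {"carrier": "YY", "dep_delay": None, "distance": 700},
    {"carrier": "ZZ", "dep_delay": -2.0, "distance": 800},
]
airlines = [
    {"carrier": "XX", "name": "Airline X"},
    {"carrier": "YY", "name": "Airline Y"},
    {"carrier": "ZZ", "name": "Airline Z"},
]

merged = merge_on_key(flights, airlines, "carrier")
clean = drop_incomplete(merged)
print(len(merged), "merged rows,", len(clean), "complete rows")
```

Only the complete rows would then be passed on to a PCA step, since missing values would otherwise distort the variance calculations.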
To preview the features and associated data, marketers can use cloud database repositories like data.world and Kaggle to examine how data appears in database columns. Data.world also provides a preview window to explore column headings and visualize how data fields will appear. The preview can help you plan joins, unions, merges and reshapes: data management methods in R, SQL and Python that are frequently applied to datasets used in machine learning training and testing. (Side note: you may find the discussion of planning concepts for data tables in the article "How Marketers Can Plan Data Mining With R Programming" helpful.)
PCA isn't the only dimension reduction technique available. There are others, suited to data whose relationships are not linear and to niche machine learning needs. But dimension reduction is a great way to understand up front what will work for a machine learning model. This tactic will likely grow more prevalent in machine learning as businesses learn to leverage data from embedded devices and customer-facing solutions.