
While the analytics world is being transformed by the EU’s General Data Protection Regulation, marketers are discovering a potential new solution to an old challenge: How to best manage data.

The challenge has always been how to structure data-preparation processes so that problems in the data surface early. Marketers who are not technically savvy may feel overwhelmed by this, but the R programming language can handle many of the tasks involved in readying data for advanced analytics or machine learning models.

I discussed a few starter ideas in an earlier post titled “Here’s Why Every Modern Marketer Needs R Programming.” Now let’s take a look at the workflow to implement those ideas. I should note that, while there are some syntax differences, many of these tasks can be accomplished in other languages used for data modeling, such as Python.

Digging Into Data with R Programming

A good starting point is the data import. The good news is that a large number of R libraries are available to interface with databases and platform APIs. They include twitteR, which you can use to query tweets from a Twitter feed; RMongo, for accessing a MongoDB database; jsonlite, for JSON data; and haven, for SPSS and SAS files. All of these libraries can be found through a quick search of the Comprehensive R Archive Network (CRAN).
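As a quick illustration, here is a minimal sketch of importing JSON with jsonlite; the campaign fields and values are hypothetical, and the sketch assumes the jsonlite package is installed.

```r
# Minimal sketch: importing JSON data with jsonlite.
# Assumes install.packages("jsonlite") has been run;
# campaign names and numbers are made up for illustration.
library(jsonlite)

json_text <- '[
  {"campaign": "spring_sale", "clicks": 120, "cost": 45.50},
  {"campaign": "retargeting", "clicks": 85,  "cost": 30.25}
]'

# fromJSON() simplifies an array of objects into a data frame by default
campaigns <- fromJSON(json_text)
str(campaigns)
```

The same pattern applies to the other import libraries: each one hands the data back as an R object you can inspect before modeling.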

Once you've set up and imported your data sources, your focus should turn to data wrangling — the process of mapping data from one raw format into another — and transformation — merging, splitting and rearranging columns and rows.

It's a good idea to map out the metrics you are using and determine whether each one fits into one of the following mathematical categories:

  • Discrete metrics: values that are distinct. Each field holds a whole-number count or a category (male or female; laptop, tablet or smartphone, etc.).
  • Continuous metrics: values that can fall anywhere within a range. Think of calculated metrics, such as an average order value or a ratio, where the actual value can vary continuously.

Framing metrics as discrete or continuous determines how the data will be processed in an R script so that each value is read consistently from its column and held within a data frame (R's tabular data object). A data frame can hold only a single type in each column, so this step ensures that types are not mixed.
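A short base-R sketch makes the point concrete; the column names here are illustrative.

```r
# Sketch: a data frame keeps one type per column, so classify each
# metric up front. Column names are hypothetical.
orders <- data.frame(
  device = factor(c("laptop", "tablet", "smartphone", "laptop")),  # discrete
  avg_order_value = c(54.20, 31.75, 22.10, 61.40)                  # continuous
)

# Each column reports a single, consistent type
sapply(orders, class)

# Mixing types in one vector silently coerces everything to character
mixed <- c(54.20, "tablet")
class(mixed)  # "character" -- the numeric value was coerced
```

That silent coercion is exactly what classifying metrics up front protects against: a single stray category label can turn a whole numeric column into text.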


Another potential step is verifying the planned columns. Does the data source provide headers? If not, R offers a programmatic way to add headers once the data is imported.

Most importantly, are the headers the same labels used by all parties who will access the data? Answering that question can reveal a more efficient way to access the data repeatedly, without manually relabeling columns every time before the data is placed in a model.
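Here is a minimal base-R sketch of adding agreed-upon headers after import; the file contents and column labels are hypothetical.

```r
# Sketch: assigning headers programmatically after import (base R).
# The inline data and column labels are made up for illustration.
raw <- read.csv(text = "101,2300,0.042\n102,1850,0.051",
                header = FALSE)

# Assign the labels everyone downstream has agreed on
colnames(raw) <- c("campaign_id", "impressions", "ctr")
head(raw)
```

Setting the names once in code, rather than by hand in a spreadsheet, means every rerun of the pipeline produces the same labels.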

Most programming languages provide libraries for transforming data — to move columns to rows or vice versa. For R, the following core libraries can ease transformation tasks:

  • readr, for reading data in rectangular formats like CSV, with functions that guess each column's type.
  • tidyr, for organizing tabular data into a consistent structure and managing missing field values.
  • dplyr, for transforming data once it has been loaded into R.

These libraries make explicit how the code will interpret the organization of the data and its fields. They can help reveal empty values, which might otherwise be misread as anomalies by a machine learning model. At the end of the day, the analyst accounts for the field values within each column, with every column and dimension holding a distinct attribute.
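The missing-value check these libraries streamline can be sketched in base R, so the example runs without extra packages; tidyr's replace_na() and dplyr's verbs offer tidier equivalents. The channel names and visit counts are hypothetical.

```r
# Sketch: surfacing empty values before they reach a model (base R).
# Data is made up for illustration.
sessions <- data.frame(
  channel = c("email", "social", NA, "search"),
  visits  = c(340, 120, 95, NA)
)

# Count the NA (missing) values in each column
colSums(is.na(sessions))

# Drop incomplete rows, or decide on an explicit fill value instead
complete <- na.omit(sessions)
nrow(complete)  # 2
```

Whether to drop or fill those rows is a judgment call, and, as the next section argues, one where marketing domain knowledge matters.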

Marketing Knowledge Still Plays a Role

Finally, marketers should not dismiss their domain knowledge when modeling data. Sometimes your experience will help you see the best way to treat an outlier for a model. Or you might help your technical team understand where to adjust data in the cloud in a scenario where other teams downstream assess data. Your own team will benefit from guidance on which data should be queried and how it should be parsed.

The aforementioned steps will have more specific tasks involved, depending on your end application, so expect to invest some time and effort.

Overall, any steps involved with data mining should result in data organized in a way that enables it to be used repeatedly. That repeatability is vital to making machine learning and data visualization tasks convenient and efficient for business decisions.