Did you have a chore you hated doing when you were a kid? For a lot of marketers, maintaining clean data can at times feel like the grown-up equivalent.
But cleaning data doesn't have to feel like a chore. In fact, the steps involved should provoke much-needed discussions on how to better protect the privacy of the data you collect. Structuring clean data discussions with privacy compliance in mind can highlight where an organization can improve its compliance practices and ease regulatory fears.
Linking Clean Data Efforts to Data Privacy Efforts
The timing is right to strengthen efforts in both of these areas as organizations grow increasingly concerned about new regulations. The first major legislation since the arrival of GDPR, the California Consumer Privacy Act (CCPA), is scheduled to go into effect on Jan. 1, 2020, with amendments being debated between now and that time. More legislation, like the New York Privacy Act, is expected. These privacy measures will inevitably spark deeper discussions on data vigilance.
So how can managers approach clean data in a way that aids their organizations' data privacy efforts? The key is to consider three critical aspects that define clean data in an advanced model and, consequently, define the activities needed.
Clean data is identifiable to you
When you look at a data table, you understand what data populates the fields and what the values should be telling you. Your understanding will be based on the subject area to which the data is being applied, and that subject area will drive the degree of data literacy needed to properly clean the data.
Clean data organizes data into an intended format
The data has to be organized in a way that allows its use in a data model, whether the data fields appear in a .csv file or a SQL database. The format applied to every field should reflect the format you want for the models you intend to build.
Clean data has no obvious bad details
Obvious can be a subjective term, because you are relying on what is obvious to the professional doing the scrubbing. But that professional should spot records in your data that are inaccurate, irrelevant or incomplete. These records must be repaired or removed.
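The three aspects above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive cleaning routine: the field names (`email`, `signup_date`), the assumed MM/DD/YYYY source format and the very loose email check are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical field names, chosen only for illustration.
REQUIRED_FIELDS = ("email", "signup_date")


def clean_records(records):
    """Split dict records into (clean, rejected) lists."""
    clean, rejected = [], []
    for record in records:
        # Incomplete: a required field is missing or blank.
        if any(not str(record.get(f) or "").strip() for f in REQUIRED_FIELDS):
            rejected.append(record)
            continue

        email = record["email"].strip().lower()
        # Inaccurate: a loose sanity check, not real email validation.
        if "@" not in email:
            rejected.append(record)
            continue

        try:
            # Intended format: assume source dates arrive as MM/DD/YYYY
            # and the model expects ISO 8601 (YYYY-MM-DD).
            signup = datetime.strptime(
                record["signup_date"].strip(), "%m/%d/%Y"
            ).date()
        except ValueError:
            rejected.append(record)
            continue

        clean.append({**record, "email": email, "signup_date": signup.isoformat()})
    return clean, rejected
```

Keeping the rejected records in a separate list, rather than silently dropping them, leaves room for a person to decide whether each one should be repaired or removed.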
A Holistic Approach to Customer Data Practices
Making data clean for machine learning provides obvious value for an organization. What may not be as obvious is how much any clean data discussion relates to the topic of privacy protection. Many compliance measures, from GDPR to forthcoming legislation in the US, require identifying a data processor and controller. These are the teams responsible for identifying the impact of data usage within an organization, such as retention of data, declaring the purpose for data collection, and documentation of associated processes.
Thus, many aspects of this clean data checklist dovetail with privacy compliance requirements. If an analyst is deciding what is identifiable, it may help to determine which identifiable elements qualify as Personally Identifiable Information (PII). The discussion on intended format can reveal how the data could potentially be combined to reveal someone’s identity in a data breach, making it clearer which data fields are critical for identity protection.
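One way to start that conversation is to tag each field during the cleaning pass. The sketch below is only an assumption of how such tagging might look: the two keyword sets are hypothetical, and a real inventory of identifying fields would come from your privacy or legal team rather than from a hard-coded list.

```python
# Hypothetical field lists, for illustration only.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn"}
# Fields that may help identify a person when combined with others.
QUASI_IDENTIFIERS = {"zip", "birth_date", "gender"}


def classify_fields(field_names):
    """Group field names into direct, quasi and other buckets."""
    result = {"direct": [], "quasi": [], "other": []}
    for field in field_names:
        key = field.strip().lower()
        if key in DIRECT_IDENTIFIERS:
            result["direct"].append(field)
        elif key in QUASI_IDENTIFIERS:
            result["quasi"].append(field)
        else:
            result["other"].append(field)
    return result


report = classify_fields(["Email", "zip", "birth_date", "plan"])
```

Here `report["quasi"]` lists the fields that deserve a second look whenever they appear together in the same table.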
Processors and controllers also have data retention responsibilities. How long should clean data be held? What are the procedures for determining how long data is held? The answers to these questions can dictate the type of analytics services you'll need. They can also help verify how legacy data is being processed.
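A retention rule can be checked during the same cleaning pass. The following is a minimal sketch under stated assumptions: the `collected_on` field name, its ISO date format and the 365-day window used in the example are hypothetical, and the actual retention period is a policy decision, not something code can determine.

```python
from datetime import date, timedelta


def past_retention(records, retention_days, today=None):
    """Return records collected before the retention cutoff date."""
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    # 'collected_on' is an assumed field holding an ISO date string.
    return [
        r for r in records
        if date.fromisoformat(r["collected_on"]) < cutoff
    ]


legacy = past_retention(
    [{"id": 1, "collected_on": "2018-01-01"},
     {"id": 2, "collected_on": "2019-06-01"}],
    retention_days=365,
    today=date(2019, 7, 1),
)
```

With these sample values only the first record falls before the cutoff, so it is the one flagged for review or deletion.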
Clean Data Is a Start
Clean data won’t solve every problem, and for many professionals its associated tasks will feel like chores no matter what. But with the right mindset, this work can highlight how best to manage both privacy and accurate model analysis, two duties that are becoming essential to any successful organization.