There are all kinds of clichés thrown around technology conferences at the moment. Here are two about data that are currently doing the rounds: "Data is the new oil" and "garbage data in, garbage insights out", a reference to the impact of feeding "dirty" data into machine learning and artificial intelligence tools.
When Is Data Dirty?
Dirty data, also known as rogue data, is inaccurate, incomplete or inconsistent data, especially in a computer system or database. In a paper from MIT in 2017, J. Schneider and J. Kitsuse defined dirty data as follows: “Hidden and dirty data is information which is kept secret and whose revelation would be discrediting or costly in terms of various types of sanctioning. The data can be dirty in different ways. But in all cases, it runs contrary to widely shared standards and images of what a person or group should be.”
Dirty data can contain such mistakes as spelling or punctuation errors, incorrect data associated with a field, incomplete or outdated data, or even data that has been duplicated in the database. It can be cleaned through a process known as data cleansing. This has been an ongoing problem for enterprises, particularly with new regulations like GDPR and the upcoming California Consumer Privacy Act (CCPA), which threaten stringent penalties if private, personal data used by organizations is made public.
According to Gartner, augmented data management will become a feature of many organizations' technology landscapes and will improve their ability to analyze incoming data more dynamically, with greater levels of automation and closer to real time. In doing so, it will also help clean that data. However, dirty data is still a problem.
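As a rough illustration of one cleansing step mentioned above, removing duplicated records, here is a minimal Python sketch. The function names, the field names and the choice to treat case and surrounding whitespace as insignificant are all assumptions made for this example, not a description of any particular tool.

```python
def _normalize(value):
    """Lower-case and trim a value so trivially different copies compare equal."""
    return "" if value is None else str(value).strip().lower()


def deduplicate(records, key_fields):
    """Keep the first record seen for each normalized key.

    Returns (unique_records, dropped_count). `key_fields` is the list of
    fields that together identify a record (an assumption of this sketch).
    """
    seen = set()
    unique = []
    dropped = 0
    for record in records:
        key = tuple(_normalize(record.get(field)) for field in key_fields)
        if key in seen:
            dropped += 1
        else:
            seen.add(key)
            unique.append(record)
    return unique, dropped


# Hypothetical sample data: the second row repeats the first with
# different casing and a trailing space.
customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},
    {"name": "Grace Hopper", "email": "grace@example.com"},
]
unique_customers, dropped_count = deduplicate(customers, ["name", "email"])
```

Exact-match keys like this only catch the easy cases. Misspellings and outdated values need fuzzier matching or a trusted reference source.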
The Data Challenge
“It's a serious challenge. In organizations, it’s usually 40-50% of the effort that goes into these kinds of manual tasks around machine learning,” said Vaclav Vincalek, CEO of Vancouver-based PCIS. “Realistically, it’s not going to get better, as organizations will keep getting more and more data.”
To counter that, part of your machine learning project has to be data quality assurance. When a new data set arrives, IT leaders need to know that it complies with their requirements. The problem is not just how to get clean data, but how to correlate data from different sources.
Vincalek cites the example of address data. You’d think this was a simple thing to manage. But there are companies that do nothing but clean up physical street addresses in databases, because people write the same street type in different ways (e.g., Street, St., ST). This problem is not new to machine learning; it’s been there ever since people started collecting data. People doing a census for the government have the same issue. The only difference is that we think we might need even more data.
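One way to picture that kind of quality assurance is a check that runs whenever a new data set arrives. The sketch below is illustrative only: the column names, the expected types and the 5% missing-value threshold are assumptions, and a real project would set them from its own requirements.

```python
def check_dataset(rows, required_columns, max_missing_rate=0.05):
    """Return a list of problems found; an empty list means the set passes.

    `rows` is a list of dicts and `required_columns` maps a column name to
    the Python type its values should have. Both shapes are assumptions
    of this sketch.
    """
    if not rows:
        return ["data set is empty"]
    problems = []
    for column, expected_type in required_columns.items():
        missing = 0
        wrong_type = 0
        for row in rows:
            value = row.get(column)
            if value is None or value == "":
                missing += 1
            elif not isinstance(value, expected_type):
                wrong_type += 1
        if missing / len(rows) > max_missing_rate:
            problems.append(f"{column}: {missing} of {len(rows)} values missing")
        if wrong_type:
            problems.append(
                f"{column}: {wrong_type} values are not {expected_type.__name__}"
            )
    return problems


# Hypothetical incoming batch with one mistyped id and one missing city.
incoming = [
    {"id": 1, "city": "Vancouver"},
    {"id": "2", "city": None},
]
issues = check_dataset(incoming, {"id": int, "city": str})
```

Returning a list of problems instead of raising on the first one lets a reviewer see everything wrong with a batch in a single pass.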
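The street-suffix problem can be sketched in a few lines of Python. The mapping below is a deliberately tiny, hypothetical table, and the function only looks at the last word of the address, so it will miss cases such as a unit number after the street name. Commercial address cleaning handles far more than this.

```python
import re

# Hypothetical, incomplete mapping of suffix variants to one canonical form.
SUFFIXES = {
    "st": "Street",
    "str": "Street",
    "street": "Street",
    "ave": "Avenue",
    "avenue": "Avenue",
    "rd": "Road",
    "road": "Road",
}


def normalize_street(address):
    """Collapse whitespace and rewrite a trailing street suffix canonically."""
    collapsed = re.sub(r"\s+", " ", address.strip())
    if not collapsed:
        return ""
    words = collapsed.split(" ")
    # Compare the final word without a trailing period and ignoring case.
    last = words[-1].rstrip(".").lower()
    if last in SUFFIXES:
        words[-1] = SUFFIXES[last]
    return " ".join(words)


# The three spellings from the text all collapse to the same string.
variants = ["12 Main Street", "12 Main St.", "12  Main ST"]
normalized = {normalize_street(v) for v in variants}
```

Once the variants agree, records from different sources can be matched on the normalized address instead of the raw text.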
When Is Data Clean Enough?
There seems little dispute about the fact that cleaner data is better data. The trouble is that no one seems to agree on when data is clean enough. These days, it’s not unusual for companies to spend six months or even a year on data cleansing projects. They often undertake these tasks without any idea of how much return they’re going to get from all of that effort. “The inside joke is that data scientists spend 80% of their time cleaning data and the other 20% complaining about it,” said Arijit Sengupta, founder and CEO of Aible. He pointed out that when companies put a priority on completely clean data, they lock themselves into a data cleansing process that ultimately has no satisfactory end. After a year of data cleansing, the data has been scrubbed so thoroughly that all of the rough edges have been removed.
Then the AI trained on the supposedly clean data goes into production and it is asked to predict based on new real-world data. However, the new data doesn’t look anything like the clean data it was trained on. So now you have to perfectly replicate the data cleansing steps you performed in training, but in a different programming language in production, hoping to turn the production data into something that looks like what the AI has seen before. And that’s almost impossible to do. “If the AI trained on the data helps you achieve 10% better business results — 10% more revenue, 10% lower cost, 10% fewer customers churned — then the data is clean enough. Businesses need to make decisions based on real-world data, not a sanitized version of reality. Manual data cleansing takes a long time to produce clean, stale data,” he said.
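The replication problem described above comes from maintaining two copies of the cleansing logic. A minimal sketch of the alternative, assuming training and production can share one code base, is to define the cleansing step once and call it from both paths. The field names and cleansing rules here are hypothetical.

```python
def clean_row(row):
    """The single cleansing step, applied identically in both paths."""
    raw_revenue = row.get("revenue")
    revenue_text = "" if raw_revenue is None else str(raw_revenue).replace(",", "").strip()
    return {
        # Trim and upper-case the country code.
        "country": str(row.get("country") or "").strip().upper(),
        # Strip thousands separators; keep missing revenue as None, not 0.
        "revenue": float(revenue_text) if revenue_text else None,
    }


def prepare_training(rows):
    """Clean a historical batch before the model is trained."""
    return [clean_row(row) for row in rows]


def prepare_for_prediction(row):
    """Clean one live record before it is scored in production."""
    return clean_row(row)


# The same raw record yields the same features on both paths.
raw = {"country": " ca ", "revenue": "1,200.50"}
training_features = prepare_training([raw])[0]
production_features = prepare_for_prediction(raw)
```

This does not settle how much cleansing is worthwhile, which is Sengupta's real point. It only removes one source of mismatch between what the model saw in training and what it sees in production.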
Developing A Data Strategy
Like everything else in the digital workplace, developing a strategy is the key to solving this problem: in this case, a data strategy. A data strategy outlines the process by which data are organized, structured, and shared so that data can be utilized in machine learning models. Rather than saying that these two concepts intersect, it’s probably better to describe data strategy as the crucial step that feeds into the ability to build effective machine learning models. “Data strategy involves how data are stored and updated, data governance, security, and downstream use of results,” Paul Harmon, data scientist at Bozeman, Mont.-based Atrium, told CMSWire.
For example, if explanatory data have missing values, multiple entries, or are stored in disparate systems, the efficiencies gained through ML models may be hampered or lost because the information isn’t complete or is hard to combine and use effectively. Additionally, data can be updated in one system without being updated in the other, leading to different results depending on which source data are used, even for something as simple as reporting. These problems are all exacerbated when using data for more complicated tasks, like predictive analysis using machine learning models. “By thinking through data management before major data collection begins, the effort involved later in leveraging those data to answer business questions is minimized,” he added.
Statisticians often belabor the point that more emphasis should be placed on the aspects involved in data collection (What are the questions we’re trying to answer? What data will we collect? How will it be organized?) rather than on how the data will be analyzed and used, because one inevitably falls out from the other.
By implementing data-driven processes such as machine learning models and analytics-based reporting (e.g., dashboards), organizations will be able to highlight gaps in their data and figure out where they need to make improvements. By manually reviewing automated processes and statistical results, organizations will be better poised to identify optimal data and reduce their dependency on low-quality data.
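The "updated in one system but not the other" problem can be made visible with a simple comparison. The sketch below assumes each system can be exported as a mapping from a shared record ID to a record. The system names, IDs and fields are hypothetical.

```python
def find_mismatches(system_a, system_b, fields):
    """Compare two {record_id: record} mappings and report disagreements."""
    report = {"only_in_a": [], "only_in_b": [], "conflicts": []}
    for record_id in sorted(set(system_a) | set(system_b)):
        if record_id not in system_b:
            report["only_in_a"].append(record_id)
        elif record_id not in system_a:
            report["only_in_b"].append(record_id)
        else:
            for field in fields:
                a_value = system_a[record_id].get(field)
                b_value = system_b[record_id].get(field)
                if a_value != b_value:
                    report["conflicts"].append((record_id, field, a_value, b_value))
    return report


# Hypothetical exports: customer 2 changed email in billing only,
# and customer 3 exists only in billing.
crm = {1: {"email": "a@x.com"}, 2: {"email": "b@x.com"}}
billing = {1: {"email": "a@x.com"}, 2: {"email": "b@y.com"}, 3: {"email": "c@x.com"}}
mismatch_report = find_mismatches(crm, billing, ["email"])
```

A report like this does not say which system is right. That is a governance decision, which is why Harmon places it inside the data strategy.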
Methods for scoring the quality of datasets and individual records are in the process of being implemented. Over the next few years, we expect organizations will have the ability to actually account for the quality of each record in statistical models, effectively mitigating the negative effect of low-quality data on the results.
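To make the idea of accounting for record quality concrete, here is one very simple possibility: score each record by how complete it is, then use that score as a weight when computing a statistic. Both the completeness-based score and the weighted mean are assumptions chosen for illustration, not the scoring methods the article refers to.

```python
def completeness_score(record, fields):
    """Fraction of the listed fields that are populated (0.0 to 1.0)."""
    filled = sum(1 for field in fields if record.get(field) not in (None, ""))
    return filled / len(fields)


def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all weights are zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical orders: the second record is missing two of three fields,
# so it pulls the average less than the complete first record does.
FIELDS = ["amount", "region", "channel"]
orders = [
    {"amount": 100.0, "region": "West", "channel": "web"},
    {"amount": 400.0, "region": "", "channel": None},
]
scores = [completeness_score(order, FIELDS) for order in orders]
quality_adjusted_average = weighted_mean([o["amount"] for o in orders], scores)
```

Down-weighting is gentler than deleting incomplete records: the low-quality record still contributes, just less.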
Point of Entry Problems
The biggest data problems come from the point of entry. Important data points will be omitted or entered incorrectly anytime humans are involved in the data entry process. Augmented data management needs to be paired with good data governance practices, and the apps at the point of data entry need to enforce data integrity rules as dictated by downstream systems. Unfortunately, this doesn't happen often, as the application owners often have competing priorities with the data stewards and good data governance rules don't often make it into the points of input. However, augmented data management will help data stewards more easily identify bad data and provide intelligent recommendations about what the correct answer is likely to be.
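Enforcing integrity rules at the point of entry can be as simple as refusing to save a form until it passes a set of checks. The rules below (a required name, a loosely validated email and a required postal code) are hypothetical stand-ins for whatever the downstream systems actually require.

```python
import re

# Loose shape check only: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_entry(form):
    """Return {field: message} for every rule the submitted form breaks."""
    errors = {}
    if not str(form.get("name") or "").strip():
        errors["name"] = "name is required"
    email = str(form.get("email") or "").strip()
    if not EMAIL_PATTERN.match(email):
        errors["email"] = "email looks malformed"
    if not str(form.get("postal_code") or "").strip():
        errors["postal_code"] = "postal code is required"
    return errors


# A clean submission and one a data-entry app should reject.
good = validate_entry(
    {"name": "Ada", "email": "ada@example.com", "postal_code": "V6B 1A1"}
)
bad = validate_entry({"name": " ", "email": "ada@", "postal_code": ""})
```

Rejecting a bad record at entry costs the user a few seconds. Finding and repairing it downstream is the expensive cleansing work described earlier.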
It is a long road to good data governance and clean data, but augmented data management tools will be very important for reaching the holy grail of clean and complete data.
Aristotelis Kostopoulos is vice-president of AI product solutions at New York City-based Lionbridge. He argues that while there is no one technique to get rid of garbage data, this should not be a problem. With advances in machine learning and the introduction of deep learning, we are now able to process and derive insights from unstructured data, which was impossible a few years ago. Today, text, images, audio, and video (all different manifestations of unstructured data) are produced in volumes that can grow exponentially.
Imagine the number of emails or tweets generated every single day, the images posted on social media, and the sensor data produced by IoT devices. Using this data together with structured data, machine learning can give us better insights.