Big data's emergence promised tremendous opportunities for businesses to gain real-time insights and make more informed decisions. But as is often the case with disruptive technologies, the innovations behind big data created a critical problem: data drift.
Data drift creates serious technical and business challenges for organizations looking to harness the insights and full potential big data offers. Fortunately, understanding data drift is the first step toward mitigating its harmful effects on data quality.
What Exactly Is Data Drift?
Data drift is a natural consequence of the diversity of big data sources. The operation, maintenance and modernization of these systems causes unpredictable, unannounced and unending mutations of data characteristics.
Consider sources such as mobile interactions, sensor logs and web clickstreams. The data those sources produce changes constantly as the business tweaks, updates or even re-platforms the underlying systems. The sum of these changes is data drift.
Data drift exists in three forms: structural drift, semantic drift and infrastructure drift.
Structural drift occurs when the data schema changes at the source. Common examples of structural drift are fields being added, deleted or re-ordered, or a field's type being changed. For instance, a bank adds leading characters to its text-based account numbers to support a growing customer base. The change to the data causes the bank's customer service system to conflate data related to bank account 00-23456 with account 01-23456.
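One way to catch structural drift early is to compare each incoming record's fields against the schema the pipeline expects before the record reaches a data store. The sketch below illustrates the idea; the field names are hypothetical, not taken from any particular system.

```python
# Sketch: flag structural drift by diffing an incoming record's fields
# against the expected schema. Field names here are illustrative.
EXPECTED_FIELDS = {"account_id", "name", "balance"}

def detect_structural_drift(record: dict) -> dict:
    """Return fields added to or removed from the expected schema."""
    incoming = set(record)
    return {
        "added": sorted(incoming - EXPECTED_FIELDS),
        "removed": sorted(EXPECTED_FIELDS - incoming),
    }

drift = detect_structural_drift({
    "account_id": "01-23456",
    "name": "A. Smith",
    "balance": 100.0,
    "branch_code": "NYC",  # new field at the source
})
# drift["added"] now contains "branch_code" — a signal to alert on
# rather than silently ingest or drop.
```

A real implementation would also track type changes and field ordering, but even this minimal diff turns an unannounced schema change into an explicit, actionable event.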
Semantic drift occurs when the meaning of the data changes, even when the structure hasn’t. A real-world example comes from a digital marketing firm that saw a sudden revenue spike. After some deep digging, they determined that the spike was in fact a false positive caused by their migration from IPv4 to IPv6 network addressing, which led the agency’s analytic system to misrepresent the data.
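Semantic drift like the IPv4-to-IPv6 migration above can be caught by validating that values still match the format downstream logic assumes, not merely that the field exists. The sketch below uses Python's standard `ipaddress` module; the `client_ip` field name is an assumption for illustration.

```python
import ipaddress

# Sketch: detect semantic drift in an IP-address field. The structure
# (a string field) is unchanged, but the meaning shifts once IPv6
# addresses start arriving. "client_ip" is a hypothetical field name.
def is_expected_ipv4(value: str) -> bool:
    """True if the value parses as the IPv4 address downstream logic assumes."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

records = [{"client_ip": "203.0.113.9"}, {"client_ip": "2001:db8::1"}]
drifted = [r for r in records if not is_expected_ipv4(r["client_ip"])]
# The IPv6 address is perfectly valid text, so a schema check passes it —
# but any analytics keyed on IPv4 octets will misinterpret it. Routing
# such records to an alert, rather than straight into reports, avoids
# the false-positive revenue spike described above.
```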
Infrastructure drift typically occurs when changes to the underlying software or systems create incompatibilities. In the big data world this is a common occurrence, since organizations rely on a multitude of source systems governed by others, each on its own upgrade path.
Data Fidelity and End-to-End Operations Suffer As a Result
Data drift hurts data fidelity and damages the reliability and productivity of downstream data analysis. Data fidelity issues arise when data becomes corroded, is lost or is squandered.
Corrosion occurs when data that has drifted passes into data stores undetected. Loss occurs when data drift causes information to be errantly removed from the data stream because it doesn’t conform with the expected schema. Squandering occurs when a new field (and therefore new information) is created at the source but is ignored by an inflexible ingest process.
These three consequences of data drift pollute data stores and downstream analysis by causing the data set to become incomplete, inconsistent and inaccurate. Until detected, the use of this low-fidelity data can lead to false analyses and damage the business. If and when the bad data is detected, it creates “janitorial” work for data engineers and scientists as they attempt to clean up the mess.
In a world driven by big data — where change is a constant — any processing architecture requiring stability and control is doomed. This is particularly true of legacy extract, transform and load (ETL) processes, which rely on well-defined and stable schema.
ETL systems are brittle (they fail when structure drifts) and opaque (they cannot monitor for semantic drift). To keep up with data drift, organizations must constantly patch ETL processes in an ad hoc fashion, creating endless low-level work for time-strapped data professionals.
Three Sayings That Dissect Data Drift Damage
When it comes to summing up the impact of data drift, three sayings come to mind:
- Garbage in, garbage out: Polluted, incomplete or misinterpreted data leads to false insights and missed insights, which drive improper and potentially harmful business decisions
- Trust takes a lifetime to build and only a moment to lose (Warren Buffett): Once an organization loses trust in its data, the reputation of its big data initiative is damaged
- The biggest expense is opportunity cost: The brittleness of ETL processes creates janitorial work that makes it difficult to respond to new business requirements and innovate
Data drift erodes data fidelity and crushes the efficiency and reliability of the enterprise’s data operation. Businesses require new tools and operational approaches to tackle data drift and fully harness the business value of their big data flows.
Title image: Luca Sanon