Almost anyone who works with technology and data has heard the mantra, “Trash in, trash out.” Invalid data is not a new phenomenon, but the age of big data has made the problem more pronounced because the sheer volume of data makes human examination impractical. Open source business analytics provider Pentaho has teamed with data quality company Human Inference to make it a little easier for enterprises to ensure their business data is trustworthy.

More Data, Less Quality

Anybody who’s even peripherally interested in technology has likely been exposed to at least one conversation or article about spiraling data volumes. While much of that conversation has focused on storage challenges, sophisticated new analysis tools and the countless opportunities the data can unlock, it seems that only the people responsible for working with big data acknowledge the ugly little fact that big data can mean big quality issues.

Bad data can sometimes be worse than no data. It can waste valuable time and resources. It can provide misleading answers that drive bad decisions. It’s just, well, bad. Pentaho has integrated Human Inference’s EasyDQ platform with Pentaho Business Analytics to make it simpler for companies to manage data quality across the enterprise.

The new plug-in provides:

  • Data profiling -- helps organizations understand data contents (e.g., the highest value in a column, the number of nulls, the number of rows).
  • Data cleansing -- address cleansing and standardization for over 240 countries, name cleansing, phone number checking and formatting, and email cleansing.
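
To make the profiling idea concrete, here is a minimal sketch in plain Python of the three checks mentioned above: row count, null count and highest value. It is not the EasyDQ plug-in or any Pentaho API; the function name profile_column and the sample order_totals column are hypothetical, made up purely for illustration.

```python
from typing import Any, Dict, Iterable


def profile_column(values: Iterable[Any]) -> Dict[str, Any]:
    """Return basic profile statistics for a single column.

    Hypothetical helper for illustration only; None stands in for a null.
    """
    rows = list(values)
    non_null = [v for v in rows if v is not None]
    return {
        "rows": len(rows),                      # how many rows exist
        "nulls": len(rows) - len(non_null),     # how many nulls exist
        "max": max(non_null) if non_null else None,  # highest value, if any
    }


# Made-up sample column with two missing values.
order_totals = [120.5, None, 87.0, 240.25, None]
profile = profile_column(order_totals)
print(profile)
```

Even a summary this simple can flag a column with an unexpected share of nulls or an implausible maximum before the data reaches a report.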

These features can help get better-quality data into the hands of end users faster and reduce the frustrating, spirited discussions that start with, “That’s not what I found when I looked at the data.”

Getting More Details

The plug-in is available now for version 4.2.x and above of Pentaho Data Integration/Kettle. In addition, Pentaho and Human Inference are holding a webinar on Thursday, May 10. The webinar is free, but registration is required.