You know the feeling when you go searching for an afternoon snack and, despite your refrigerator bursting at the seams, you can’t find anything you want to eat? Or your closet is overflowing with clothes, but there’s nothing to wear? Well, the same can be said for a data scientist in search of data: there’s tons of it, but never enough.

Despite the vast amounts of data companies are collecting today, there’s often not enough of the right data to train algorithms to perform specific functions. In fact, Gartner found poor data quality costs businesses anywhere from $9.7 million to $14.2 million annually.

Even when data quality is excellent (clean, properly labeled and classified), organizations may find that they need more. At a minimum, organizations need close to 10,000 labeled data points to give an AI system enough information to draw patterns, extract insights and generate predictions.

However, data is typically difficult to procure, and the data collection process can often be more arduous and time-consuming than building the actual machine learning models.

When this happens, data scientists must resort to synthetic data, which, as the name implies, is data that is artificially generated and based on made-up events rather than collected from real ones.

Synthetic Data: The Additive to Effective AI

Synthetic data is not as random as it sounds. It’s based upon an expected outcome or hypothesis, as well as the statistical properties of the real dataset.

Here’s an example of the effective use of synthetic data: a healthcare insurer needed to determine how frequently customers with kidney disease file claims and for what reasons. It found it didn’t have enough internal data to train the algorithm, so it supplemented that data with synthetic data. The added data was created from fictional, yet possible, claims on typical ailments of those with chronic kidney disease, and informed by elements of the initial dataset.
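
To make the idea concrete, here is a minimal Python sketch of how fictional claims could be generated from the statistical properties of a small real dataset. Everything in it is an assumption for illustration: the claim reasons, the dollar amounts, and the helper names fit and generate are invented, and a simple per-reason normal distribution stands in for whatever generation method an insurer would actually use.

```python
import random
import statistics
from collections import Counter

# Hypothetical "real" claims as (reason, amount in dollars).
# These values are invented for illustration only.
real_claims = [
    ("dialysis", 1200.0),
    ("dialysis", 1350.0),
    ("lab work", 180.0),
    ("lab work", 220.0),
    ("hypertension medication", 95.0),
    ("dialysis", 1280.0),
]

def fit(claims):
    """Summarize the real data: how often each reason occurs,
    plus the mean and standard deviation of amounts per reason."""
    counts = Counter(reason for reason, _ in claims)
    stats = {}
    for reason in counts:
        amounts = [a for r, a in claims if r == reason]
        mean = statistics.mean(amounts)
        # stdev needs at least two points; use 0.0 for a single observation
        stdev = statistics.stdev(amounts) if len(amounts) > 1 else 0.0
        stats[reason] = (mean, stdev)
    return counts, stats

def generate(counts, stats, n, seed=0):
    """Draw n fictional claims that follow the fitted statistics."""
    rng = random.Random(seed)
    reasons = list(counts)
    weights = [counts[r] for r in reasons]
    synthetic = []
    for _ in range(n):
        # Pick a reason in proportion to its frequency in the real data
        reason = rng.choices(reasons, weights=weights, k=1)[0]
        mean, stdev = stats[reason]
        # Assumption: amounts are roughly normal; clamp at zero
        amount = max(0.0, rng.gauss(mean, stdev))
        synthetic.append((reason, round(amount, 2)))
    return synthetic

counts, stats = fit(real_claims)
synthetic_claims = generate(counts, stats, n=1000)
```

The synthetic claims are fictional, yet each one is plausible given the real data, which is the property the insurer in the example relied on. A production approach would likely model more fields and the relationships between them.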

The irony is that real data is used to develop the synthetic data, and the process feeds on itself, becoming more and more accurate over time.

In another example, synthetic data can be used to train a computer vision application. Perhaps an urban planner needs to identify how many eight-wheelers use a specific stretch of highway each year. The computer vision app will be tasked with identifying each one in an image. Yet if there are no images available, a 3D model of the truck can be created and strategically placed in different scenarios, ultimately training the application to tell eight-wheelers apart from smaller trucks or cars.

In both examples, data scientists have found that when they start with a base model that already does something very similar to the task at hand, and then retrain it with synthetic data, the resulting models can become highly accurate.

Synthetic Data: The Antidote to Privacy Concerns

Data privacy is not only becoming a key concern; it’s also the law. Especially in regulated industries such as healthcare, banking and finance, companies must secure the privacy of individuals’ personal information. This can pose a problem when that sensitive information is exactly what’s needed to help an algorithm make accurate decisions.

To work with personal data, companies have to find a way to anonymize the data, utilizing only non-identifiable data and transferring it securely. And even when all these time-consuming steps are taken, there’s still a level of risk and often still a need for more data. By generating synthetic samples modeled on the real dataset, companies can leverage the key characteristics of the original datasets without compromising data privacy.
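
As a rough illustration of that trade-off, the Python sketch below generates synthetic records from column-level statistics, then runs two sanity checks: that no synthetic row exactly copies a real row, and that a key characteristic (the average claim total) is preserved. The records and the helper names synthesize and exact_matches are hypothetical. Treating the columns as independent is a simplifying assumption, and an exact-match check is only a basic screen, not a privacy guarantee.

```python
import random
import statistics

# Hypothetical real records as (age, annual_claim_total).
# Invented values, standing in for sensitive customer data.
real = [(54, 4200.0), (61, 5100.0), (47, 3900.0), (70, 6400.0), (58, 4800.0)]

def synthesize(records, n, seed=0):
    """Sample n synthetic rows from per-column mean and stdev.
    Assumption: columns are independent and roughly normal."""
    rng = random.Random(seed)
    columns = list(zip(*records))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [
        tuple(round(rng.gauss(mean, stdev), 1) for mean, stdev in params)
        for _ in range(n)
    ]

def exact_matches(real_records, synthetic_records):
    """Return synthetic rows that are identical to a real row."""
    real_set = set(real_records)
    return [row for row in synthetic_records if row in real_set]

synthetic = synthesize(real, n=500)

# Check 1: synthetic rows that duplicate a real person's record
leaks = exact_matches(real, synthetic)

# Check 2: how closely a key characteristic of the original is preserved
real_mean = statistics.mean(r[1] for r in real)
synth_mean = statistics.mean(s[1] for s in synthetic)

print(f"exact copies of real rows: {len(leaks)}")
print(f"mean claim total, real vs. synthetic: {real_mean:.0f} vs. {synth_mean:.0f}")
```

Any rows reported by the first check would need to be dropped or regenerated before the synthetic set is shared.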

4 Tips for Successfully Leveraging Synthetic Data

The role of synthetic data is necessary and growing, but not all synthetic data is created equal. What can you do to ensure the synthetic data that is fueling your algorithm will be most effective? Consider the following:

  1. It’s all in the base. Start with a base model that proves the algorithm can work and then source the data accordingly. By starting with an algorithm that is fairly accurate, you know that more of the same kind of data will only make it smarter.
  2. Consider the source. Make sure your synthetic data provider has expertise in the full lifecycle of AI development. The provider should understand the importance of clean data, how much of it will be needed to work most effectively, and the role of testing.
  3. It takes more than a tool. Generating synthetic data can be quite complex and requires human knowledge, not just an analytical tool. It also requires advanced validation frameworks, which in turn call for specific talent trained on those systems.
  4. Look for data that goes deep. The most coveted synthetic data is that which addresses very rare or specific issues, for which very little data exists. When sourcing data, make sure the datasets have been created based on a deep understanding of the characteristics and attributes of your specific use case.

Gartner estimates that by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. As the need for relevant training data grows to enable increasingly informed AI, organizations and data scientists will be turning to synthetic data to augment their existing data for the nurturing and feeding of their solutions. After all, algorithms can be gluttons when it comes to data: there’s simply no such thing as too much.