Machine Learning Datasets: Build Or Buy?

IFI CLAIMS Patent Services has a global patent database with more than 110 million records from about 100 countries that the company has painstakingly assembled over the years. “We take information from different data sources and we standardize it and put it in a usable format that companies can either access directly or they can build a user interface on top of it,” Director of Marketing Catherine Suski said. Could the company have acquired this comprehensive dataset instead? Hardly, according to Suski. There is nothing like it in the world, she said. Furthermore, the company is continually adding to it and monitoring it for quality. IFI CLAIMS does checks to make sure the data is correct, which involves a lot of manual work by editors. “There are more than 2,000 variations on the name IBM,” she said. “Our database standardizes all these variations to make it easier for customers.”
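To make the idea of name standardization concrete, here is a minimal Python sketch. It is not IFI CLAIMS' method, which the article describes only as involving manual editorial work. The alias table, the suffix list, and the function names `clean` and `standardize` are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical alias table: cleaned variant -> canonical name.
# A real table would be curated by editors and be far larger.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "intl business machines": "IBM",
}

# Legal-form suffixes dropped before lookup (illustrative, not exhaustive).
SUFFIXES = {"corp", "corporation", "inc", "incorporated", "co", "company", "ltd"}


def clean(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal-form suffixes."""
    text = name.lower().replace(".", "")  # "I.B.M." becomes "ibm"
    tokens = re.sub(r"[^a-z0-9 ]+", " ", text).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def standardize(name: str) -> str:
    """Return the canonical name if the cleaned form is known, else the input."""
    return ALIASES.get(clean(name), name)


for raw in ["I.B.M. Corp.", "Intl. Business Machines, Inc.", "Acme Widgets"]:
    print(raw, "->", standardize(raw))
```

Unknown names pass through unchanged, so anything the table does not cover can be routed to a human reviewer rather than being guessed at.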

In our automated and connected economy, data has become the coin of the realm. Companies need it to market to their customers, to develop their products, to make corporate decisions, and to build and test their own apps. It is ubiquitous, or so it would seem. In truth, the right dataset can be as valuable as any hard asset that a company might own. “As much as we all feel inundated with information overload, ask any machine learning researcher and they will tell you that there isn’t enough data to learn,” said Mansi Singhal, CEO of qplum. Essentially, a company has three choices: it can build a dataset, as IFI CLAIMS did; it can buy one; or it can do a little of both.

Whatever the company decides, Singhal said, it will not be cheap. “This is something that many other firms are facing as the big hurdle: how to create large datasets economically so that models can be trained and answers can be more accurate and high quality,” she said. But there is more than just cost to consider. Indeed, a company has to weigh several, sometimes competing, factors as it makes a decision about the data it will use. It’s rarely a pure IT decision, such as choosing the most comprehensive dataset available. “It’s always a game of maximizing value,” said Natalie Robb, founder of Wavelength Analytics. “Like anything else, determining which dataset is best is based on trade-offs,” she said. In other words, the “best” dataset is one that fits the budget, meets the project’s data quantity and quality needs, and fits its time constraints.

“Let’s say you want to know consumer sentiment,” Robb said. “Amazon and eBay make customer review data available — and nothing is more comprehensive than Amazon’s customer reviews.” But, she continued, the dataset is enormous, which means you need to have the requisite machine learning and natural language processing tools and skills.

Ultimately, the question is not so much where to get the data, but which kind of dataset could support your business, said Majken Sander, business analyst and solution architect with TimeXtender. “No matter your goal, there’s a dataset out there to support you.” As companies dive into this process, here are some arguments for each possibility.

Building It

Sometimes a company has a specific purpose in mind for the data that cannot be satisfied by anything on the market. At qplum, for example, the company creates its own data to train its chatbot that answers clients’ queries about the service and to run a financial assessment for them, Singhal said. “There is no other way for us to do this apart from creating data through manual and automated ways along with learning from the user base coming to our platform, given the specific use case.”

Also, a company may be surprised at the depth of data it can tap from everyday sources to build a dataset. Manish Sharma, founder of Contactous, is a big believer in building datasets using such data. For example, he said, his organization has:

Built a contact database of more than 30,000 records within a month from business cards collected by sales reps over the years for a B2B software organization.

Constructed a database of 800,000 consumers over a 6-month period by digitizing manually written warranty cards for a reputed company in consumer electronics.

“A small home maintenance organization with minimum automation started to scan the 2,000 service reports to create their own CRM database within a week,” Sharma said.

But the most valuable data a company has is undoubtedly what it knows about its existing customers and prospects, said Nick Worth, CMO at Selligent Marketing Cloud. “Deepening that understanding with a powerful marketing platform and augmenting that critical first-party data with external sources is the right place to start to both deepen existing customer relationships and develop look-alike models for acquiring more customers,” he said.

Acquiring The Data

The website The Penny Hoarder acquires smaller datasets from third-party vendors that the site may already be using for other purposes, according to IT Director Stephen McDermott. “For example, most web analytics and audience analytics vendors provide APIs and it is common for those APIs to provide data which is not aggregated, but specific. They provide valuable insights through their own dashboards to our business teams, but also have access points for more detailed data.”

There are a number of sources from which companies can buy data, Robb said. There are aggregators such as Lotame, Gravy Analytics or Foursquare. There are enterprise data sources such as Hoovers, DiscoverOrg, ZoomInfo and Lead 411. A company can get enterprise buyer intent data from companies such as Bombora or 6Sense. Finally, a company can buy another organization’s first-party data from marketplaces or exchanges, Robb said.

The choices can be overwhelming, and the prices and methods vary across the board, Robb said. “It’s crucial to know what you are buying. For example, you need to know how much data is modeled. This means, how much extrapolation do they do to populate the datasets. Also, how much is a dataset validated?”

A Combination of Both

The best option for building datasets is connecting and blending public and private data, said Christopher Penn, co-founder of Brain+Trust Insights. For example, he said, a healthcare company will have its own proprietary data, but that data may lack depth. This company could go to Medicare or other government websites and append additional relevant data to its dataset to beef it up for statistical analysis, data science, machine learning and AI.

Last year, Penn did a test with the AHRQ Medicare Hospital Quality Dataset, which he said had some serious gaps in it. “I blended it with the Bureau of Labor Statistics and Census Bureau data to enrich the AHRQ dataset and ‘complete’ it with imputation,” he said.
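The blend-then-impute pattern can be sketched in a few lines of Python. This is not Penn's actual workflow: the article does not say which join keys or imputation technique he used, so the sketch assumes a simple left join on a shared key followed by mean imputation, the most basic option. The rows, field names, and the helper functions `blend` and `impute_mean` are all made up for illustration.

```python
from statistics import mean

# Made-up rows standing in for a proprietary dataset with a gap (None).
hospitals = [
    {"county": "A", "readmission_rate": 0.125},
    {"county": "B", "readmission_rate": None},
    {"county": "C", "readmission_rate": 0.375},
]

# Made-up rows standing in for a public dataset keyed on the same field.
public = {
    "A": {"median_income": 50000},
    "B": {"median_income": 60000},
    "C": {"median_income": 70000},
}


def blend(rows, extra, key):
    """Left join: copy each row and append the public fields sharing its key."""
    return [{**row, **extra.get(row[key], {})} for row in rows]


def impute_mean(rows, field):
    """Fill missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


enriched = impute_mean(blend(hospitals, public, "county"), "readmission_rate")
for row in enriched:
    print(row)
```

Mean imputation is only a placeholder here. Once the public fields are attached, a model-based approach could use them to estimate the missing values, which is closer to what “enriching” a dataset before imputation suggests.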
