The Problem With Voice Datasets

Much of what NXP Semiconductors does is design and build chips. It then layers application software on top of those chips so customers can do things they might not otherwise be able to do in-house, such as develop voice recognition applications. NXP has worked with the Amazon Alexa and Google Assistant development teams to launch development kits for both platforms. The company is also working on modules that make it easier to train smart voice devices.

NXP does not rely on a large, diverse, high-quality voice dataset to do this work, said Steven Tateosian, director of secure IoT and industrial solutions. Instead, NXP trains its voice applications in two ways:

It works with third-party IP providers that create and process the models for specific words.
It enables its products to work within broader ecosystems such as Alexa Voice Service or Google Assistant.

“There are companies that build these tools and create the models as their core competency,” Tateosian explained. “They will either custom create the data, or buy it or the client can send what data they have.” This route is not uncommon for companies needing datasets to train voice applications.

Testfire Labs is a natural language processing, machine learning and artificial intelligence company that is building workforce productivity tools for activities such as business meetings. It started collecting its own data before it even had an alpha of its product, said CEO Dave Damer. “Then we used datasets like the AMI Corpus, 2000 HUB5, VoxForge and others to see if they improved our word error rate and ROUGE scores. If we found sets that benefited the model, we used those alongside our own data and started doing data augmentation, getting multiple recordings of the same meetings and corrupting the quality of other recordings to increase the size of the dataset.”
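
Two of the steps Damer mentions, scoring a model by word error rate and corrupting recordings for augmentation, can be sketched in a few lines of Python. This is an illustrative sketch only, not Testfire's implementation: the function names `word_error_rate` and `add_noise` are hypothetical, the audio is assumed to be a plain list of float samples, and Gaussian noise at a chosen signal-to-noise ratio is just one assumed way of "corrupting the quality" of a recording.

```python
import random


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edits needed to turn the reference words seen
    # so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def add_noise(samples, snr_db, seed=None):
    """Return a copy of samples with Gaussian noise at roughly snr_db (assumed method)."""
    if not samples:
        return []
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = noise_power ** 0.5
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One substituted word out of six reference words.
print(word_error_rate("schedule the budget review for friday",
                      "schedule a budget review for friday"))
```

Comparing the average error rate on a held-out set of meetings before and after adding an outside corpus is one way to decide whether that corpus "benefited the model." A lower `snr_db` produces a noisier copy of the recording.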

Ultimately, Testfire found that the best sources were meetings from its beta period, combined with data from customers who opted to let the company use their meetings for continuous improvement.

A Rapidly Moving Space

Methods for training products against voice datasets are evolving rapidly, Tateosian said. As a result, companies take different approaches depending on their end customers’ use cases and the business model behind their product. But the biggest challenge for almost all of them, he said, is having enough data of decent quality to create the models that train these machines.

One issue is that many companies are behind the curve compared with Amazon and Google, which have been collecting and creating giant datasets of different sounds and voices for years. Google makes some of its audio datasets publicly available, Tateosian noted. “My understanding based on conversations I have heard in the market is that they are an interesting place to start but they are not adequate if you are developing a production-level product that’s going to go on the market,” he said. “There is just not enough data or maybe it is not of the highest quality or diversity within the dataset. I have heard similar things about other public datasets — it’s a little bit like getting the "Cliff Notes" but not reading the book.”

Create Your Own Dataset

Another way to get the necessary training data is to create your own dataset, a task that can be outsourced to specialist companies, as NXP does. How those companies go about it depends largely on their business models. Some outsourcers, for example, keep a few thousand people on retainer who can say words or phrases in different ways and with different accents. Those recordings are then added to the outsourcer's growing dataset.

Companies can build their own datasets from scratch if they are inclined, Tateosian said. “There is also a huge amount of audio data available online, such as with YouTube. The challenge is that it is not categorized in any particular way. YouTube isn’t set up, at least publicly, for someone to search for specific types of audio and then to be able to abstract that and create a model.”

Call centers are another source of datasets, he added. “Remember the frustration of 10 years ago, trying to talk to an automated call center. They have been learning and improving by recording and collecting all those calls. They have all the data within the context of whatever helpline that they are running.”

Design Principles

Whether you build your own dataset or use publicly available ones, there are some important design criteria to keep in mind. “Ideally you need to support the mainstream business languages that your customers speak,” said Ed Price, director of compliance at Devbridge Group. “Take a page from the book of the big boys on the block (Amazon, Apple, Google, Microsoft): they don't support all languages and voices with their digital assistants. Unless you have a specific customer need, or your business depends on it, then don't.”

Price also advised companies to focus on the jargon, abbreviations and industry speak their customers might use in the voice user interface. He suggested conducting a design heuristics study of customers and how they ask questions. “You could set up a prototype with Alexa or another device and capture the asks (speech-to-text) and then use that data to manage your features,” he said.
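
One way to act on captured asks is to count the terms that fall outside the vocabulary a product already handles, which tends to surface jargon and abbreviations. The sketch below is an assumption-laden illustration, not a method Price describes: the function name `unknown_terms`, the sample transcripts and the known-word list are all hypothetical, and the transcripts are assumed to be plain speech-to-text strings.

```python
import re
from collections import Counter


def unknown_terms(transcripts, known_vocabulary, min_count=2):
    """Count words in captured asks that are not in the known vocabulary."""
    known = {word.lower() for word in known_vocabulary}
    counts = Counter()
    for text in transcripts:
        # Lowercase and keep only letters, digits and apostrophes.
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            if token not in known:
                counts[token] += 1
    # Most frequent first; drop terms heard fewer than min_count times.
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


# Hypothetical asks captured from a prototype device.
asks = [
    "what is the sla for tier two",
    "show me the sla report",
    "open a p1 ticket",
]
vocabulary = ["what", "is", "the", "for", "tier", "two", "show", "me",
              "report", "open", "a", "ticket"]
print(unknown_terms(asks, vocabulary))
```

Terms that recur across many asks are candidates for the custom vocabulary or for new features, while one-off terms can be reviewed by hand.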