Voice technology has become part of our daily lives. We walk into our kitchens every morning and tell our smart speakers to read us the headlines, or we make a shopping list via voice command while we are in the middle of cooking dinner. 

The technology is everywhere, and not just in smart speakers. It’s answering our phones, carrying out our internet searches, giving us directions as we drive, and even translating our speech in real time. Recently, at Voice Summit 2019 in Newark, NJ, the event’s second year, I spoke with Shing Pan, VP of Marketing and Business Development for Speech Morphing, Inc., about the company’s advances in voice technology.

AI Modeling the 'Music of Human Speech'

“Speech Morphing is developing the next generation of speech synthesis,” said Pan. “If you look at today’s technology it tends to focus more on making the speech as natural as possible. But they don’t explicitly model the intonation and prosody of human speech. Prosody is the music and rhythm of human speech.”

You can train an artificial intelligence algorithm to sound like a human instead of a robot, but it’s a lot harder to get it to be believable. That’s because so much of our communication is nonverbal, including things like body language and intonation.

“When people talk face to face, only about 35% of it is actually based on what’s spoken,” said Pan. “The other 65% is based on body language and extra linguistic cues like eye contact, postures and tones. When you conduct a conversation virtually, you lose all the extra linguistic cues. When synthesized speech doesn’t use that you lose almost everything.”

Those robotic-sounding voices seem to be everywhere these days. Voice actors record much of what is being said so the AI can read from a prompt and respond appropriately. But the result still doesn’t quite sound human, to the detriment of a brand’s image.

Bringing Emotion to AI Voices

“Voice has been used everywhere, not just in recent years, especially in industries like banking and healthcare,” Pan continued. “A lot of text to speech tends to sound robotic. It has come a long way but it hasn’t reached a naturalness that these industries want to use. As people tend to embrace AI or bots, they want to be able to use their own voice talent they can use across multiple channels. This is not easy with the existing technology, so they will have a talent voice one place and a bot voice somewhere else. A voice is an extension of the brand.”

Each time a company wants to change prompts or add to the AI’s knowledge base, they have to bring voice actors back for more recordings. Speech Morphing’s technology makes it possible to synthesize many of those changes.

“The way we use speech synthesis we can use a very small recording to train the voice to create it very cost effectively,” said Pan. “Right now we require about 30 minutes of recording, but after our technology refresh we could do this with minutes to as little as 22 seconds.”

“Even with voice talent, sometimes the voice changes or they aren’t accessible for a refresh,” she continued. “When the prompt changes, it’s not economical to go back and record them again. Our technology makes a custom voice more economical so we can leverage the existing talent voice with a digital version to create any type of custom voice.”

When you call a call center and an AI voice asks you what your call is about, it can’t hear that you are upset or angry and respond appropriately. It can only listen to your words and repeat words back to you, which can be frustrating if you have a problem.

“Our voice can also be modified,” said Pan. “Because we captured all of the prosodies and human aspects of a voice talent, that makes us able to modify it down the road. Because we captured all the human aspects, we can create a context-based voice, from an angry voice to a happy voice to a sports style to a sales pitch style. If the customer is very angry, giving them a single monotone voice is not going to help them. We can give them an emotionally appropriate voice to help improve the customer experience.”

The next generation of voice technology is right around the corner and it’s got the potential to change everything.

To hear more about Voice Summit 2019, read my recap.