Abstract dark king sketch on textured concrete wall background - Dark side of voice technologies concept.

At the end of January Google struck a small blow against fraudsters who seek to compromise voice applications. The backbone of voice technology is speech synthesis, also known as synthetic speech or text-to-speech technology. Essentially, it powers the speech interfaces that allow humans to interact naturally with digital devices. But this technology has a dark side: as Google explained in a blog post, it can be used by malicious actors to fool voice authentication systems, or to create forged audio recordings to defame public figures.

To counter this, Google announced that Google AI and Google News Initiative were making a synthetic speech dataset containing thousands of phrases spoken by its deep learning text-to-speech models available to participants in the ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge. ASVspoof 2019 is a global open challenge, supported by several academic institutions around the world, that invites researchers to submit countermeasures against fake or "spoofed" speech, with the goal of making automatic speaker verification (ASV) systems more secure. “By training models on both real and computer-generated speech, ASVspoof participants can develop systems that learn to distinguish between the two,” Google said.

A Unique Identifier

Such a development is crucial if voice applications and technologies are to continue to make their way into mainstream life. The voice is a unique identifier for an individual, according to Adrian Bowles, Lead Analyst for AI at Aragon Research. “The human voice has enough physical attributes that can be captured in digital form to identify individuals — similar to fingerprints,” he said. 

“This is the spectral frequency display — a view of a digital audio recording — of my voice speaking a common three letter word. With enough training data — listening to a recording of me over time, or having me speak a phrase like 'this is my banking password' — an application could learn how I speak well enough to recognize my voice by comparing my input to a stored profile with my digital voiceprint,” Bowles said.
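The comparison Bowles describes, matching a new voice sample against a stored profile, can be sketched in a few lines of code. This is a deliberately simplified illustration, not a production biometric: the coarse spectral feature, the sine-wave "recordings," and the 0.95 similarity threshold are all assumptions made for demonstration; real systems use far richer features (such as MFCCs) and trained models.

```python
import numpy as np

def spectral_profile(signal: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Reduce a waveform to a coarse, normalized magnitude-spectrum profile."""
    spectrum = np.abs(np.fft.rfft(signal))
    # Average into a fixed number of bins so profiles from recordings of
    # different lengths remain comparable.
    profile = np.array([b.mean() for b in np.array_split(spectrum, n_bins)])
    return profile / np.linalg.norm(profile)

def matches(enrolled: np.ndarray, attempt: np.ndarray,
            threshold: float = 0.95) -> bool:
    """Accept the speaker if cosine similarity to the stored profile is high."""
    return float(np.dot(enrolled, attempt)) >= threshold

# Toy stand-ins for recordings: one mix of tones plus noise plays the role
# of the enrolled speaker's voice; a different tone mix plays an impostor.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000)
speaker = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
impostor = np.sin(2 * np.pi * 330 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

enrolled = spectral_profile(speaker + 0.01 * rng.standard_normal(t.size))
attempt = spectral_profile(speaker + 0.01 * rng.standard_normal(t.size))
spoof = spectral_profile(impostor + 0.01 * rng.standard_normal(t.size))

print(matches(enrolled, attempt))  # same speaker: accepted
print(matches(enrolled, spoof))    # different voice: rejected
```

The design choice here mirrors the quote: enrollment builds a stored voiceprint, and verification is just a similarity check against it.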

Related Article: 9 Voice Datasets You Should Know About

The Growth of Voice Fraud

Because voice is a unique identifier, it is considered a viable biometric measure for security. Banks are increasingly adopting voice authentication solutions for account holders, and it is a staple for protecting sensitive data in other sectors as well. But as voice authentication grows as a security protection method, so does voice fraud. According to a 2018 survey by Pindrop, the rate of voice fraud increased more than 350 percent from 2013 to 2017. Today, voice fraud costs U.S. organizations $14 billion each year within call centers alone.

“As more channels become voice-activated, fraudsters will leverage emerging technologies such as voice synthesis to impersonate consumers,” according to the Pindrop survey. “There is a wide variety of fraudulent approaches, from vishing, or voice phishing, using automated recordings and other techniques to get callers to hand over personal data, to more nefarious approaches, such as creating an actual imitation of an individual's voice,” said James D’Arezzo, CEO of Condusiv Technologies. “This technology was demonstrated by Adobe in 2016 as VoCo, billed as a type of Photoshop for voice, where capturing a few minutes of someone's speech can then be manipulated to make them say anything. Adobe has yet to release the product,” he said.

Another company, Lyrebird, has a product on the market that lets users create an avatar of their own voice. D'Arezzo noted that Lyrebird has an ethics statement on its website, warning that its software “could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.”

For the most part, synthetic speech hacks aimed at live call center agents are not good enough to fool a human, said Alexey Khitrov, CEO of biometric authentication firm ID R&D. “But they do appear good enough to fool many automated systems. So the threat is real, especially as more companies are creating products that synthesize speech.”

Until recently, the tools fraudsters used to compromise voice systems were limited: they required training on specific phrases, or a relatively large sample of the person’s voice, Khitrov said. The goal of capturing speech is to obtain every phoneme in a particular language as spoken by that speaker, he explained. “Then the synthesizer software analyzes the phonetics in the speech and recreates the voice by combining phonemes using algorithms that know what phonemes to speak for particular words. The algorithms add inflection and emphasis to sound more real.”
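Khitrov's description of phoneme concatenation can be reduced to a toy sketch. Everything here is hypothetical: the phoneme bank holds placeholder arrays rather than audio captured from a target speaker, and the pronunciation table is hand-written; real concatenative synthesizers also smooth the joins and add the inflection and emphasis he mentions.

```python
import numpy as np

# Hypothetical bank of 100-sample audio clips, one per captured phoneme.
# In a real attack these would be cut from recordings of the target speaker.
phoneme_bank = {
    "HH": np.zeros(100),
    "EH": np.ones(100),
    "L": np.full(100, 0.5),
    "OW": np.full(100, -0.5),
}

# The algorithm "knows what phonemes to speak for particular words"
# via a pronunciation lookup.
pronunciations = {"hello": ["HH", "EH", "L", "OW"]}

def synthesize(word: str) -> np.ndarray:
    """Stitch the stored phoneme clips together for one word."""
    return np.concatenate([phoneme_bank[p] for p in pronunciations[word]])

audio = synthesize("hello")
print(audio.size)  # 400 samples: four 100-sample clips joined end to end
```

This also makes clear why such attacks needed a large voice sample: the bank must cover every phoneme the attacker wants to speak.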

However, recent advances in machine learning and neural networks have alarmed experts because they make it much easier for criminals to fool voice recognition systems, said Douglas Crawford, Digital Privacy Expert at BestVPN.com — so much so that entire on-the-fly conversations can sound authentic enough to fool most modern biometric voice identification systems.

Related Article: 5 Examples of Voice-Powered Customer Experiences

The Industry Pushes Back

Fortunately, Crawford reported, a number of startups have sprung up that promise to help companies detect synthetic voice fraud. “These invariably use the same machine learning and deep neural network technologies to detect the use of machine learning and deep neural network technologies” used by fraudsters, he said. How effective these startups will be remains to be seen, Crawford concluded. “At the very least they should alert call centers about potential voice discrepancies to which they can respond with additional verification procedures.”
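The detection approach Crawford describes, training on examples of both genuine and machine-generated speech as in the ASVspoof challenge, can be reduced to a toy sketch. The single "variability" feature and the Gaussian data below are invented for illustration only; real countermeasures learn from rich acoustic features, typically with deep neural networks.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented one-dimensional feature: pretend genuine speech scores higher
# on some measure of acoustic variability than synthesized speech does.
genuine_train = rng.normal(loc=1.0, scale=0.2, size=500)
spoofed_train = rng.normal(loc=0.4, scale=0.2, size=500)

# "Training": place the decision threshold midway between the class means.
threshold = (genuine_train.mean() + spoofed_train.mean()) / 2

def is_genuine(feature: float) -> bool:
    """Flag a sample as genuine speech if its feature clears the threshold."""
    return feature > threshold

# Evaluate on fresh samples drawn from the same toy distributions.
genuine_test = rng.normal(1.0, 0.2, 200)
spoofed_test = rng.normal(0.4, 0.2, 200)
accuracy = (np.mean([is_genuine(f) for f in genuine_test])
            + np.mean([not is_genuine(f) for f in spoofed_test])) / 2
print(f"toy detection accuracy: {accuracy:.2f}")  # well above chance
```

The broader point survives the simplification: a detector is only as good as the gap between real and synthetic speech on whatever features it measures, which is why datasets of computer-generated speech like Google's matter.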