Interested in burning zombies?
Google Voice Search told me all about them when I accidentally activated it last week. I was listening to a “Morning Joe” discussion of the Democratic presidential candidates. Bernie Sanders’ name came up, but definitely no zombies, burning or otherwise.
While I’ve pocket-dialed many a friend and family member, this was a first. It made me think about how far voice technologies have come over the years.
Can You Hear Me Now?
Despite giving me “burning zombies” results from my unintended Bernie Sanders inquiry, Google Voice Search is definitely my go-to virtual assistant. It not only understands my questions in context, using a superior speech recognition system, but it also gives the appearance of knowing everything.
There’s a reason that Google Voice Search is so smart. In a Slate article, Google’s VP of engineering Scott Huffman explains that deriving intelligence from information is a core part of Google’s mission as a company. Google’s database of “500 million things it knows about the world” is extensive and dynamic, constantly being updated by web crawlers.
Technology companies like Google are investing not only in data collection and analytics to ensure the quality of their answers, but also in voice recognition and text-to-speech advances. Essentially, all are attempting to teach computers to better understand and respond to the world the way (smart) humans do.
Google has a lot at stake here. The article “Google Searches for its Future” posits that, given the market’s move to mobile, Google is facing the biggest shift since its founding: “If Google doesn’t figure out how to make the perfect virtual assistant, another tech company will.”
Apple’s Siri is already automatically installed on hundreds of millions of iPhones and the Apple Watch. Microsoft’s Cortana is an integral part of its new operating system, Windows 10. And Amazon has released its smart Echo appliance that converses with you to execute your every command.
Brazen Heads and Drunken Swedes
Speech synthesis is far from a new idea.
One of the earliest attempts at synthetic speech occurred in the 18th century, when Wolfgang von Kempelen’s wood and leather speaking machine generated audible speech. It notably provided important data to early studies of phonetics.
The 12th and 13th centuries boast even earlier legendary speech synthesizers — “brazen heads” fashioned of brass or bronze, whether mechanical or perhaps magical.
The first computer-based speech synthesis systems were created in the late 1950s, with notable contributions in the following decades from IBM, Bell Labs and Digital Equipment Corporation. For decades one of the best known was DECtalk, with its recognizable, inaccurate and slightly inebriated robotic cadence, often characterized as sounding like a “drunken Swede.”
Mobile electronics featuring speech synthesis began emerging in the 1970s. Advances since then have taken advantage of increased computing power to create much more natural-sounding synthetic voices, which have now come close to perfection using powerful concatenative and formant synthesis approaches.
Not Your Grandfather’s IVR
Just as with speech synthesis, speech recognition technology has come a long way, though many would say the Interactive Voice Response (IVR) systems we deal with daily remain medieval. Who hasn’t experienced the infinite loop of “Please repeat, I did not understand your response”?
At first, speech recognition systems could only understand digits. In 1952, Bell Laboratories designed the "Audrey" system, which recognized digits spoken by a single voice. Ten years later, IBM demonstrated a machine at the World's Fair that could understand 16 words spoken in English. In the 1970s, the DARPA Speech Understanding Research program was responsible for Carnegie Mellon's "Harpy" speech-understanding system. Harpy could understand 1011 words, approximately the vocabulary of an average three-year-old.
Speech recognition blossomed with advanced processing capabilities in the 1990s. When Dragon NaturallySpeaking was released in 1997, it could understand natural speech at about 100 words per minute. The technology improved steadily from that point, and the past two years have seen a dramatic improvement as commercial speech recognition adopted deep learning approaches combined with massive data sets.
As “Speech Recognition Through the Decades” explains, the bottleneck in speech recognition has always been the availability of data and the ability to process it efficiently. The recent arrival of Google Voice Search for the iPhone shows how an app can offload its processing to cloud data centers, which perform the necessary large-scale data analysis. To this analysis the Google app can add data from billions of search queries “to better predict what you're probably saying.”
These advances have carried over to personalization. Apple’s Siri draws on what it knows about you to generate a contextual reply, and it responds to your voice input with personality. When I asked Siri the meaning of life, it told me "All evidence to date suggests it’s chocolate."
Not to be outdone, Microsoft has just announced the addition of Cortana music sensing to Windows 10 Mobile. I asked Google Voice Search about this and it told me that Cortana could help me figure out what song is playing, whether I’m in my car or at a bar, and then help me get that song in Xbox Music.
The cool factor of music selection notwithstanding, important application benefits can be achieved from improved voice technologies.
Text-to-speech programs have enabled people with visual or other physical impairments to better “see” and “speak.” Voice activation and response applications are delivering entertainment and educational value in games and toys, and are serving well in more serious applications like those in motor vehicles. In the business environment, voice recognition is providing the ability to streamline workflows, drive productivity and even change how work gets done.
Interesting applications are emerging in the financial services industry. For example, Barclays and HSBC are rolling out voice recognition features for identity verification.
Applications for the healthcare industry show promise as well. Nuance is just unveiling its cloud-based Dragon Medical One Platform at HIMSS16. Described in a recent article, this speech recognition and documentation tool supports how physicians work and aims to redefine the relationship clinical users have with healthcare technology. For example, features will enable users to tap into evidence-based content using a smartphone as a secure microphone to dictate, edit and navigate the EHR (Electronic Health Record) on any workstation.
A Voice with No Name
Voice technologies have come of age. They are not only contributing to meaningful industry applications, but are even showing up at the Super Bowl.
Amazon’s 2016 Super Bowl commercial featured some big names: Alec Baldwin, Dan Marino, Jason Schwartzman, Missy Elliott … and “Alexa,” a cloud-based voice service that connects to Echo. The ad highlights all the things that Alexa can do, not the least of which is holding a conversation as she completes all the tasks you’ve requested.
I think Alexa is pretty cool. I also think that if we are moving toward a more natural, more human interface with our voice technologies, then having a name is important. I bet Alexa, Cortana, Harpy, Audrey and Siri would agree with me. So take a memo, Google Voice Search: “You need a name, and it just so happens that Deb is available.”