Sundar Pichai Google Duplex demo

Google Duplex, the artificial intelligence (AI) system revealed by Google CEO Sundar Pichai at the Google I/O developer conference earlier this month, has been perceived as an outright miracle by some and as an almost existential threat by others.

With Duplex, Google Assistant can make phone calls on behalf of users to do things like schedule appointments or make reservations. I provided my analysis of the technology and the implications behind it in a recent piece on Voicebot.ai. Now I’d like to focus on the impact this technology could have on businesses aiming to provide a better experience for customers calling their contact centers.

Many customer experience research firms report that a growing number of consumers prefer using self-service systems over talking to agents in contact centers. Combine that trend with the entry of speech technologies into the mainstream, and you can cut costs by deflecting calls from your contact center while improving the customer experience (CX).

The specific opportunity lies in adding to your interactive voice response (IVR) system or voice portal many more self-service functions than you’ve dared to deploy to date — with the promise that the technology is now mature enough to provide an exceptional experience.


Related Article: 4 Questions to Ask Before You Send in the Chatbot

Google Duplex Opens Up an Opportunity Too Good to Pass Up

Recent technological developments present a unique opportunity to revisit the first impression that customers get from your company’s hotline.

Traditional IVR systems (“For support press 1, for sales press 2 ...”) have earned themselves a bad reputation because of their focus on keeping customers from speaking to agents. While IVR systems have moved toward a better customer experience in recent years, it’s the amazing advances in speech recognition of the past 24 to 48 months that present an opportunity too good to pass up.

Your voice self-service or IVR system might not have the ability to code itself quite yet, but the opportunity to revisit how your company presents itself when called is one you cannot ignore. When was the last time you called your own company’s central toll-free number? If it has been a while, search for that number online now and give it a try. Then watch that Google Duplex demonstration again (even though the roles here are of course reversed). Hear the difference?

Many commentators have raised the idea that bots could soon speak to bots — i.e., businesses could deploy such systems themselves to answer our AI assistants’ calls. Such systems are known as voice self-service applications, or voice portals, and businesses have deployed them for decades (from “Say balance or press 1” in the early days to “In a few words, please describe the reason for your call”). In and of itself, this is nothing new.

Related Article: A Good Chatbot Is Hard to Find

What's Changed on the Speech Front?

With the proliferation of deep learning algorithms (a particular form of machine learning) for finding patterns in speech signals and relating them to words, we can now continuously improve these systems toward near-perfect accuracy, even for 8-kilohertz telephony audio. It’s a function of training data — and if there’s one thing we have enough of these days, it’s data.

In theory (and most likely soon in practice), deep learning algorithms will achieve higher recognition accuracy than humans. We have all come to accept this technology as performing “well enough” for everyday use in our home or mobile assistants, such as Alexa, Google Assistant, Siri, Cortana or Bixby. Google and many other technology companies, including Amazon, IBM and Microsoft, offer their newest speech-to-text engines as cloud APIs ready for you to use. That includes the other technology needed to build these more natural systems: text-to-speech, so that the voice on the phone sounds human.

In the past, engineers had to guide the technology by providing a so-called “grammar,” which would spell out all the words (in the expected order) that we assumed a person would say at each dialog turn. While not much more laborious than any other kind of programming, it did require specialists with experience in language technology and, above all, a good sense for how humans would respond to a system’s prompts and messages. Another approach, called statistical language models, was more in line with today’s algorithms in that it was based on big data, but developers had to provide their own training sets, and those often turned out to be too expensive or impossible to use in practice because application changes over time involved large follow-up costs.
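To make the grammar approach concrete, here is a minimal sketch of what a hand-written grammar amounted to. The intent labels and phrasings are invented for illustration; real IVR systems used dedicated grammar formats such as the W3C’s SRGS rather than regular expressions:

```python
import re
from typing import Optional

# A hand-written "grammar" for a single dialog turn ("Say balance or transfer"):
# the engineer enumerates, in advance, every word sequence the caller is
# expected to say. Labels and phrasings here are made up for illustration.
GRAMMAR = {
    "ACCOUNTBALANCE": re.compile(r"^(my |account )?balance$"),
    "TRANSFER": re.compile(r"^transfer( funds| money)?$"),
}

def match(utterance: str) -> Optional[str]:
    """Return the intent for an in-grammar utterance, or None if unmatched."""
    text = utterance.strip().lower()
    for intent, pattern in GRAMMAR.items():
        if pattern.match(text):
            return intent
    return None  # out-of-grammar: the IVR would re-prompt or escalate to an agent
```

Anything the caller says that was not enumerated fails outright — which is exactly why grammar writing demanded specialists with a good sense for how humans actually respond to prompts.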

With the emergence of the newer approaches to speech recognition, we have seen an onslaught of startups that build their own speech-to-text products by using open-source machine learning algorithms and training data that is often available for purchase. Incumbent vendors that have dominated the market for decades now see their core business at risk (and some are rushing belatedly to launch their own versions of the technology).


Related Article: Machine Learning or Linguistic Rules: Two Approaches to Building a Chatbot

Text-to-Meaning: The New Frontier

Note how I have refrained from using the term “understanding” so far when talking about the technology. As I described in the article “Don’t Confuse Speech Recognition With Natural Language Understanding When Talking Bots,” speech recognition, or speech-to-text, is only one of two steps necessary to power voice AI. The focus has now shifted to getting better at extracting meaning from the words the system tells us it heard — something the industry seems to have agreed to call natural language understanding (NLU), or text-to-meaning.

Building an NLU system is still a challenge. Consider this example: In the context of a conversation with a bank’s self-service system, the words or phrases “balance,” “How much do I have in my account?” and “Check my cash” all essentially mean the same thing — namely “ACCOUNTBALANCE” (as one possible way to represent the meaning for further processing within a dialog system). Yet the words used to express the same meaning are entirely different. While machine learning can also help with this, the words themselves are often not enough of an indicator for an unambiguous classification of user intent. For example, “balance” could mean something entirely different if the context was, say, sports or nutrition.
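The classification step can be illustrated with a deliberately simple sketch: matching an utterance against example phrases by word overlap. The intents and example phrases below are invented for illustration; a production NLU system would use a trained statistical model and, crucially, dialog context:

```python
import math
import re
from collections import Counter

# Toy intent classifier: maps free-form utterances to a label such as
# "ACCOUNTBALANCE" via cosine similarity over bag-of-words vectors.
# Intents and phrases are made up for illustration only.
TRAINING_EXAMPLES = {
    "ACCOUNTBALANCE": [
        "balance",
        "how much do i have in my account",
        "check my cash",
    ],
    "TRANSFER": [
        "send money",
        "transfer funds to savings",
    ],
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(utterance):
    """Return the intent whose closest example phrase matches best."""
    vec = Counter(tokenize(utterance))
    best_intent, best_score = None, 0.0
    for intent, phrases in TRAINING_EXAMPLES.items():
        for phrase in phrases:
            score = cosine(vec, Counter(tokenize(phrase)))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent  # None when nothing overlaps, e.g. out-of-domain input
```

Note that this matcher would happily assign the word “balance” to ACCOUNTBALANCE even in a conversation about sports or nutrition — precisely the ambiguity problem described above, and the reason context matters so much in NLU.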

So building a good NLU system is clearly not easy, but tech vendors are now giving the challenge enough attention to improve the technology at a fast pace.

As mentioned before, the opportunity ahead lies in adding many more self-service functions than you’ve dared to deploy in the past — and your customers will welcome those new functions. And thanks to the division of speech-to-text and text-to-meaning, you can take the investment and deploy self-service not only over the phone, but also on the newer smart home assistants such as Amazon’s Echo devices or Google’s Home products, or even in messaging-based channels such as Facebook Messenger or your web chat tool.

Your contact center vendors now have a chance to step up their game and help you give your hotline new life.