PHOTO: Brooke Cagle

Computers, whether personal computers or smartphones, have historically needed screens to be meaningful and useful. With screens comes the necessity of effective input modalities. We know we need some way of pointing at things shown to us (typically using a mouse or touch), as well as some way of entering text or issuing commands (typically using a keyboard). Working with a computer entails interacting with it, and the key lies in finding the most efficient ways to accomplish both input and output on the device.

Amazon Echo Brought Voice to the Forefront

In 2014, Amazon surprised the world by placing a computing device in our homes, the Amazon Echo, that had no screen, keyboard, mouse or touchpad. The idea was radical, as it meant voice was the only interface for input to and output from the device. While you could argue that Echo was (and still is) inherently special-purpose, not meant to replace our (portable) computers, it covered a variety of things we frequently do with our more capable devices: check the weather or the news, get answers to straightforward questions, set timers and reminders, play games, control our smart home, communicate with others and, above all, listen to music. That last activity alone would have justified the existence of the device. Saying “Alexa, play the Beatles” when we are in the mood for it trumps any other way of accomplishing the same, such as walking over to a device, selecting artist search, typing the first letters of the band’s name, then selecting an album or station and hitting “play.”

The advent of the Echo triggered a new technology hype cycle, with all its accompanying bells and whistles. Developers jumped at the opportunity to apply their coding talent to an interface that was largely new to them (ignoring the fact that voice has actually been around for a long time). They attempted to rebuild well-known mobile apps or web experiences in the form of voice interactions, without taking the peculiar features of language-based interfaces into account. The first implementations were as expected: clunky, difficult to maneuver, often short-lived, as usefulness was limited. A parallel development saw similar struggles: chatbots on messaging channels such as Facebook Messenger.

Adding to the confusion was the industry’s inability to agree on a common term for these voice experiences. Amazon calls them “skills,” Google calls them “actions,” and Samsung picked a highly technical (and itself clunky) term, “capsules,” for its Bixby voice assistant. Apple doesn’t have a third-party ecosystem for Siri yet. In typical Apple tradition, the company is taking its time to think through how this new input modality should really work, to make it stick for users. That is ironic, given it was the first company to launch a voice assistant, in 2011, with Siri being the last major technology introduction before Steve Jobs’ untimely passing.

Voice's Ongoing Struggle With Stickiness

Fast forward six years, and home assistants, with voice as the sole computer interface, have become the fastest growing consumer electronics category in history. What’s striking to observe, though, is the relatively slow adoption and low stickiness of skills or actions, the community-driven enhancements that worked so well in the case of mobile apps. We still use our home assistants primarily for playing music, setting reminders or asking knowledge questions. In my own household, we have not adopted a single skill outside of what the Amazon Echo provides out of the box. Why is that?

The community created the hashtag #VoiceFirst to advance the idea of voice as a dominant channel of communication. They argue it is the most natural channel of all, the channel closest to human nature, as it is the primary channel evolution has brought about to enable communication. And it certainly is! But then shouldn't we all be making an effort to curate our own personal selection of voice skills/actions and make frequent use of them? We definitely aren’t.

This lack of stickiness should give rise to the realization that maybe voice isn’t another “platform,” as some postulate. It isn’t something that can, by itself, sustain an entire industry and ecosystem of voice-only developers, or companies focusing on nothing but building voice experiences for other businesses. It is a modality first and foremost, one with peculiar features that needs specialists, so-called voice user interface (VUI) designers, to craft meaningful experiences. It will not become another platform on the scale of the web or mobile.

Some blame the problem of “discoverability.” In the absence of a screen, with its innate ability to remind us of features through a display of available program or app icons (in the app store, or on our home screens), it is up to us with voice to remember what the device can do for us. In theory, voice should allow us to “just ask,” given we have a true need. In theory, the device should be able to guide us through available skills by means of a dialogue. In practice, however, that is where things fall apart.

The Problem With Voice? It's Both Fast and Slow

Voice can be incredibly frustrating as a channel. It is inherently slow when used in dialogue. The paradox of voice is that it is fast as an input channel (we can speak three times faster than we type, according to a 2016 Stanford study), yet it is, perhaps surprisingly, slow as an output channel.

We listen to podcasts at 1.25x or 1.5x speed for a reason. When looking at a complex website such as Amazon.com, we typically grasp its macro-structure instantly. But imagine listening to someone reading back the contents of the page over the phone. It would take many minutes, and the imagery that many web pages make ample use of would be lost altogether. Add to that the fact that dialogue is error-prone. While speech recognition is growing increasingly proficient at transcribing speech to computer-processable text, the interpretation of that text is the tougher task, as it has to deal with the inherent ambiguity and complexity of language. And when it comes to having a real back-and-forth dialogue with our devices, we are still in the very, very early days. Today’s implementations are mostly simple question-and-answer systems.

Would progress in dialogue really help? Suppose we combine the knowledge represented on the WWW with the conversational intelligence of a human being, and make that available through the body of a voice-only home assistant. We would still not have a device that would be able to help us with many of our daily needs. Too much depends on the ability to show us things, not just talk about them. “A picture says more than a thousand words.”

The Most Efficient Interface Is Multimodal

Voice for output works well when the answer is short and succinct. Yet it fails when the answer is complex or verbose. As an example, consider the answers to questions such as: “what are this year’s Oscar nominations” (a table), or “which countries qualified for the Eurovision Song Contest” (a list), or “what are the cuts of beef” (a list, optimally accompanied by an image). Search engines, when used on devices with a screen, know which format to present to best answer the given question. With voice only, answers either fall short, or become too cumbersome to process.

Voice for input works well when we need to enter text. Yet it fails for pointing or selection, i.e. as a replacement for touch. Consider a shopping scenario on a voice device with a screen, such as the Amazon Echo Show. After saying “I’d like to purchase a blender,” the device shows a horizontal list of candidates. The dialogue then likely continues like this: “Alexa, scroll right. Scroll right. Show number 8. Show details. Scroll down. Scroll down. Go back. Scroll right …”

The future must be multimodal. We need efficient interaction. Over the years of using smartphones and newer forms of devices, we have not only gotten used to efficiency and convenience, we've come to expect it. When it comes to interfacing with computers, different modalities have different strengths. If we want voice to stick, we need to use it where it's efficient, and complement it with screens and touch where it isn’t.

It won’t be enough to tell users “Have a look at the Alexa app on your phone for more information” as part of an answer. What if my phone isn’t nearby? It might be just a few steps away, yet the level of convenience we have been trained to expect makes us reject getting up and fetching it as a viable option in some situations, such as when sitting at the dinner table with a partner.

Furthermore, we need to resist the temptation to force an interaction into a voice-only paradigm where it might not belong. Only a fraction of what makes sense on the visual web makes sense through a voice-only interaction. Attempting to do so nevertheless, say for the sake of just “being on Alexa because it’s cool,” might backfire and leave a bad impression of your brand. 

Case in point: I recently attended a VUI design workshop at the Project Voice conference in Chattanooga, Tenn. A representative from a health insurance firm told of a project they were embarking on to build a voice skill for finding in-network doctors. It didn’t take long for the audience to realize that this might be the classic case of “doesn’t belong here.” The process of finding a doctor inherently involves a complexity of personal preferences that isn’t easily representable through voice alone. The conclusion was that voice could potentially serve as the entry point, but the interaction would ultimately have to include a visual modality if you don’t want to frustrate your customers.

Designing the Multimodal Future

Building effective multimodal interfaces isn’t trivial. It needs the joint expertise of VUI and GUI designers working together to combine modalities into something new. The forms prevalent on devices today, such as the Amazon Echo Show, are limited to showing lists of elements to choose from (such as when purchasing products, or searching for recipes to cook), then displaying text and visual information. True multimodal interfaces, however, need to reflect the complexity that modern websites show, with forms consisting of radio buttons, checkboxes and drop-down lists, as well as quick ways of typing commands or entering text. We need the equivalent of full-fledged tablet experiences. The screen of the Echo Show needs to become detachable, so we can temporarily hold it in our hands if needed for faster input.

The concept of readily available screens needs to become pervasive in our households. The TV in the living room needs to become another form of home assistant screen, instantly on when a question in the household comes up. The dining table needs a readily accessible multimodal device. The kitchen needs one. Our home offices need one. The world of special-purpose screens in our lives, such as TVs, computer monitors, laptop screens, smartphone displays, alarm clocks, or the first generation of home assistant displays, needs to evolve into a world where screens become always-on (or instant-on), and constantly connected to the assistant ecosystem of our choice (Amazon, Google, Samsung, Apple …).

When multimodal interfaces become as pervasive as Echo Dots or Google Home Minis have become in many households today, we will have achieved the level of efficiency and convenience we have only glimpsed to date.