The Gist
- Voice still struggles. Despite new features, Alexa Plus hasn’t fixed core problems like turn-taking or natural conversation flow.
- Longer isn’t better. Smarter responses may impress in demos, but they don’t make voice assistants more useful or human in real-world use.
- Basics remain broken. A decade in, voice assistants still interrupt, forget context and fail at simple conversation dynamics.
As someone who has worked on voice-based assistants since the late ’90s and was on the Amazon Alexa team when it launched in 2014, I can say with a measure of sober dismay that we are still nowhere close to solving the core challenge of voice-based, eyes-free, hands-free (often referred to as “far-field”) conversational interaction.
The latest buzz about “Alexa Plus,” Amazon’s upcoming and highly anticipated AI-powered upgrade, might sound like a quantum leap forward. But in truth, it’s likely going to be yet another high-gloss coat of paint over the same foundational flaws that have plagued far-field voice assistants from the very start.
Let’s be clear. Voice assistants like Siri, Google Assistant and Alexa have always been, at their best, information retrieval helpers or command processors that let users engage in eyes-busy, hands-busy situations. In such contexts, they excel at answering short queries (“What’s the weather in Bangor this coming week?”) or executing simple tasks (“Turn off the basement fan”).
But beyond that, the promise of natural, flowing far-field conversations has remained painfully unfulfilled. This is not for lack of ambition or technical talent, but because we still haven’t cracked fundamental challenges like conversational turn-taking, contextual memory and, most crucially, endpointing. That’s the ability to discern when a user has finished speaking, especially during a pause, a hesitation or a mid-thought pivot.
Table of Contents
- Why Voice Assistants Can’t Hold a Conversation
- More Features Don’t Mean Better Experiences
- Voice AI Is Still Stuck on the Fundamentals
Why Voice Assistants Can’t Hold a Conversation
Anyone who has tried to make small talk with their smart speaker knows the frustration. Say something like “Play… umm…” and watch your assistant either cut you off prematurely or fill the silence with something irrelevant. The problem is both technical and experiential, and it makes the interaction feel more like talking at a device than with one.
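To make that endpointing failure concrete, here is a minimal sketch, in Python, of the fixed-silence-timeout heuristic that voice pipelines commonly fall back on. The names and the 700 ms threshold are illustrative assumptions on my part, not a description of Alexa’s actual stack; the point is simply that once a hesitation outlasts the timeout, it looks identical to the end of the utterance.

```python
# Minimal sketch of naive silence-based endpointing (illustrative only;
# constants and names are assumptions, not any vendor's implementation).

from dataclasses import dataclass

SILENCE_TIMEOUT_MS = 700  # a fixed cutoff; hesitations often run longer than this


@dataclass
class EndpointState:
    silence_ms: int = 0       # consecutive milliseconds of silence so far
    heard_speech: bool = False  # whether the user has started speaking at all


def update(state: EndpointState, frame_is_speech: bool, frame_ms: int = 20) -> bool:
    """Process one audio frame; return True when the endpointer decides the user is done."""
    if frame_is_speech:
        state.heard_speech = True
        state.silence_ms = 0
        return False
    if not state.heard_speech:
        return False  # still waiting for the user to start talking
    state.silence_ms += frame_ms
    # The flaw: a mid-thought pause ("Play... umm...") is indistinguishable
    # from a finished utterance once it crosses the fixed timeout.
    return state.silence_ms >= SILENCE_TIMEOUT_MS
```

Production endpointers layer acoustic and language cues on top of a timeout like this, but as the rest of this piece argues, the lived experience still breaks on exactly these hesitations and mid-thought pivots.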
Alexa Plus, touted as “smarter than ever” and built on generative AI foundations, promises to be more conversational, proactive and humanlike. According to Amazon’s February launch event, this upgrade would allow Alexa to make restaurant reservations, learn your preferences, manage reminders with nuance and even suggest gifts for family members. There was even a moment in the demo where Alexa referred to itself as “your best friend in the digital world.”
That’s cute marketing. But is it smart marketing? I don’t think so. Based on my experience and the limited early access reports trickling out, it doesn’t look like Alexa Plus has solved the key user experience problems that matter. It may be capable of longer, more semantically rich responses. It may be plugged into better knowledge systems. But that doesn't mean it knows when to speak, when to listen or how to engage in real dialogue. And these are the very basics of what makes conversation feel human.
More Features Don’t Mean Better Experiences
A recent article by CNET highlighted how even early testers of Alexa Plus have reported that many promised features (e.g., ordering food through a conversation or recognizing family members visually) are still missing. The rollout appears to be quiet and staggered, perhaps to avoid the scrutiny Apple has recently faced over Siri’s repeated AI delays. And there’s still a lingering question no one at Amazon seems eager to answer: Are we finally getting an Alexa that can converse naturally, or just one that answers more complex questions with more flair?
The distinction is critical. An assistant that can spew longer, more informed responses might impress in demos, but it won’t necessarily be more useful or likable in real life. In fact, it might make the experience worse. Who wants to listen to 30 seconds of a voice assistant talking on and on? Voice as a medium is inherently ephemeral: it is rewarded with adoption for its speed, clarity and conciseness. If Alexa Plus drones on like a podcast host delivering a monologue, it will be less effective.
Unless Amazon can deliver an experience that is both smarter and tighter, most users will feel like nothing much has changed. And they’ll be right. That’s because the real breakthrough in voice AI won’t come from better answers; it will come from better conversation. That means the ability to interrupt and be interrupted naturally. It means assistants that know when to wait, when to stop and when to listen. And it means systems that don’t require you to formulate your entire command in advance like a spell from a magic book, instead letting you fumble, pause and be human.
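For a picture of what “knowing when to wait, when to stop and when to listen” means mechanically, here is a toy turn-taking state machine in Python. The states and event names are my own illustrative assumptions, not how Alexa Plus or any shipping assistant is built; the transition that matters most is barge-in, where user speech during playback hands the floor straight back to the user.

```python
# Toy turn-taking state machine (illustrative assumptions, not a vendor design).
# The key idea: "speaking" must be interruptible, and "listening" must tolerate pauses.

from enum import Enum, auto


class Turn(Enum):
    IDLE = auto()       # waiting for the wake word
    LISTENING = auto()  # user holds the floor; tolerate pauses and fillers
    THINKING = auto()   # brief gap while a response is prepared
    SPEAKING = auto()   # assistant holds the floor, but yields on barge-in


def next_state(state: Turn, event: str) -> Turn:
    """Advance the dialogue on events such as 'wake_word', 'user_speech',
    'endpoint_detected', 'response_ready' and 'playback_done'."""
    if state == Turn.IDLE and event == "wake_word":
        return Turn.LISTENING
    if state == Turn.LISTENING and event == "endpoint_detected":
        return Turn.THINKING
    if state == Turn.THINKING and event == "response_ready":
        return Turn.SPEAKING
    if state == Turn.SPEAKING and event == "user_speech":
        return Turn.LISTENING  # barge-in: stop talking and listen immediately
    if state == Turn.SPEAKING and event == "playback_done":
        return Turn.IDLE
    return state  # ignore events that don't apply in the current state
```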
That’s not what Alexa Plus is promising to offer. At least not yet.
Voice AI Is Still Stuck on the Fundamentals
This continued failure to deliver on basic usability comes after literally a decade of iteration and billions in investment. The fact that teams brilliant enough to invent accurate far-field speech recognition have still not resolved conversational flow is, frankly, scandalous. It’s like building a self-driving car that can navigate highways flawlessly but still doesn’t know how to stop at a red light.
I say this not to diminish the technological achievements behind Alexa and its peers. Far from it. But I do want to temper the breathless hype with a dose of realism and a note of skepticism. Because despite the advancements enabled by generative AI, and despite all the marvels that LLMs and foundation models have delivered, the final frontier in AI is still speech. This goes beyond language or text. It includes the way humans actually speak, in ways that are improvised, interruptible, fallible and shaped by context and emotion.
When Amazon first introduced Alexa, it was a genuine marvel. It made voice computing accessible and even fun. But in the decade since, the progress has been more cosmetic than structural. The Alexa of today still interrupts, still can’t remember what you told it yesterday, still makes you shout its name just to be heard, still mispronounces words and still won’t correct itself.
If Amazon really wants Alexa to feel like a best friend (or even a tolerable coworker), it must do more than bolt ChatGPT onto a speaker. It must rethink voice interaction from the ground up. Until then, Alexa will remain what it has always been, a brilliant piece of technology that stops just short of brilliance in experience.