The Gist
- Voice still struggles. Despite new features, Alexa Plus hasn’t fixed core problems like turn-taking or natural conversation flow.
- Longer isn’t better. Smarter responses may impress in demos, but they don’t make voice assistants more useful or human in real-world use.
- Basics remain broken. A decade in, voice assistants still interrupt, forget context and fail at simple conversation dynamics.
As someone who has worked on voice-based assistants since the late ’90s and was on the Amazon Alexa team when it launched in 2014, I can say with a measure of sober dismay that we are still nowhere close to solving the core challenge of voice-based, eyes-free, hands-free (often referred to as “far-field”) conversational interaction.
The latest buzz about “Alexa Plus,” Amazon’s upcoming and highly anticipated AI-powered upgrade, might sound like a quantum leap forward. But in truth, it’s likely going to be yet another high-gloss coat of paint over the same foundational flaws that have plagued far-field voice assistants from the very start.
Let’s be clear. Voice assistants like Siri, Google Assistant and Alexa have always been, at their best, information retrieval helpers or command processors that let users engage in eyes-busy, hands-busy situations. In such contexts, they excel at answering short queries (“What’s the weather in Bangor this coming week?”) or executing simple tasks (“Turn off the basement fan”).
But beyond that, the promise of natural, flowing far-field conversations has remained painfully unfulfilled. This is not for lack of ambition or technical talent, but because we still haven’t cracked fundamental challenges like conversational turn-taking, contextual memory and, most crucially, endpointing. That’s the ability to discern when a user has finished speaking, especially during a pause, a hesitation or a mid-thought pivot.
Table of Contents
- Why Voice Assistants Can’t Hold a Conversation
- More Features Don’t Mean Better Experiences
- Voice AI Is Still Stuck on the Fundamentals
Why Voice Assistants Can’t Hold a Conversation
Anyone who has tried to make small talk with their smart speaker knows the frustration. Say something like “Play… umm…” and watch your assistant either cut you off prematurely or fill the silence with something irrelevant. The problem is both technical and experiential, and it makes the interaction feel more like talking at a device than with one.
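To make that endpointing failure concrete, here is a minimal sketch, in Python, of the fixed-silence-timeout heuristic that voice pipelines commonly fall back on. The names and the 700 ms threshold are illustrative assumptions on my part, not a description of Alexa’s actual stack; the point is simply that once a hesitation outlasts the timeout, it looks identical to the end of the utterance.

```python
# Minimal sketch of naive silence-based endpointing (illustrative only;
# constants and names are assumptions, not any vendor's implementation).

from dataclasses import dataclass

SILENCE_TIMEOUT_MS = 700  # a fixed cutoff; hesitations often run longer than this


@dataclass
class EndpointState:
    silence_ms: int = 0       # consecutive milliseconds of silence so far
    heard_speech: bool = False  # whether the user has started speaking at all


def update(state: EndpointState, frame_is_speech: bool, frame_ms: int = 20) -> bool:
    """Process one audio frame; return True when the endpointer decides the user is done."""
    if frame_is_speech:
        state.heard_speech = True
        state.silence_ms = 0
        return False
    if not state.heard_speech:
        return False  # still waiting for the user to start talking
    state.silence_ms += frame_ms
    # The flaw: a mid-thought pause ("Play... umm...") is indistinguishable
    # from a finished utterance once it crosses the fixed timeout.
    return state.silence_ms >= SILENCE_TIMEOUT_MS
```

Production endpointers layer acoustic and language cues on top of a timeout like this, but as the rest of this piece argues, the lived experience still breaks on exactly these hesitations and mid-thought pivots.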
Alexa Plus, touted as “smarter than ever” and built on generative AI foundations, promises to be more conversational, proactive and humanlike. According to Amazon’s February launch event, this upgrade would allow Alexa to make restaurant reservations, learn your preferences, manage reminders with nuance and even suggest gifts for family members. There was even a moment in the demo where Alexa referred to itself as “your best friend in the digital world.”
That’s cute marketing. But is it smart marketing? I don’t think so. Based on my experience and the limited early access reports trickling out, it doesn’t look like Alexa Plus has solved the key user experience problems that matter. It may be capable of longer, more semantically rich responses. It may be plugged into better knowledge systems. But that doesn't mean it knows when to speak, when to listen or how to engage in real dialogue. And these are the very basics of what makes conversation feel human.
More Features Don’t Mean Better Experiences
A recent article by CNET highlighted how even early testers of Alexa Plus have reported that many promised features (e.g., ordering food through a conversation or recognizing family members visually) are still missing. The rollout appears to be quiet and staggered, perhaps to avoid the scrutiny Apple has recently faced over Siri’s repeated AI delays. And there’s still a lingering question no one at Amazon seems eager to answer: Are we finally getting an Alexa that can converse naturally, or just one that answers more complex questions with more flair?
The distinction is critical. An assistant that can spew longer, more informed responses might impress in demos, but it won’t necessarily be more useful or likable in real life. In fact, it might make the experience worse. Who wants to listen to 30 seconds of a voice assistant talking on and on? Voice as a medium is inherently ephemeral: it is rewarded with adoption for its speed, clarity and conciseness. If Alexa Plus drones on like a podcast host delivering a monologue, it will be less effective.
Unless Amazon can deliver an experience that is both smarter and tighter, most users will feel like nothing much has changed. And they’ll be right. That’s because the real breakthrough in voice AI won’t come from better answers; it will come from better conversation. That means the ability to interrupt and be interrupted naturally. It means assistants that know when to wait, when to stop and when to listen. And it means systems that don’t require you to formulate your entire command in advance like a spell from a magic book, instead letting you fumble, pause and be human.
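For a picture of what “knowing when to wait, when to stop and when to listen” means mechanically, here is a toy turn-taking state machine in Python. The states and event names are my own illustrative assumptions, not how Alexa Plus or any shipping assistant is built; the transition that matters most is barge-in, where user speech during playback hands the floor straight back to the user.

```python
# Toy turn-taking state machine (illustrative assumptions, not a vendor design).
# The key idea: "speaking" must be interruptible, and "listening" must tolerate pauses.

from enum import Enum, auto


class Turn(Enum):
    IDLE = auto()       # waiting for the wake word
    LISTENING = auto()  # user holds the floor; tolerate pauses and fillers
    THINKING = auto()   # brief gap while a response is prepared
    SPEAKING = auto()   # assistant holds the floor, but yields on barge-in


def next_state(state: Turn, event: str) -> Turn:
    """Advance the dialogue on events such as 'wake_word', 'user_speech',
    'endpoint_detected', 'response_ready' and 'playback_done'."""
    if state == Turn.IDLE and event == "wake_word":
        return Turn.LISTENING
    if state == Turn.LISTENING and event == "endpoint_detected":
        return Turn.THINKING
    if state == Turn.THINKING and event == "response_ready":
        return Turn.SPEAKING
    if state == Turn.SPEAKING and event == "user_speech":
        return Turn.LISTENING  # barge-in: stop talking and listen immediately
    if state == Turn.SPEAKING and event == "playback_done":
        return Turn.IDLE
    return state  # ignore events that don't apply in the current state
```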
That’s not what Alexa Plus is promising to offer. At least not yet.
Voice AI Is Still Stuck on the Fundamentals
This continued failure to deliver on basic usability comes after literally a decade of iteration and billions in investment. The fact that teams brilliant enough to invent accurate far-field speech recognition have still not resolved conversational flow is, frankly, scandalous. It’s like building a self-driving car that can navigate highways flawlessly but still doesn’t know how to stop at a red light.
I say this not to diminish the technological achievements behind Alexa and its peers. Far from it. But I do want to temper the breathless hype with a dose of realism and a note of skepticism. Because despite the advancements enabled by generative AI, and despite all the marvels that LLMs and foundation models have delivered, the final frontier in AI is still speech. This goes beyond language or text. It includes the way humans actually speak, in ways that are improvised, interruptible, fallible and shaped by context and emotion.
When Amazon first introduced Alexa, it was a genuine marvel. It made voice computing accessible and even fun. But in the decade since, the progress has been more cosmetic than structural. The Alexa of today still interrupts, still can’t remember what you told it yesterday, still makes you shout its name just to be heard, still mispronounces words and still won’t correct itself.
If Amazon really wants Alexa to feel like a best friend (or even a tolerable coworker), it must do more than bolt ChatGPT onto a speaker. It must rethink voice interaction from the ground up. Until then, Alexa will remain what it has always been, a brilliant piece of technology that stops just short of brilliance in experience.