The Gist
- AI beats doctors at diagnosis. Microsoft’s MAI-DxO system solved 85.5% of complex cases, a fourfold improvement over human doctors.
- Orchestration is the secret weapon. The system uses multiple specialized AI agents debating and collaborating in real time to improve accuracy and efficiency.
- Doctors still matter — just differently. Suleyman says AI can enhance education, reduce costs and anxiety, while humans retain the role of empathetic guide and judgment provider.
As AI models get commoditized, the value will be added in that final layer of orchestration, Microsoft AI CEO Mustafa Suleyman says.
Microsoft earlier this month announced it built an AI diagnostician that outperforms human doctors on complex cases.
The system, called MAI-DxO, uses two bots to sort through a patient’s medical history and solves 85.5% of patient cases when paired with OpenAI’s o3 model. The results are a major leap above the 20% average accuracy that human doctors achieved on the same cases, although the humans were restricted from searching the web or speaking with colleagues.
In an in-depth conversation shortly after Microsoft announced the results, Microsoft AI CEO Mustafa Suleyman shared how the AI diagnostician was able to 4X the performance of human doctors, what it means for the future of medicine, and whether this is a positive trend for society.
You can read our full conversation below, edited lightly for length and clarity.
Table of Contents
- AI-Driven Search Transforms Healthcare Queries
- Inside Microsoft’s Two-Bot Diagnostic System
- The Orchestration Layer Is Where Value Emerges
- MAI-DxO Beats Human Doctors by 4X in Accuracy
- Doctors Can Learn From AI’s Diagnostic Thinking
- AI Can Detect Rare Diseases Doctors May Never See
- Why Training Data Doesn’t Explain This Performance
- Why Orchestrators May Outperform Single Models
- Reducing Cost and Test Anxiety Through Smarter AI
- Current Limitations and Long-Tail Use Cases
- AI Won’t Replace Empathy and Human Guidance
- The Doctor’s Role Is Evolving, Not Disappearing
- Beyond Healthcare: Orchestrated AI for Any Field
- What’s Next for MAI-DxO in Clinical Settings?
AI-Driven Search Transforms Healthcare Queries
Alex Kantrowitz: Hi Mustafa, good to see you again. First off, Copilot and Bing now field 50 million medical queries per day. Is that good?
Mustafa Suleyman: It's incredible, because we're already making access to information super cheap and concise with just search engines. And now with Copilot, answers are much more conversational. You can tone them down so they suit your specific level of knowledge and expertise, and as a result, more and more people are asking Copilot and Bing health-related questions.
The queries range from a cancer issue that someone's dealing with, to a death in the family, to a mental health issue, to just having a skin rash. And so the variety is huge, but obviously we've got a really important objective here to try and improve the quality of our consumer health products.
Do the health questions that come into chatbots look different from search?
Copilot's answers tend to be more succinct and responsive to the style and tone of the individual person asking the question, and that tends to encourage people to ask a second follow-up question. So it turns it into more of a dialog or a consultation that you might end up having with your doctor. So they are quite different to a normal search query.
Inside Microsoft’s Two-Bot Diagnostic System
Speaking of dialogs, let's discuss Microsoft’s new AI diagnostician system. It's actually two bots, where one bot acts as a gatekeeper to all of a patient's medical information, and the other asks questions about that history and makes a diagnosis. You’ve found the system performs better than humans in diagnosing disease.
That's exactly right. We essentially wanted to simulate what it would be like for an AI to act as a diagnostician, to ask the patient a series of questions, to draw out their case history, go through a whole bunch of tests that they may have had — pathology and radiology — and then iteratively examine the information that it's getting in order to improve the accuracy and reliability of its prediction about what your diagnosis actually is.
We actually use the New England Journal of Medicine case histories, hundreds of these past cases. One of these cases comes out every single week, and it's like an ultimate crossword for doctors. They don't see the answer until the following week. And it's a big guessing game to go back through five to seven pages of very detailed history, and then try to figure out what the diagnosis actually turns out to be.
The Orchestration Layer Is Where Value Emerges
I thought one of the benefits of generative AI is it can take in a lot of information and then come to answers — often in one shot. What’s the benefit of having multiple bots sort through it?
The big breakthrough of the last six months or so in AI is these thinking or reasoning models that can query other agents, or find other information sources at inference time, to improve the quality of their responses. Rather than just giving the first best answer, the model instead goes and consults a range of different sources, and that improves the quality of the information that it finally gets to. So we see that this orchestrator, which under the hood uses four different models from the major providers, can actually improve the accuracy of each of the individual models, and of all of them together collectively, by a very significant degree, about 10% or so. So it's a big step forward. And I think that as the AI models get commoditized, really, all the value will be added in that final layer of orchestration, product integration, and that's what we're seeing with this diagnostic orchestrator.
MAI-DxO Beats Human Doctors by 4X in Accuracy
So it’s a 10% increase in accurately diagnosing on top of the standard LLMs?
Yes. And in fact, we actually benchmark that against human performance. So we had a whole bunch of expert physicians play this simulated diagnosis environment game, and they, on average, get about one in five, right? So about 20%. Whereas our orchestrator gets about 85% accuracy, so it's four times more accurate, which, in my career, I've never seen such a big gulf between human level performance and the AI system's performance.
Many years ago, I worked on lots of diagnoses for radiology and head and neck cancer and mammography, and the goal was just to take a single radiology exam and predict, does it have cancer? And that was the most we could do. Whereas now it's actually producing a very detailed diagnosis, and doing that sequentially through this interactive dialog mechanism. And so that massively improves the accuracy.
Doctors Can Learn From AI’s Diagnostic Thinking
What if you have the same thing happen to medicine as is happening with beginner level code, where people learn to code using copilots, but when something breaks, it becomes harder for them to figure out what's going on. If you're a doctor, if you outsource some of your thinking to these bots, is that a problem?
So this isn't just giving a black box answer. That's why the sequential diagnosis part is so important, because you can watch the AI in real time, ask questions of the case history, get an answer, shape a new question, get an answer, present a new question, then ask for a different type of testing, get those results, interpret them, then give an answer.
The dialogic nature means that a human doctor can follow along and actually learn in a very transparent way. It's almost like having an interpretability mechanism inside the black box of the LLM, because you can see its thinking process in real time. And in fact, you don't just see the chain of thought, which is the inner monologue.
We've actually created five different types of agents that all have a debate, and we call this chain of debate. They negotiate with one another. They try to prioritize different aspects, like cost or efficiency. And the coordination of those different skill sets among the agents is actually what makes this so effective.
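To make the "chain of debate" idea concrete, here is a minimal, hypothetical sketch of what such an orchestration loop could look like. The agent roles, the stubbed proposals, and the majority-vote consensus rule are all illustrative assumptions based only on the description above, not Microsoft's actual MAI-DxO implementation; a real system would call LLMs where the stub returns a placeholder.

```python
# Hypothetical sketch of a multi-agent "chain of debate" orchestrator.
# Roles, proposal format, and consensus rule are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    priority: str  # the aspect this agent is prompted to optimize for

    def propose(self, case: str, peer_proposals: dict) -> dict:
        # A real agent would call an LLM with the case and its peers'
        # proposals; here we return a stub so the loop itself is runnable.
        return {"diagnosis": f"hypothesis from {self.name}",
                "priority": self.priority}


def chain_of_debate(agents, case, rounds=2):
    """Each round, every agent sees the others' current proposals and may
    revise its own; after the final round, proposals are tallied."""
    proposals = {a.name: a.propose(case, {}) for a in agents}
    for _ in range(rounds):
        for agent in agents:
            peers = {k: v for k, v in proposals.items() if k != agent.name}
            proposals[agent.name] = agent.propose(case, peers)
    # Naive consensus: majority vote over the proposed diagnoses.
    votes = {}
    for p in proposals.values():
        votes[p["diagnosis"]] = votes.get(p["diagnosis"], 0) + 1
    return max(votes, key=votes.get)


agents = [Agent("hypothesis-generator", "accuracy"),
          Agent("test-chooser", "information gain"),
          Agent("challenger", "ruling out alternatives"),
          Agent("cost-steward", "cost"),
          Agent("checklist-keeper", "consistency")]
```

The key design point the interview highlights is that each agent is tuned to defend a different objective, so disagreements surface explicitly before a consensus is reached, rather than being averaged away inside a single model's forward pass.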
How MAI-DxO Works and Why It Matters
A summary of Microsoft AI CEO Mustafa Suleyman’s key points about the diagnostic system.
| Feature | Description | Impact |
|---|---|---|
| Dual-bot architecture | One AI agent retrieves patient data; another interrogates the case and makes the diagnosis | Simulates a clinical dialogue for better accuracy and transparency |
| Agent orchestration | Five specialized agents debate and collaborate during diagnosis | Boosts accuracy by about 10% beyond individual LLMs; enables nuanced decision-making |
| Performance vs. humans | 85.5% diagnostic accuracy compared to 20% for expert doctors | Fourfold improvement over average physician performance |
| Educational value | Doctors can follow AI's logic and learn from rare case detection | Enhances medical training and clinical exposure |
| Cost optimization | AI selects minimum necessary tests to reach accurate diagnoses | Reduces test burden, patient anxiety, and overall care costs |
| Future use | Still in research phase; potential for hospital integration | Aims for wide deployment across medical platforms and queries |
AI Can Detect Rare Diseases Doctors May Never See
Even if a doctor can watch this take place, it turns their role in diagnoses from active to somewhat passive. Is there some benefit in having the doctor work in that active phase vs. watching bots have a conversation?
I think that's totally true. I just still think this is going to be an amazing education tool for doctors to actually learn about the breadth of cases they never would have encountered. For example, we actually ran the orchestrator last week on the most recent case study in the New England Journal of Medicine, and it correctly diagnosed the case, a disease that had only ever been seen 1,500 times in all of medical literature. It was such an obscure, long tail disease, so very few doctors are ever going to get the chance to see that. And so the ability to accurately detect these kinds of conditions in the wild, in production, I think, will massively outweigh the risk of doctors not being able to exercise their judgment in the way that you describe.
I think the tools just change how you work. And everyone will have to adapt to that over time, but the utility is just so unquestionably beneficial that I think it makes it worthwhile.
Why Training Data Doesn’t Explain This Performance
Is it able to do that because the cases are in the training data? And even if they are, does it really matter?
Well, part of the reason why we partnered with the New England Journal of Medicine is because each week, they put out a brand new case which has never even been digitized. So there's no question that it's not in the training data. This case, for example, from last week — there's absolutely no way it's in the training data, because it literally just got published. So we think that's the case going back for all of the previous cases too. So I don't think there's any chance of that. This really is doing an abstraction of judgment. It's not just reproducing training data. It is actually doing some kind of inference or thinking based on the knowledge that it does already have.
Why Orchestrators May Outperform Single Models
Your system didn’t show as big of an improvement over reasoning models as it did over standard LLMs. Is it possible that state-of-the-art reasoning models will learn how to do stuff like this, and you won't need this type of specialized sequencing to achieve similar results?
The real value here, long term, is in how you orchestrate a variety of different models with different types of expertise. So each one of these five agents has been prompted and designed to have a different type of expertise, and then they jointly negotiate and reason collectively. It may be that they all get subsumed into a single model in the future. I don't know. Right now, it doesn't look like that. Right now, orchestrators are able to drive much bigger gains.
The other thing that we see, for example, is that it's able to optimize for cost as well, and reduce the cost by avoiding unnecessary tests versus the humans. So that's a function of cost being factored into the orchestrator at inference time, which wouldn't be something that you could reconcile inside of a single model in pre-training or post-training.
In medicine, cost is a factor. You know, you could order every single test and probably do better diagnosing people, but it's just not a reality today. And it is interesting to watch the bot work through which tests to order and then come in at a lower cost than typical doctors.
More tests also make people feel anxious. So it's not just cost, but it's actually the patient experience that gets optimized for as well.
Reducing Cost and Test Anxiety Through Smarter AI
And so how does it decide which tests to order and how to optimize cost?
What the model is trying to do is to get to the best diagnosis with the minimum number of tests. The model has a much broader range of awareness of which test results tend to correlate with which particular diagnostic outcomes. And because it has seen so many more cases than any given human, it can do a better job of judging, given the case history it already knows about a patient, what the minimum number of necessary tests is to get the next piece of information, continue the diagnosis, and make it more accurate.
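The decision Suleyman describes can be framed as a stopping problem: keep ordering tests only while the expected diagnostic value justifies the cost. The sketch below is a deliberately simplified illustration of that idea; the scoring rule (information gain per dollar), the confidence threshold, and the toy test data are all my assumptions, not the criterion MAI-DxO actually uses.

```python
# Illustrative cost-aware test selection: order the test with the best
# expected-information-per-cost ratio, and stop once diagnostic confidence
# crosses a threshold. Numbers and scoring rule are assumptions.

def pick_next_test(candidate_tests, confidence, threshold=0.9):
    """Return the next test to order, or None if confidence is already
    high enough that committing to a diagnosis beats further testing."""
    if confidence >= threshold:
        return None  # confident enough; no more tests needed
    return max(candidate_tests, key=lambda t: t["info_gain"] / t["cost"])


candidate_tests = [
    {"name": "CBC",    "cost": 30,   "info_gain": 0.10},
    {"name": "MRI",    "cost": 1200, "info_gain": 0.35},
    {"name": "biopsy", "cost": 800,  "info_gain": 0.50},
]

# At 60% confidence, the cheap blood panel wins on value-per-dollar,
# even though the biopsy would yield more information in absolute terms.
next_test = pick_next_test(candidate_tests, confidence=0.60)
```

With these toy numbers, the CBC is chosen first (0.10/30 beats 0.50/800), which mirrors the behavior described in the interview: the orchestrator avoids expensive tests when a cheaper one moves the diagnosis forward. A real system would express this reasoning in natural language rather than a fixed formula.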
Current Limitations and Long-Tail Use Cases
Can I tell you something else that surprised me? It seemed like the bot struggled with more common types of diagnoses. Do you think it's just waiting to diagnose that rare case? So it skips over the fact that it could just be a stomachache?
We haven't applied it to your everyday sort of GP or primary care physician experience, where you have a skin rash, or you've got a pain with your knee, and so this does tend to be the longer tail of complex cases. But it goes without saying that there is less of that information in the training data. And we know that if there is more training data, the models do better. So the model is almost certainly going to do better in those primary care type environments than it's doing on the long tail.
AI Won’t Replace Empathy and Human Guidance
Your release around this research says doctor clinical roles are “much broader than simply making a diagnosis. They need to navigate ambiguity and build trust with patients and their families in a way that AI isn't set up to do.”
Can I just take the other side of this? If you're talking with a bot every day, you might trust it more than a doctor you see once a year, or even a new specialist. So is AI poised to take on some of that work as well?
It's possible that it will do some of that work. Certainly, I hope one day that it will be good enough to do that kind of work. But nothing is going to replace the human to human connection that you get in the physical, real world at a moment of heightened anxiety and fear, when you're just facing one of the biggest challenges in your life and you have a massive diagnosis ahead of you, or when you just need everyday, regular treatment and care. So that's going to continue to be the role of a doctor, and hopefully they get to spend more time face to face with patients.
The Doctor’s Role Is Evolving, Not Disappearing
So doctors become something of auditors of the output of these AI bots in the future? Shepherds guiding patients through their care journey?
There's still going to be a tremendous amount of judgment that is required by expert human doctors, both as part of the diagnosis, and then secondly, making the judgment about what works for the patient, and factoring in, and helping a patient decide what journey do I want to take, given that I now know I've got this diagnosis, what treatments do I want to take and when, and what are the trade-offs there. So that is going to require a tremendous amount of judgment, and so it's not just about the human-to-human connection and being on your feet. It's also thinking in a deeply empathetic way alongside a patient that's received a diagnosis to plan their treatment course.
Beyond Healthcare: Orchestrated AI for Any Field
What other professions could you see this type of system being applied to?
The basic method of these orchestrators is that they tune different AIs to play very specific roles and then have the AIs negotiate with one another. That obviously applies to a lot of different environments, be it in business or even in government in the future. And so I think if this finding holds and applies to other domains, it will be very, very promising, because it's also how we, collectively as a human species, work, right? We generally consult very widely when we make decisions, often building consensus before coming to a final conclusion. So it has a lot of parallels to the human world.
What’s Next for MAI-DxO in Clinical Settings?
Lastly, this isn't being rolled out broadly in a hospital setting yet, so everyone who's panicking at this point can relax. But is that the ultimate goal? Is it an education tool, or does this actually become integrated in medical centers and hospitals in the coming years?
At the moment, this is just early research, and we're figuring out how best to deploy it, but I think the fact that we're able to get a 4x improvement on human performance across the board, on diagnosis with significantly reduced cost in super fast time — to me, that feels like steps towards a true medical super intelligence, and we would want to try and make that available as widely as possible, as quickly as possible, including for our 50 million daily health queries. And so that's going to be our ambition: Get it in front of consumers as fast as possible, in the safest way possible.