Blogerroom

Mr. Aayush Bhatt

June 18, 2026 · 11 min read

๐ŸŒ Language

OpenAI's Reasoning Model Just Outperformed Experienced Doctors in Emergency Room Diagnosis

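A Harvard and Beth Israel study published in Science found that OpenAI's o1 identified the correct or near-correct diagnosis in 67% of ER triage cases, versus 55% and 50% for the two attending physicians it was compared against. Here is what it actually means.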

Introduction: A Study That Made Its Own Authors Uncomfortable

Adam Rodman did not expect the result he got. A clinical researcher at Beth Israel Deaconess Medical Center in Boston and one of the senior authors of the study, he was candid about his state of mind before seeing the data: "I thought it was going to be a fun experiment but that it wouldn't work that well. That was not at all what happened."

What happened is now published in the journal Science. A team of researchers led by physicians and computer scientists at Harvard Medical School, Beth Israel Deaconess Medical Center, and Stanford University tested OpenAI's o1 reasoning model, the company's first model capable of step-by-step reasoning, against experienced internal medicine physicians on a series of diagnostic tasks drawn from real emergency room cases. The AI matched or outperformed the physicians at every diagnostic touchpoint they measured. In triage cases, where the least information is available and the stakes of an early error are highest, the gap was most pronounced.

Rodman described the conclusion that most struck him: "This is the big conclusion for me: it works with the messy real-world data of the emergency department. It works for making diagnoses in the real world."

What the Researchers Did, and Why the Design Matters

The study's design is the element that distinguishes it from the large body of previous AI medical research that critics have dismissed as unrealistically clean. Most prior studies testing AI against physicians used carefully curated datasets: standardized case vignettes, preprocessed records, or published clinical scenarios where the messy ambiguity of real medicine had been edited out. The researchers at Harvard and Beth Israel deliberately refused to do that.

For the core real-world component, they selected 76 consecutive patients who came into the Beth Israel Deaconess emergency department and pulled their electronic health records exactly as those records existed: unprocessed, unedited, and formatted precisely as the treating physicians saw them. The AI received the same information a doctor would see at each stage of the visit: what was available at triage, what additional information arrived after initial assessment, and what the full picture looked like before an admission decision was made. Two internal medicine attending physicians independently generated diagnoses at each of those three stages. OpenAI's o1 model and the GPT-4o model did the same, given identical data at identical moments.

The diagnoses were then assessed by two other attending physicians who evaluated them blind, without knowing which came from a human and which came from an AI. Those evaluators scored diagnostic accuracy on the basis of whether the correct or near-correct diagnosis appeared in the list generated at each stage.
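
The blinded scoring step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the study's actual pipeline: the names `Submission`, `blind`, `grade`, and `accuracy` are hypothetical, and the `grade` function is a crude stand-in (an exact string match) for what was, in the study, a judgment call by attending physicians about exact or near-correct diagnoses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One diagnosis list for one case at one stage (hypothetical structure)."""
    case_id: int
    stage: str            # "triage", "initial_assessment" or "pre_admission"
    source: str           # e.g. "physician_1" or "o1"; hidden from graders
    differential: tuple   # candidate diagnoses for this case and stage


def blind(submissions, seed=0):
    """Shuffle submissions and replace each source with an opaque token.

    Returns the blinded items (what a grader sees) and a key that maps
    each token back to (source, stage) for unblinding afterwards.
    """
    rng = random.Random(seed)
    shuffled = list(submissions)
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for i, sub in enumerate(shuffled):
        token = f"item-{i:04d}"
        key[token] = (sub.source, sub.stage)
        blinded.append((token, sub.case_id, sub.stage, sub.differential))
    return blinded, key


def grade(blinded, reference):
    """Stand-in grader: a hit if the reference diagnosis is in the list.

    Assumption: real graders judged near-correct answers too, which a
    plain string match cannot capture.
    """
    return {
        token: reference[case_id] in differential
        for token, case_id, _stage, differential in blinded
    }


def accuracy(scores, key):
    """Unblind the scores and report the hit rate per (source, stage)."""
    hits, totals = {}, {}
    for token, hit in scores.items():
        group = key[token]
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(hit)
    return {group: hits[group] / totals[group] for group in totals}
```

The point of the design is that `grade` never receives the `source` field, so a grader's opinion of AI or of a particular colleague cannot leak into the score. The key is consulted only after every item has been graded.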

The results were consistent across every touchpoint. At triage, the point of greatest uncertainty and least available information, o1 correctly identified the exact or near-correct diagnosis in 67 percent of cases. The two attending physicians scored 55 percent and 50 percent respectively. The difference was most pronounced precisely at the point in the clinical encounter where accurate early diagnosis has the most impact on patient outcomes.
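
To get a feel for how much uncertainty sits behind percentages measured on 76 cases, here is a rough back-of-the-envelope sketch that puts a Wilson score interval around each reported triage figure. It assumes each percentage was computed over all 76 cases, which this article does not confirm, and it is not the statistical method used in the paper. The function name `wilson_interval` and the labels in `reported` are illustrative.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion p_hat over n trials."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: every figure below is a share of the same 76 triage cases.
N_CASES = 76
reported = {"o1": 0.67, "physician_1": 0.55, "physician_2": 0.50}

intervals = {name: wilson_interval(p, N_CASES) for name, p in reported.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {reported[name]:.0%} (approx. 95% interval {low:.0%} to {high:.0%})")
```

With a sample this small, each interval spans roughly twenty percentage points, and the interval for o1 overlaps the one for the better-scoring physician. Overlap by itself is not a significance test, and because the AI and the physicians saw the same cases, a paired comparison would be the more appropriate analysis. The sketch is only meant to show why 76 cases from one hospital call for the caution the authors themselves express.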

Management Reasoning: The Result That Surprised Researchers Most

The diagnostic accuracy finding is striking. The management reasoning finding may be more significant.

Management reasoning is what physicians do after establishing a diagnosis: the sequence of decisions about which antibiotics to prescribe, which tests to order, how to approach conversations with patients about goals of care, whether to escalate or defer. It is a more complex cognitive task than diagnosis because it requires integrating objective clinical data with subjective contextual factors: what resources are available, what the patient's own wishes are, what the local treatment protocols specify, and how to weigh trade-offs between competing options.

On management reasoning tasks, o1 significantly outpaced not only previous AI models but also human physicians using conventional decision support tools, including up-to-date Google search. Peter Brodeur, a clinical fellow at Beth Israel who was part of the research team, explained why this result is less surprising on reflection than it appears at first: "Management reasoning is likely a more complex task than diagnostic reasoning. It requires many considerations of not only the objective features of a case, but also subjective factors: what context and situations you're in, and therefore, it probably doesn't come as a surprise that a reasoning model performs significantly better at such tasks than humans and ChatGPT-4."

The study also included components beyond the real-world ER cases. The researchers tested the model against published case reports from the New England Journal of Medicine and against clinical vignettes used to evaluate established benchmarks in medical education. The findings were consistent across all of these. Arjun Manrai, assistant professor of Biomedical Informatics at Harvard's Blavatnik Institute and a senior co-author of the study, summarized the aggregate result directly: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines." A third statement from the study itself captures the ceiling implication: "We're already at the ceiling."

What Doctors Say About These Results

The researchers who conducted this study are not AI advocates selling a product. They are clinicians with direct responsibility for patient care, and their public statements about what the results mean are notable for their restraint.

Adam Rodman explicitly said he gets "a little bit queasy about how some of these results might be used." That is a clinical researcher saying, in print, that he is concerned about how his own published findings will be interpreted and applied. The concern is not that the results are wrong. It is that they will be cited by technology companies, health system administrators, and policymakers in ways that skip past the significant distance between demonstrating that an AI can match a physician's diagnostic accuracy on 76 retrospective cases and demonstrating that it is safe to deploy in a live clinical environment with real patients.

Manrai was direct on this point: the team's findings do not mean that "AI replaces doctors, despite what some companies selling AI-based healthcare are likely to say." That caveat, which preemptively contradicts the likely commercial use of the paper, is unusual in scientific publication. It reflects researchers who understand how their work will travel beyond their control once it is published.

The most specific methodological criticism came from outside the authoring team. Emergency physician Kristen Panthagani raised two important concerns. First, comparing o1 against internal medicine attending physicians rather than emergency medicine specialists lowers the baseline. Second, diagnostic reasoning in a retrospective case review is not the same as the clinical judgment required in a live emergency department, where physicians are simultaneously managing multiple patients, handling physical examinations, interpreting imaging in real time, and making split-second decisions under pressure. Equating the two papers over a significant gap between the study's conditions and actual emergency medicine practice.

These are fair criticisms. The study's design is more rigorous than most prior work in this space, but rigorous retrospective case review and live clinical deployment are not the same thing. The researchers know this, which is why they explicitly called for "controlled trials of the technology" rather than immediate deployment.

The Risks of Deploying AI in Emergency Medicine

Emergency medicine is one of the highest-stakes environments in which AI could be deployed, and the gap between demonstrating diagnostic accuracy in a retrospective study and safely deploying a system in a live emergency department is filled with risks that no benchmark score addresses.

The first risk is the black box problem. When o1 generates a diagnosis, it produces a text output. It does not explain, in a form a supervising physician can audit step-by-step, why it weighted each piece of information the way it did. In a clinical environment where accountability for patient outcomes is legally and ethically assigned to human physicians, a diagnosis generated by a system whose reasoning cannot be fully inspected creates genuine liability questions that no current regulatory framework fully answers.

The second risk is the failure mode. The study measured how often the AI got the diagnosis right. It did not measure in detail what happened when it was wrong: whether it failed in predictable, recoverable ways or in rare, catastrophic ones. In emergency medicine, the tail of the failure distribution is not an academic concern. A patient who receives a wrong diagnosis because an AI missed an unusual presentation of a common condition in a way a physician would have caught is not a statistical footnote. They are a person whose outcome was materially worsened by the system's error.

The third risk is automation bias. Once clinical staff begin receiving AI-generated diagnoses alongside or before their own assessments, the psychological tendency to anchor on the AI's output, and to under-scrutinize it, becomes a structural feature of the clinical environment. Automation bias in aviation has contributed to crashes where human pilots failed to override automated systems whose outputs were wrong. The same dynamic in emergency medicine could produce systematic underperformance relative to what would happen if physicians used the AI as a second opinion rather than as a primary input.

The fourth risk is distribution shift. The 76 patients in this study came from one emergency department in Boston in a specific time period. A model that outperforms physicians on those cases may perform very differently on patients from different demographics, different geographic regions, or with disease presentations that are common in other populations but underrepresented in the training and validation data.

What This Means for the Future of Clinical Decision Support

The honest takeaway from this study is one that its own authors articulated: this result demonstrates that reasoning models have reached a level of diagnostic capability that makes controlled clinical trials not just possible but necessary. The question has moved from "can AI do this?" to "how do we safely deploy AI that can do this?"

That shift in the question is significant. It moves the conversation from research into implementation: from demonstrating capability to designing the governance, liability, accountability, and audit frameworks within which that capability can be safely used. The FDA has already approved at least one generative AI product for clinical use. The pathway exists. It has not been built out at the speed or with the rigor that a result like this study demands.

The most realistic near-term deployment model is clinical decision support: using AI not as a replacement for physician judgment but as a tool that surfaces diagnostic possibilities the physician might not have considered, flags management options the physician might have missed, and checks the clinical reasoning against the full scope of what the patient's records contain. In that model, the physician remains accountable, the AI surfaces information, and the diagnostic accuracy advantage the study measured becomes a genuine safety benefit rather than a liability risk.

That is the destination the authors of this study are pointing toward. It is not where the healthcare system is today. Getting there requires prospective trials, regulatory frameworks, liability clarification, and the kind of institutional will that does not follow automatically from a publication in Science.

Conclusion: A Landmark Result That Requires a Measured Response

The Harvard and Beth Israel study published on April 30, 2026 is the most rigorous real-world test of an AI reasoning model's diagnostic capabilities published to date. Its findings (that o1 correctly identified diagnoses in 67 percent of triage cases against 55 and 50 percent for experienced physicians, and outperformed humans on management reasoning tasks using real, unprocessed emergency department data) are genuine landmarks in the clinical AI literature.

What those findings do not mean is that physicians should be replaced in emergency departments, that hospitals should deploy AI diagnostic systems without controlled trials, or that the distance between laboratory performance and clinical safety has been bridged. The researchers who produced the result are among the clearest voices on this point.

What the findings do mean is that the conversation about AI in clinical settings has entered a new phase. The capability has been demonstrated in conditions closer to reality than any prior study achieved. The ethical, legal, and practical work of translating that capability into safe, accountable clinical tools now has to catch up with the science. That work is harder and slower than a benchmark comparison, and it is the work that will actually determine whether this result helps patients.

Adam Rodman said the conclusion that struck him most was that the AI works with real-world emergency department data. The harder conclusion is what to do with that fact.



Written by

Mr. Aayush Bhatt

Software Engineer interested in how models work and where they fail.

โ† Back to AI