Doctors push back on Harvard study claiming "AI emergency room diagnoses are more accurate than human doctors": overhyped, and lacking a real-world comparison

A Harvard study reports that an AI model reached 67.1% diagnostic accuracy in the emergency room, surpassing internal medicine physicians. Emergency physicians counter that the coverage is media hype: the study did not compare the AI against practicing emergency doctors, the models can only process text, and AI cannot yet replace humans in independent practice.

Harvard Study: AI Outperforms Human Doctors in Emergency Room Diagnosis

On April 30th, a study published in Science reported that AI-generated emergency room diagnoses were more accurate than those of two human doctors, quickly attracting industry and media attention. However, it would be premature to conclude from this that AI can truly serve as a doctor.

A research team of doctors and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center ran an experiment on 76 real patients in Beth Israel's emergency department, comparing diagnoses generated by OpenAI's o1 and GPT-4o models against those of two internal medicine attending physicians.

Results showed that, across all three diagnostic stages measured (initial triage, the emergency physician's preliminary assessment, and the decision to admit patients to a general ward or intensive care), o1's accuracy outperformed both GPT-4o and the human doctors.

The AI models' advantage was greatest at initial triage, the stage where the least information is available and correct decisions matter most. There, o1 produced a fully accurate or very close diagnosis in 67.1% of cases, versus accuracy rates of 55.3% and 50.0% for the two human doctors.

Image source: Harvard study. Comparison of the diagnostic performance of two internal medicine attending physicians, o1, and GPT-4o across 76 clinical cases.

No pre-processing, testing with real medical records

Unlike many previous studies, the Harvard team did not pre-process the real-world medical data before testing the models: the emergency cases were presented to the AI exactly as they appeared in the electronic health records.

Regarding methodology, Thomas Buckley, a PhD student in Harvard Medical School's AI in Medicine doctoral program, explained that to understand how the models perform in real environments, the team had to test them during the early stages of patient care, when clinical data is still sparse.

Co-author Adam Rodman also mentioned that the models’ diagnostic accuracy in early decision-making stages of real emergency cases matched or even exceeded that of attending physicians, which surprised the research team.

Image source: Harvard study. Comparison of GPT o1-preview, GPT-4, and doctors' performance in clinical diagnostic reasoning.

AI Can Only Handle Text, Real Medical Practice Involves Non-Text Inputs

The report also pointed out that current generative AI chat models still have significant limitations in reasoning with non-text inputs.

This is because the study only evaluated the AI models on pure text input. Real clinical environments are filled with non-text signals, such as auditory cues like a patient's expressions of pain, and visual data like medical imaging.

AI Cannot Practice Medicine Independently

Although AI demonstrates excellent diagnostic capabilities, the study emphasizes that this does not mean AI models can independently perform medical work.

Harvard Medical School clinical researcher Peter Brodeur explained that, while AI models may correctly identify primary diagnoses, they might also recommend unnecessary tests, exposing patients to additional health risks. Therefore, human oversight remains essential for evaluating medical performance and safety.

Harvard Study Lacks Real Emergency Physician Comparison

Emergency physician Kristen Panthagani also commented that, although Harvard’s findings are interesting, they have led to some exaggerated headlines.

She pointed out that the Harvard study compares AI with internal medicine attending physicians, and lacks any data comparing the AI directly with emergency physicians actually practicing in the field:

“If we are to compare AI tools with clinicians’ abilities, we should start by comparing them with doctors actively practicing in that specialty. If large language models (LLMs) outperform neurosurgeons in neurosurgery exams, I wouldn’t be surprised, but knowing that doesn’t add much practical value.”

She noted that the primary goal of initial emergency assessments is to determine whether a patient has a life-threatening condition, not to guess the final diagnosis.

Harvard’s study also warns that there is currently no formal accountability framework for AI diagnoses. Patients still need human doctors to guide them through life-and-death decisions and help with difficult treatment choices.

The research team calls for rigorous prospective clinical trials in real patient-care environments to evaluate these AI tools and to understand how to deploy them safely in clinical practice as assistants to human physicians.

Further reading:
Why is generative AI progressing slowly in healthcare and law? Replit founder: Verifiability is key
