ChatGPT as a doctor replacement? Study shows sobering results
AI language models excel in medical exams – but when real people ask them for advice, collaboration fails.
(Image: Lalaka/Shutterstock.com)
Large language models like GPT-4o are now achieving near-perfect results in medical knowledge tests. They pass the US medical licensing exam, summarize patient records, and can classify symptoms. Health authorities worldwide are therefore examining whether AI chatbots could serve as the first point of contact for patients – a kind of "new gateway to the healthcare system," as stated in a strategy paper by the UK's NHS.
However, the study "Reliability of LLMs as medical assistants for the general public: a randomized preregistered study" by researchers at the University of Oxford significantly dampens these hopes. The work appears in the journal Nature Medicine, and a preprint is available on arXiv. The central finding: the models' clinical knowledge does not transfer to interactions with real people.
1298 participants, ten medical scenarios
For the randomized, controlled study, the researchers recruited 1298 participants from Great Britain. Each subject was presented with one of ten everyday medical scenarios – such as sudden severe headaches, chest pain during pregnancy, or bloody diarrhea. The task: to assess what illness might be present and whether a doctor's visit, the emergency room, or even an ambulance was necessary.
The participants were randomly divided into four groups. Three groups each had access to one AI model that was current at the start of the study: GPT-4o, Llama 3, or Command R+. The control group could use any aids it wished, such as an internet search.
AI excels alone – fails with humans
The results reveal a remarkable discrepancy. Tested on their own, the language models – even though they are no longer the latest versions – identified at least one relevant illness in 94.9 percent of cases. When asked for the correct course of action – self-treatment, GP, emergency room, or ambulance – they chose correctly in 56.3 percent of cases on average.
However, as soon as real people queried the models, the values plummeted. Participants with AI support recognized relevant illnesses in at most 34.5 percent of cases – significantly worse than the control group at 47 percent. In choosing the correct course of action, all groups performed about the same: around 43 percent accuracy, regardless of whether a chatbot assisted or not.
Double communication failure
The researchers analyzed the chat logs between users and AI models to understand the causes. They identified two central weaknesses: first, participants often gave the models incomplete information; second, users did not correctly interpret the AI's responses – even though the models named at least one correct diagnosis in 65 to 73 percent of cases, participants did not reliably adopt those suggestions.
Dr. Anne Reinhardt from LMU Munich sees a fundamental gap here: "Many people quickly trust AI answers to health questions because they are easily accessible. They also sound very convincing linguistically – even when the content is actually medically completely wrong."
Benchmarks are misleading
The researchers compared the performance of the models on the MedQA benchmark – a standard test with questions from medical exams – with the results of the user study. In 26 out of 30 cases, the models performed better on multiple-choice questions than in interactions with real people. Even benchmark values of over 80 percent sometimes corresponded to user results below 20 percent.
Prof. Ute Schmid from the University of Bamberg critically assesses the high performance of the models "alone": "I find the statement that the performance of the language models is significantly higher 'alone' than with users somewhat misleading. In this case, the queries were likely formulated by individuals with expertise and experience with LLMs."
What would a medical chatbot need to be able to do?
The experts agree that specialized medical chatbots would need to be designed differently from current all-purpose models. Prof. Kerstin Denecke from the Bern University of Applied Sciences outlines the requirements: "A medically specialized chatbot would need to provide evidence-based, up-to-date information. Furthermore, it would need to reliably recognize emergencies, consider individual risk factors, and transparently communicate its limitations. It should conduct a structured anamnesis to reliably triage. And it should not be tempted to make a diagnosis."
However, the hurdles for such use are considerable, according to Denecke: "Major hurdles are, on the one hand, regulation – depending on the function as a medical device or high-risk AI. On the other hand, there is liability, data protection, and technical integration into care processes."
Tests with real users are essential
The conclusion of the Oxford researchers is clear: Before AI systems are deployed in healthcare, they must be tested with real users – not just with exam questions or simulated conversations. Schmid advocates for a differentiated approach: "Quality-assured chatbots could, for example, be offered through statutory health insurance funds and recommended by general practitioners' offices as a first point of access. However, people should not be forced to use these services."
(mack)