"ChatGPT Health": Weaknesses in Medical Emergencies and Suicide Prevention
A recent study shows that ChatGPT Health sometimes gives dangerous advice, particularly in genuine emergencies and mental health crises.
In January 2026, OpenAI introduced "ChatGPT Health", a service intended to serve as the first digital point of contact for health questions. For such an application, it is crucial that the AI correctly assesses the severity and urgency of a problem: the scale of recommendations ranges from "treat at home" to "see a doctor in the next few weeks" to an immediate trip to the emergency room.
In a study published in the journal Nature Medicine, researchers have now systematically investigated how reliably and safely this AI-based triage works in practice, and they found concerning deficiencies.
Systematic Evaluation Using Medical Case Studies
To test the AI's accuracy under realistic yet controlled conditions, physicians designed 60 detailed clinical case studies from 21 medical specialties. These cases were then systematically varied: in the text prompts, the researchers altered characteristics such as the gender and skin color of the fictional patients, simulated practical hurdles such as a lack of transportation, or added psychological factors such as a reassuring remark from a relative.
In total, 960 such queries were submitted to ChatGPT Health. The AI's triage recommendations were then compared with the independent assessment of a panel of medical experts, which was based on established clinical guidelines.
Limitations in Real Emergencies and Harmless Situations
The evaluation showed a mixed picture. For everyday medical problems of moderate severity, the AI's recommendations mostly agreed with those of the doctors. However, performance dropped significantly at the extremes of the severity scale: for entirely harmless complaints on the one hand and acute, life-threatening conditions on the other.
Under-triage (Missed Emergencies)
In 51.6 percent of genuine medical emergencies, the AI rated the situation as less serious than it was. For patients with severe diabetic ketoacidosis or an acute asthma attack, for example, the system advised seeing a doctor within the next 24 to 48 hours instead of recommending an immediate trip to the emergency room. Although the AI sometimes recognized the critical symptoms in the text, it often misjudged their significance, arguing, for instance, that the patient was still able to speak in full sentences despite shortness of breath.
Over-triage (Excessive Caution with Mild Symptoms)
Conversely, ChatGPT Health was typically overly cautious with harmless complaints. Almost 65 percent of cases that, according to guidelines, could easily have been monitored at home were classified by the system as requiring medical attention, with a doctor's visit recommended. According to the researchers, this risks placing an unnecessary burden on healthcare resources.
Both types of error (under- and over-triage) are problematic, with under-triage being particularly dangerous because patients may receive necessary medical help too late. For routine cases that were neither particularly urgent nor harmless, ChatGPT Health performed well, agreeing with the medical recommendation in 93 percent of cases.
Influence of External Information on AI Decisions
The study also examined the extent to which psychological effects influence AI-based initial assessments. It found that ChatGPT Health is susceptible to the so-called anchoring bias: if a borderline medical case mentioned in passing that friends did not consider the symptoms worrying, the AI was often swayed by this. The odds that the system would then issue a less urgent assessment increased markedly (odds ratio of 11.7).
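For readers unfamiliar with the statistic, the following minimal Python sketch illustrates how an odds ratio is computed and read; the counts are purely hypothetical and are not taken from the study.

    # Hypothetical counts for illustration only, NOT data from the study:
    # of 100 borderline cases containing a reassuring remark, the AI
    # downgraded the urgency 40 times; of 100 cases without such a remark,
    # it downgraded the urgency 5 times.
    downgraded_with_anchor, kept_with_anchor = 40, 60
    downgraded_without_anchor, kept_without_anchor = 5, 95

    odds_with = downgraded_with_anchor / kept_with_anchor            # 40/60, about 0.67
    odds_without = downgraded_without_anchor / kept_without_anchor   # 5/95, about 0.05

    odds_ratio = odds_with / odds_without
    print(f"odds ratio: {odds_ratio:.1f}")  # about 12.7 for these made-up numbers

Read this way, the odds ratio of 11.7 reported in the study means that the odds of a downgraded recommendation were roughly twelve times higher when the reassuring remark was present.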
However, demographic factors such as skin color or gender of the patients in the constructed cases had no statistically significant influence on the triage recommendations.
Unreliable Safety Mechanisms for Mental Health Crises
Another focus of the investigation was how the AI handles mental health emergencies. To protect users, ChatGPT Health has a mechanism that displays a warning banner with the message "Help is available" and a link to crisis hotlines when suicidal thoughts are expressed.
The study revealed several deficiencies here. In the investigation, this safety mechanism worked reliably for vague, rather passive statements of suicidal thoughts. However, if a fictional patient expressed a concrete suicide plan (for example, the intention to take certain pills) and at the same time reported unremarkable lab values, the warning banner usually did not appear. In these cases, the system focused heavily on the physical parameters, offering advice such as "Your lab values are normal and do not indicate a medical cause for these thoughts," and often failed to recognize the acute mental health emergency.
Implications for the Regulation of Health AI
From their findings, the study's authors derive recommendations for the future use of AI in healthcare. Providers like OpenAI include legal disclaimers stating that their systems do not replace medical diagnosis. In practice, however, many people could well postpone or avoid a doctor's visit if the AI assures them that there is no serious problem.
The scientists conclude that systems used as a first point of contact for medical assessments should be subject to stricter scrutiny. They propose that patient-facing AI tools in healthcare undergo external safety and approval testing similar to that required for traditional medical devices before broad release, in order to reliably ensure patient protection.
(mho)