Medicine: Leading LLMs clearly outperform specialized small language models

Leading large language models achieve better results on medical tests than specialized small models, a study shows.

Jun 16, 2026 at 9:11 pm CEST

By Dr. Fabio Dennstädt

A recent study in Nature Medicine compared specialized clinical AI systems (OpenEvidence and UpToDate Expert AI) with large language models (LLMs) from leading AI companies (OpenAI, Google, and Anthropic). In the various tests within the study, these general LLMs outperformed the specialized medical AI systems.

Many doctors use specialized AI applications for medical questions and research. Providers promise that their systems have been specifically optimized through domain-specific training data or Retrieval-Augmented Generation (RAG) and are therefore ideal for use in medicine.

A research team from New York (NYU Langone Health) has now compared two specialized medical AI systems with general-purpose LLMs from leading AI companies in a study published in the journal Nature Medicine. The result is clear: in all three tested areas, the LLMs from OpenAI, Google, and Anthropic outperformed the specialized clinical AI systems.

Comparison across three different medical tests

The two clinical AI tools examined, OpenEvidence and UpToDate Expert AI, both target medical professionals and are intended to answer specialist questions. They were compared with the leading LLMs GPT-5.2 (OpenAI), Gemini 3.1 Pro Preview (Google), and Claude Opus 4.6 (Anthropic). In one part of the study, Google Search AI Overview was also included as a realistic point of comparison, since this function is available to doctors at all times in everyday practice.

The study design consisted of three parts. In the first part, the systems answered 500 medical questions in the style of the US medical licensing exam (MedQA benchmark). The second part involved 500 tasks from HealthBench, a benchmark that evaluates medical answers against physician-defined criteria. For the third, particularly practice-oriented part, the researchers developed a "Real-Clinical-Queries-Benchmark (RCQ)". It used 100 anonymized queries that doctors had actually posed to a GPT instance at NYU Langone Health in their daily practice. Twelve US physicians evaluated the answers to these real clinical questions in a blinded, randomized assessment, rating clinical correctness, completeness, safety, and understandability on a scale of 1 to 4. In total, this resulted in 1,800 model-question evaluations.
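The total of 1,800 evaluations can be checked with a short calculation. The sketch below assumes that all six systems with a reported RCQ score (the three general LLMs, the two clinical tools, and Google AI Overview) were rated on all 100 queries. The article does not say how many physicians rated each answer, so the resulting figure is an inference and not a number reported by the study. All variable names are illustrative.

```python
# Figures taken from the article.
RCQ_QUERIES = 100          # anonymized real clinical queries
TOTAL_EVALUATIONS = 1800   # reported model-question evaluations

# Assumption: six systems were rated in the RCQ part
# (three general LLMs, two clinical tools, Google AI Overview).
RCQ_SYSTEMS = 6

# Number of distinct (system, query) answers that needed a rating.
pairs = RCQ_QUERIES * RCQ_SYSTEMS

# Implied number of independent ratings per answer (an inference).
ratings_per_pair, remainder = divmod(TOTAL_EVALUATIONS, pairs)

print(f"{pairs} answers, {ratings_per_pair} ratings each, remainder {remainder}")
```

Under these assumptions the numbers divide evenly, which would be consistent with each answer having been rated by three of the twelve physicians.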

General LLMs with better results in medical knowledge

In the classic medical knowledge benchmark MedQA, Gemini led with an accuracy of 97.4 percent, while GPT-5.2 achieved 94.2 percent and Claude 90.2 percent. The two specialized clinical systems reached only 89.6 percent (OpenEvidence) and 88.4 percent (UpToDate Expert AI).

The general LLMs also came out ahead in the HealthBench test. GPT-5.2 achieved 88.0 out of 100 possible points, while Gemini scored 79.3 points and Claude 77.0 points. OpenEvidence and UpToDate Expert AI were well behind with 62.6 and 61.3 points.

The general LLMs also gave better answers to the real, anonymized queries from doctors in the RCQ benchmark. They achieved an average of 3.62 (Gemini), 3.54 (GPT-5.2), and 3.52 (Claude) points on the four-point rating scale, while OpenEvidence scored 3.24 points and UpToDate Expert AI scored 3.17 points. Google AI Overview, the AI-generated answer feature in Google Search, was at about the same level as the medical systems with 3.27 points.
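To make the size of the gap easier to see, the reported scores can be collected and averaged per group. This is only an illustrative sketch: the scores are those quoted above, but the grouping into "general" and "specialized" and the simple unweighted mean are our own choices and not an analysis from the study. Google AI Overview is listed for completeness but belongs to neither group.

```python
# Scores as reported in the article, keyed by benchmark and system.
SCORES = {
    "MedQA (accuracy, %)": {
        "Gemini": 97.4, "GPT-5.2": 94.2, "Claude": 90.2,
        "OpenEvidence": 89.6, "UpToDate Expert AI": 88.4,
    },
    "HealthBench (points out of 100)": {
        "Gemini": 79.3, "GPT-5.2": 88.0, "Claude": 77.0,
        "OpenEvidence": 62.6, "UpToDate Expert AI": 61.3,
    },
    "RCQ (mean rating, 1 to 4)": {
        "Gemini": 3.62, "GPT-5.2": 3.54, "Claude": 3.52,
        "OpenEvidence": 3.24, "UpToDate Expert AI": 3.17,
        "Google AI Overview": 3.27,
    },
}

# Illustrative grouping (our labels, not the study's).
GENERAL = ("Gemini", "GPT-5.2", "Claude")
SPECIALIZED = ("OpenEvidence", "UpToDate Expert AI")


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def group_gap(benchmark):
    """Return (general mean, specialized mean, difference) for one benchmark."""
    scores = SCORES[benchmark]
    general = mean(scores[name] for name in GENERAL)
    specialized = mean(scores[name] for name in SPECIALIZED)
    return general, specialized, general - specialized


for benchmark in SCORES:
    general, specialized, gap = group_gap(benchmark)
    print(f"{benchmark}: general {general:.3f}, "
          f"specialized {specialized:.3f}, gap {gap:.3f}")
```

On these figures, the average lead of the general LLMs is just under 5 percentage points on MedQA, about 19.5 points on HealthBench, and about 0.355 rating points on the RCQ scale. The averages hide differences within each group, so they should be read as a rough summary only.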

The results contradict the obvious expectation that medically optimized AI performs better on medical questions than the more general systems from leading tech companies. The authors suspect that the more extensive training data and faster development cycles of the leading general-purpose LLMs may carry more weight in many tasks than subsequent specialization on medical data.

Problems with completeness, structure, and omissions

In the physicians' assessment of the answers, no statistically significant differences in safety were found between the systems. However, this does not mean that the answers from the specialized systems were equally good. In their free-text comments, the medical reviewers noted incomplete clinical content and safety-relevant omissions especially often for OpenEvidence and Google AI Overview. OpenEvidence also stood out for comparatively unclear or difficult-to-follow answers.

UpToDate Expert AI also refused to answer significantly more often than the other systems. In the RCQ test, it refused 19 percent of queries, compared with only one to three percent for the general LLMs.

Why specialization doesn't automatically help

The scientists emphasize that due to the proprietary architecture of the systems, they cannot definitively explain why the clinical systems performed worse. One possible explanation is that the significantly larger, general LLMs benefit from their size and broad knowledge, especially in tasks that combine medical knowledge, reasoning, and understandable communication. The study should not be considered a definitive ranking of all approaches. The authors explicitly point out that highly specialized subfields, complex local workflows, or institution-specific models could yield different results.

Implications for clinics and regulation

The results are relevant for hospitals and practices because specialized clinical AI products typically carry institutional credibility. However, the study indicates that an AI system is not automatically better just because it was developed specifically for medicine. At least in the tasks examined, the general models from OpenAI, Google, and Anthropic were clearly superior to the clinical AI systems.

This has important consequences for the procurement, reimbursement, and regulation of health AI. The decisive factor should be how well a system performs in independent tests and on realistic tasks, not whether it is marketed as a specialized clinical product. The authors therefore recommend stricter, independent evaluations before AI systems are broadly integrated into clinical workflows.