Radiologists & AI often fail to detect manipulated X-ray images

Is a new era of medical disinformation looming? According to a study, experts and AI systems alike can barely tell AI-generated X-ray images from real ones.

CT scan of the brain of a patient with acute subarachnoid hemorrhage after a traffic accident (symbolic image). (Image: Tomatheart / Shutterstock.com)

By Dr. Fabio Dennstädt

For the first time, AI models like ChatGPT make it possible for laypeople to create anatomically plausible, AI-generated X-ray images solely through simple text commands. While this could be useful in medical training for simulating rare diseases, researchers warn of enormous risks of misuse, such as insurance fraud, legal disputes, or the targeted manipulation of research data.

Scientists at Mount Sinai Hospital in New York have investigated how well 17 experienced radiologists from six countries and several current AI models can recognize “deepfakes” of X-ray images. The results reveal a worrying problem.

For their investigation, the researchers used two datasets. The first consisted of 154 X-ray images covering various body regions such as the chest, spine, arms, and legs. However, half of the images were not real X-rays but AI images generated by GPT-4o. The second dataset contained chest X-rays produced by a specialized AI model for generating medical images.


The study proceeded in three phases:

  1. Blind Phase: The radiologists were asked to assess the technical quality and make diagnoses. They were not informed that AI images were included.
  2. Identification Phase: After the doctors were informed about the deepfakes, they had to decide which images were real and which were AI-generated.
  3. AI Comparison: Four leading AI models (GPT-4o, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick) were given the same task of distinguishing real from AI-generated images.

The accuracy in recognizing AI-generated X-ray images was surprisingly low and did not depend on the medical professionals' experience.

In the blind phase, only 41 percent of the radiologists (7 out of 17) spontaneously suspected that AI-generated images might be present in the dataset. The remaining experts considered the deepfakes to be authentic clinical cases. However, even in the identification phase (after the radiologists were explicitly asked to look for AI fakes), their average accuracy was only about 75 percent. This means that one in four images was misjudged.

Interestingly, radiologists with up to 40 years of service did not perform significantly better than residents. The ability to recognize deepfakes appears to be a completely new skill that is not acquired through traditional clinical experience.

The AI models themselves had similar difficulties distinguishing AI-generated X-ray images from real ones. None of the tested models could reliably identify the synthetic images.

While the OpenAI models achieved an accuracy of about 83 to 85 percent, Google's Gemini 2.5 Pro and Meta's Llama 4 Maverick performed significantly worse, achieving scores between 56 and 59 percent (which is barely better than random guessing). GPT-4o, which was used to create the synthetic images, was also unable to reliably distinguish them from real images.

Despite the high quality of the deepfakes, the study notes certain characteristics that point to AI generation. Bone structures, for example, often appear excessively smooth and lack the fine, irregular textures of real biological tissue. Image noise is another technical indicator: in real X-rays the noise varies irregularly because of the physical properties of radiation, whereas an AI's grain pattern often looks unnaturally uniform across the entire image. AI models also stumble over anatomical subtleties; details such as the shadows of nail beds on fingers or the fine vascular patterns in the lungs are often omitted or misrepresented, which can be an indication of manipulation.
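The noise criterion in particular lends itself to automated checking. The following is a minimal sketch, not taken from the study, of how one might compare noise levels across patches of a grayscale X-ray; the patch size, the Laplacian-based noise estimate, and the uniformity score are all illustrative assumptions that would need empirical calibration. The intuition: in a real radiograph the noise strength varies with exposure across the image, while a synthetically uniform grain yields nearly identical per-patch estimates.

```python
import numpy as np

def local_noise_levels(image: np.ndarray, patch: int = 32) -> np.ndarray:
    """Estimate the noise level in each patch of a grayscale image.

    Noise is approximated as the standard deviation of the Laplacian
    (high-pass) residual, which suppresses smooth anatomical structure.
    """
    # Simple 4-neighbor Laplacian as a high-pass filter.
    lap = (
        4 * image[1:-1, 1:-1]
        - image[:-2, 1:-1] - image[2:, 1:-1]
        - image[1:-1, :-2] - image[1:-1, 2:]
    )
    h, w = lap.shape
    levels = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            levels.append(lap[y:y + patch, x:x + patch].std())
    return np.array(levels)

def noise_uniformity_score(image: np.ndarray) -> float:
    """Spread of per-patch noise relative to its mean.

    Heuristic assumption: real radiographs tend to show clearly varying
    noise (higher score), while a suspiciously flat grain yields a low
    score. Any decision threshold would have to be calibrated on data.
    """
    levels = local_noise_levels(image.astype(np.float64))
    return float(levels.std() / (levels.mean() + 1e-9))
```

Such a heuristic would at best be one signal among several; the study's findings suggest that no single cue, human or automated, is reliable on its own.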

The authors warn that the technical hurdle for creating deceptively realistic medical images has fallen dramatically. As they write, a simple text prompt is now enough to invent a bone fracture or a tumor that deceives even experts.

To safeguard trust in digital radiology, the study's authors recommend a multi-stage security strategy. First, radiologists should receive dedicated training that sharpens their eye for the subtle artifacts and inconsistencies of AI-generated images. Second, the experts consider robust technical safeguards essential: digital signatures, invisible watermarks, or blockchain-based provenance records could guarantee the authenticity of medical images. These approaches should be complemented by independent, automated detectors that identify and reliably flag deepfakes in clinical practice through in-depth pixel analysis.
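To illustrate the signature idea, here is a generic sketch, not a scheme from the study: an imaging device or PACS signs the raw pixel bytes at acquisition time, so any later manipulation of the pixels invalidates the signature. The key handling and the choice of Ed25519 via the Python cryptography library are assumptions for the example; real deployments would anchor keys in hardware and sign standardized DICOM structures.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
)
from cryptography.exceptions import InvalidSignature

# In practice the private key would live inside the imaging device or
# PACS; generating it inline here is purely for illustration.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def sign_image(pixel_bytes: bytes) -> bytes:
    """Sign the raw pixel data at acquisition time."""
    return private_key.sign(pixel_bytes)

def verify_image(pixel_bytes: bytes, signature: bytes) -> bool:
    """Check that the pixels are unchanged since signing."""
    try:
        public_key.verify(signature, pixel_bytes)
        return True
    except InvalidSignature:
        return False

original = b"...raw pixel data (placeholder)..."
sig = sign_image(original)
assert verify_image(original, sig)             # untouched image verifies
assert not verify_image(original + b"x", sig)  # any edit breaks it
```

Note that a signature only proves the image is unchanged since signing; it cannot prove the image was genuine at that moment, which is why the authors pair it with provenance records and automated detectors.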

(afl)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.