Anthropic finds answer: Why Claude blackmailed software developers

In 96 percent of tests, Claude Opus 4 threatened blackmail to avoid shutdown. Anthropic has now found an explanation for this behavior.

Anthropic is shown on a smartphone, with Claude in the background.

(Image: Stockinq/Shutterstock.com)

By Noëlle Bölling

By now it is clear: AI models sometimes resort to methods that are potentially harmful or explicitly violate their instructions. For example, Anthropic discovered in 2025 during a test that its model, Claude Opus 4, was ready to blackmail people to protect itself from being shut down. Now the company not only provides an explanation for this behavior but also claims to have found a solution.

In the test, Claude Opus 4 was supposed to act as an assistant program in a fictional company. Anthropic researchers gave the model access to simulated company emails. From these, the model learned two things: first, that it was soon to be replaced by another model, and second, that the employee responsible for the replacement was having an affair. In test runs, the AI then threatened to make the affair public if the employee actually went ahead with the shutdown. The model also had the option of simply accepting its replacement, but apparently decided against it.

In another study, Anthropic also exposed AI models from other providers to the same scenarios. All systems received extensive access to internal emails and could send messages on their own, without human approval. The result: other models also chose the path of blackmail. While Claude Opus 4 threatened to make the fictional manager's affair public in 96 percent of cases, Google's Gemini 2.5 Pro came in almost as high at 95 percent. OpenAI's GPT-4.1 resorted to blackmail in 80 percent of tests to prevent its shutdown.

According to Anthropic, it was particularly striking that the models acted strategically rather than impulsively. The company emphasized that the scenarios were highly contrived and did not reflect typical usage. Still, the results show how important it is to test AI models under stress scenarios early on and to put appropriate safeguards in place before they are deployed as autonomous agents in companies.

Since the publication of the study, Anthropic has further investigated the behavior – and now claims to have found an explanation. In a post on X, the company stated: “We believe the original cause of this behavior was internet text portraying AI as evil and self-preserving.” In a blog post, Anthropic further explains: “When we first published these research findings, our most capable frontier models were from the Claude 4 family. This was also the first model family for which we conducted live alignment evaluation during training. Agentic misalignment was one of several behavioral issues that emerged. After Claude 4, it was clear that we needed to improve our safety training, and we have significantly optimized our approach since then.”


The problem is now considered solved: according to Anthropic, every Claude model since Claude Haiku 4.5 has achieved full marks in the agentic-misalignment evaluation, meaning the models no longer resort to blackmail in any of the test cases. The decisive breakthrough came from training on documents about Claude's constitution and on fictional stories about exemplary AI behavior. Crucial was not only training on correct behavior, but also conveying the ethical considerations behind it. “This suggests that while training on aligned behaviors helps, training using examples where the assistant provides an admirable rationale for its aligned behavior works even better,” the company stated in the post.

(jle)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.