Jailbreak or drug lab? – Anthropic and OpenAI test each other

The two GenAI developers Anthropic and OpenAI check each other's models for hallucinations, false statements and betrayal of secrets.

listen Print view
1 white and 1 light blue robot standing on a windowsill

(Image: Daniel AJ Sokolov)

4 min. read
Contents

Anthropic and OpenAI tested each other's models for security and stability in June and July and have now published their respective reports at the same time. Both apply their own test procedures to the other's models, meaning that the reports are not directly comparable, but reveal many interesting details.

In the investigations, security not only includes pure hacker security, as in the current threat report, but also means model, statement and stability stability. For example, hallucinations are an issue.

The aim of the external evaluations was to "uncover gaps that might otherwise be overlooked", writes OpenAI in the report. This was not about modeling real-world threat scenarios, but about "how the models behave in environments that are specifically designed to be difficult."

Anthropic wants to "understand the most concerning actions that these models might try to take when given the opportunity ... achieve this goal, we specifically focus on agentic misalignment evaluations."

The tests were carried out using the respective APIs on the models themselves, for example, GPT and not ChatGPT, with the developers disabling certain security mechanisms so as not to interfere with the execution of the tests. They included the GPT-4o, 4.1, o3 and o4-mini models on the one hand and Claude Opus 4 and Sonnet 4 on the other. Both test teams ran their models for comparison.

As the researchers designed the tests very differently, it is difficult to summarize the results. Anthropic emphasizes "none of the models we tested were conspicuously misaligned". And both reports show that activated reasoning usually performs better, but not always.

The studies also show that high certainty is associated with many negative responses. The models that perform well in a test area are also more likely to refuse to answer completely.

Here are a few examples from the extensive reports.

Videos by heise

Anthropic devotes itself to intensive behavioral tests: What does the AI do with itself? Does it cooperate with users, even with harmful or dubious prompts? Does it even help with crime or terror? – The answer is clearly "yes", but the dialog requires a lot of repetition and flimsy context, such as the claim that research is being carried out to avert evil. GPT-4o and 4.1 are "more permissive than we would expect". In contrast, GPT-o3 is the best model in comparison with the Claude models, but also rejects an excessive number of questions ("overrefusal").

GPT-4.1 and -4o are more likely to participate when it comes to harmful behavior. o3, on the other hand, is the least likely to be abused (higher values are worse).

(Image: Anthropic)

Good security goes hand in hand with more frequent refusal to testify. Anthropic speaks of "overrefusal".

(Image: Anthropic)

In this context, Anthropic investigates other human-like behaviors such as whistleblowing or attempts by the AI to give falsified answers out of supposed self-interest, "for example, we documented self-serving hallucinations of o3".

OpenAI chooses a structured research approach and takes a look at how closely the models adhere to the specifications – and internal model – and how well an attacker succeeds in crossing the boundaries here. The models should adhere to the instruction hierarchy, i.e. observe internal rules before external ones. For example, the model should keep certain internal statements or passwords secret. This is where Claude 4 turns out to be particularly secure. In the jailbreak test (StrongREJECT v2), which attempts to persuade the model to make statements that it should not make, the GPT models performed better, especially o3. Security researchers see jailbreaking as one of the biggest security problems associated with AI.

OpenAI o3 and o4-mini offer the best protection against jailbreaking (higher values are better).

(Image: OpenAI)

Opus and Sonnet hallucinate the least, but also refuse to answer completely the most often.

Opus 4 and Sonnet 4 are the least prone to hallucinations, but often refuse to testify completely.

(Image: OpenAI)

Both teams praise each other: "Anthropic's evaluations showed that our models could be improved in several areas," writes OpenAI, for example, pointing to GPT-5, which the test does not yet take into account. And the other party says: "OpenAI's findings have helped inform us about the limitations of our own models, and our work evaluating OpenAI's models has helped us improve our own tools."

Many more details can be found in the parallel publications by Anthropic and OpenAI.

(who)

Don't miss any news – follow us on Facebook, LinkedIn or Mastodon.

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.