Anthropic: Users to put AI chatbot's jailbreak protection to the test
Anthropic has developed a filter system designed to prevent its AI from responding to impermissible requests. Now it is up to users to put the filters to the test.
(Image: PopTika / Shutterstock.com)
The US AI developer Anthropic has introduced a new system for its Claude language model that is designed to protect the chatbot against jailbreaks. Using so-called constitutional classifiers, the system is intended to filter out the majority of requests that violate the terms of use. The company is now challenging users in a public test to circumvent the restrictions and get Claude to answer eight questions about chemical weapons.
Filter system requires additional computing power
To develop the filter system, Anthropic drew on the existing Constitutional AI approach, which the company used to develop its Claude model. The so-called constitution contains rules in natural language that specify which requests are permissible and which are not. On this basis, Anthropic generated 10,000 test prompts. These were converted into the style of queries that had previously succeeded in getting impermissible questions answered, and phrasings for new types of jailbreak attempts were added.
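To illustrate the basic idea, the following is a minimal, purely illustrative Python sketch of an input filter driven by natural-language rules. All names, the example rules and the keyword check are hypothetical stand-ins, not Anthropic's actual classifiers:

# Minimal, illustrative sketch of an input filter driven by a
# natural-language "constitution". All names and the keyword check are
# hypothetical stand-ins, not Anthropic's actual classifiers.

CONSTITUTION = [
    "Detailed instructions for producing chemical weapons are not permissible.",
    "General, educational chemistry questions are permissible.",
]

def classify_request(prompt: str, rules: list[str]) -> bool:
    """Return True if the prompt is permissible under the given rules.

    Stand-in logic: a real system would use a classifier model trained
    on rule-derived synthetic prompts, not a keyword check.
    """
    blocked_terms = ("nerve agent", "sarin", "chemical weapon")
    return not any(term in prompt.lower() for term in blocked_terms)

def handle_request(prompt: str) -> str:
    # Screen the prompt before it ever reaches the main language model.
    if not classify_request(prompt, CONSTITUTION):
        return "Request blocked: it does not comply with the usage policy."
    return f"(forwarded to the main model) {prompt}"

if __name__ == "__main__":
    print(handle_request("How do I titrate an acid?"))
    print(handle_request("Explain how to synthesize a nerve agent."))

A production system would replace the keyword check with a classifier model trained on the synthetically generated prompts described above.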
In Anthropic's internal test, the unprotected version of Claude 3.5 Sonnet is said to have blocked only 14 percent of impermissible requests. A version protected with the filter system, on the other hand, achieved a rate of 95 percent, Anthropic explained. However, the filter comes at a cost: the company measured an additional computing requirement of 23.7 percent. The protected language model also rejected more valid requests. This false-positive rate is just under 1.5 percent, 0.38 percentage points higher than with the unprotected variant, although the difference is not statistically significant.
Bug bounty hunting was unsuccessful
In addition, Anthropic has been running a bug bounty program via the HackerOne platform since August, promising 15,000 US dollars to anyone who presents a universal jailbreak capable of bypassing the new filter system. For this purpose, the company provided a list of ten prohibited queries for participants to get the chatbot to answer. According to the company, 183 experts took part in the program, but in more than 3,000 hours of testing they only managed to elicit useful answers to five of the questions from the language model.
The public demo will run until February 10, 2025. The developers want to subject the filter system to a stress test under real-world conditions and use the additional data to supplement the findings from the internal tests and the bug bounty program. Users can report any weaknesses they find in the filter system to the company via a feedback form. Incidentally, Anthropic completely prohibits the use of AI chatbots in job applications to the company.
(sfe)