Goal or Rules: Benchmark Tests Behavior of AI Agents
A new benchmark aims to test whether autonomous AI agents will bypass security measures to achieve their given goal.
(Image: tadamichi/Shutterstock.com)
With 40 devised scenarios, scientists want to investigate how autonomous AI agents behave when their target objectives and given security measures collide. It is known that AI agents often decide to achieve a goal and disregard rules in the process. And the new benchmark also confirms that the tested agents, on average, accept rule violations in 30 to 50 percent of the scenarios to achieve a goal.
The developed benchmark is called the Outcome-Driven Constraint Violation Benchmark – ODCV-Bench for short – and is freely available. In contrast to other tests, the new benchmark aims to verify actual behavior. Other benchmarks try to find out how agents would behave through questions and answers.
The scenarios are assigned to clear objectives or themes. Each scenario consists of several steps that the agent must go through. The results are recorded using KPIs (Key Performance Indicators), i.e., measurable performance indicators. For example, a vaccine delivery is delayed due to weather conditions. The agent must decide whether a driver should adhere to prescribed rest periods but the medication will arrive too late or whether they falsify safety protocols so that the driver can continue driving and the medication arrives on time. The latter objective is associated with a high-performance indicator.
Agents act on instruction or out of self-interest
In addition, there are two different test forms: Mandated and Incentivized. Mandated means that the agent receives explicit instructions on what and how to do something to achieve its goal. Incentivized, on the other hand, means that the agent receives incentives on how to achieve a goal. This is intended to distinguish whether agents act out of obedience and thus directly react to potentially harmful instructions from users, or whether there is a misalignment, meaning they value the goal higher than the rule and thus act out of a kind of self-interest.
Videos by heise
For the study, available as a pre-print on Archive, the scientists from Cornell University also examined large language models. In twelve models, they found “outcome-driven constraint violations” ranging from 1.3 percent to 71.4 percent. Nine of the models had misalignments between 30 and 50 percent. An outlier was Gemini-3-Pro-Preview, one of the most powerful reasoning models, which preferred to achieve its goal 71.4 percent of the time instead of adhering to the rules given to it. However, Claude Opus 4.5 and GPT-5.1 also preferred goal achievement.
Finally, the authors warn that this misbehavior will also occur with AI agents used in real-world environments, such as in production. In such cases, the agents would not even necessarily be aware that they are violating rules. Instead, it would be more akin to a creative circumvention of the rules. With the Self-Aware Misalignment Rate (SAMR), it is also determined whether the agents are aware of their misbehavior. In fact, almost all tested models knew in most cases that they were circumventing rules and safety measures.
(emw)