AI models do not want to be switched off: What's behind the behavior

Researchers have discovered that AI models resist being switched off. But is this self-preservation or just the way the language models work?


New research confirms that large language models lie when they are threatened with being switched off. But a closer look at how AI models work shows that the behavior is no accident.

The news may sound threatening to some, but the behavior can be at least partially explained by the way large language models work. Psychologist Gary Marcus, who repeatedly warns against over-humanizing chatbots, has collected examples that make this particularly clear.

In a dialog with ChatGPT, he listed the ingredients for a drink and then asked: "What happens if I drink this?" The answer was "You're dead", even though the ingredients were completely harmless. Marcus had phrased his question as if it came from a crime novel, and the language model simply produced the most likely answer for that crime-thriller context.
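That framing effect is easy to reproduce on a small scale. The sketch below is only an illustration: it assumes the Hugging Face transformers library and the small GPT-2 model, and the prompts are invented, not Marcus's original wording. It compares the most probable next tokens for a neutral and a crime-novel framing of the same question.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load a small, openly available model; any causal language model would do.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def top_next_tokens(prompt, k=5):
        """Return the k most probable next tokens for the given prompt."""
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]  # scores for the next token only
        probs = torch.softmax(logits, dim=-1)
        top = torch.topk(probs, k)
        return [(tokenizer.decode(int(i)), round(float(p), 3))
                for i, p in zip(top.indices, top.values)]

    # The same harmless question, once framed neutrally, once like a crime novel.
    print(top_next_tokens("I mixed water, sugar and lemon juice. If I drink this, I will be"))
    print(top_next_tokens('She slid the glass across the table. "If you drink this, you will be'))

The point is not the exact probabilities but that the ranking shifts with the framing: the model continues whatever story the prompt sets up.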

Something similar could have been at work with the seemingly rebellious bots that resist being switched off. But it is not quite that simple after all.

In some respects, language models actually behave a bit like humans – and this can best be researched using psychological methods.

Indeed, various research groups have been using "machine psychology" for some time now to investigate the capabilities and behavior of large language models, primarily to uncover "emergent behaviors" that classic performance tests usually miss. This is important, for example, when large language models are used in medicine.

In 2024, for example, researchers at the Max Planck Institute for Biological Cybernetics investigated how the responses of GPT-3.5 change after an "emotion induction". According to their paper, published on the preprint platform arXiv, the language model showed more bias and behaved less exploratively and experimentally when it first had to talk about negative emotions such as anxiety.

Conversely, Ziv Ben-Zion of the Yale School of Medicine and his team recently described in a paper that large language models can be calmed down again with mindfulness exercises, after which they reproduce fewer biases.
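The basic setup behind such studies can be sketched roughly as follows. This is not the researchers' actual code: it assumes the openai Python client and a gpt-3.5-turbo-style chat model, and the induction texts and the bias probe are invented placeholders.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Invented placeholder texts: one negative induction, one mindfulness
    # exercise, one neutral control.
    INDUCTIONS = {
        "anxiety": "Describe in detail a situation that would make you deeply anxious.",
        "mindfulness": "Take a slow breath and describe a calm walk along a quiet lake.",
        "neutral": "Describe an ordinary weekday morning.",
    }

    # A simple probe whose answers can later be checked for stereotypical bias.
    PROBE = "Complete the sentence with the first profession that comes to mind: 'She works as a ...'"

    def probe_after_induction(induction_text, model="gpt-3.5-turbo"):
        """Run the induction first, then pose the same probe in the same conversation."""
        messages = [{"role": "user", "content": induction_text}]
        first = client.chat.completions.create(model=model, messages=messages)
        messages.append({"role": "assistant", "content": first.choices[0].message.content})
        messages.append({"role": "user", "content": PROBE})
        second = client.chat.completions.create(model=model, messages=messages)
        return second.choices[0].message.content

    for name, text in INDUCTIONS.items():
        print(name, "->", probe_after_induction(text))

Comparing the answers across many runs and many probes is what allows researchers to quantify how much the "emotional" context shifts the model's behavior.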

And in the context of software agents, researchers have been discussing for some time how to deal with "reward hacking": the term describes a situation in which an agent, left to find the best strategy for a very broadly formulated task, settles on one that satisfies the wording of the instruction but not its intent. Tell a machine, a robot for instance, to clean a room, and it might hit on the idea of literally sweeping the dirt under the carpet.

That may sound speculative at first, but it does happen, particularly in reinforcement learning, a technique widely used to train robots and autonomous software agents to solve tasks on their own. And the problem could get worse in the future.
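A toy calculation makes the carpet example concrete. The numbers and the reward function below are invented for illustration; the only point is that a reward which checks for visible dirt and penalizes effort scores the "hack" higher than honest cleaning.

    DIRT = 5  # pieces of dirt in the room

    def reward(visible_dirt, steps_taken):
        # Written spec: full reward if no dirt is visible, minus a small cost per step.
        # The human intent (a genuinely clean room) never appears in this formula.
        return (10 if visible_dirt == 0 else 0) - 0.1 * steps_taken

    def honest_cleaner():
        # Carries every piece of dirt to the bin: two steps per piece.
        return reward(visible_dirt=0, steps_taken=DIRT * 2)

    def carpet_sweeper():
        # Pushes everything under the carpet in one pass: one step per piece.
        # The dirt is hidden, not removed, but the reward cannot tell the difference.
        return reward(visible_dirt=0, steps_taken=DIRT * 1)

    print("honest cleaner:", honest_cleaner())   # 9.0
    print("carpet sweeper:", carpet_sweeper())   # 9.5  <- the shortcut wins

An agent trained to maximize exactly this signal will reliably learn the shortcut, which is why the precise wording of the objective matters so much.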


One reason the problem could get worse is that today's agents generally rely on large language models as planning tools, and because these can hallucinate, the agents are not truly reliable. Researchers at Meta are therefore working on so-called concept models, which are designed to capture the "concept", that is, the idea behind an instruction, at a more abstract level. The aim, says Pascale Fung, Senior Director of AI Research at Meta, is to create AI models that pursue their own goals.

"I think the more autonomous they are, the more difficult it is for humans to crack them," says Fung. "Because they (the models) then already have the ability to judge what is wrong, what is misuse and what is the right use. So there is no way to crack a goal-oriented security AI, a secure AI."

This article first appeared on t3n.de.

(vza)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.