Chatbot study: GPT-4 has passed the Turing test, according to researchers

Participants in an online simulation rated a bot based on the GPT-4 AI model as human in 54 percent of cases. GPT-3.5 came in at 50 percent.

Save to Pocket listen Print view

(Image: incrediblephoto / Shutterstock.com)

4 min. read
This article was originally published in German and has been automatically translated.

More and more people can no longer distinguish GPT-4 from a human in a Turing test. This is the result of a study by cognitive scientists Benjamin Bergen and Cameron Jones from the University of California in San Diego.

In an online simulation on the website turingtest.live, 54% of the almost 500 participants rated the artificial intelligence (AI) model behind ChatGPT as human. The previous version, GPT-3.5, achieved a success rate of 50 percent. Sixty-seven percent of the test subjects correctly confirmed that they were human. The computer program Eliza, developed in 1966 by computer scientist Joseph Weizenbaum, performed the worst, with only 22 percent classifying it as human.

According to the scientists, the results provide the first solid empirical evidence that an AI system has passed an interactive Turing test with two players, and that this is likely to apply to almost any state-of-the-art chatbot in the future. The British mathematician and computer scientist Alan Turing first introduced the conversational test of strength named after him in 1950 as an "imitation game".

Since then, it has become a famous but controversial benchmark for determining a machine's ability to imitate human conversation. In modern variants of the test, a human "judge" speaks to either another human or a chatbot without knowing which is which. If the interviewer cannot reliably distinguish the computer system from a human in 50 percent of cases, the machine is deemed to have passed the test.

For the current study, the results of which can be found on the preprint server Arxiv, human interrogators interacted with various "witnesses" representing either other humans, the large language models GPT-4 and GPT-3.5 and the rudimentary dialog system Eliza based on the Symmetric List Processor (SLIP). The researchers randomly assigned the human participants to the roles of judge or witness. The sample consisted of 101 human and GPT-3.5 games each, as well as 100 GPT 4 and Eliza rounds each, all of which ran via a type of messaging app and lasted five minutes each. Of the 402 interrogators, 168 were female, 212 were male and 9 were non-binary. 13 did not want to provide any information. The average age of the participants was 35.

In a previous study in 2023, Eliza performed significantly better and none of the models passed the test. This time, the scientists took the GPT-4 command set (prompt) with the best performance and modified it. The first part of the prompt instructed the model to behave as if it were a young person not taking the game too seriously, using colloquial language and making spelling and grammatical errors. In addition, it was not supposed to come up with too much knowledge, especially about languages and mathematics.

At the beginning of each game, several additional pieces of information were added, such as name, location and timestamp. The duo explained the fact that humans did not recognize conspecifics 100 percent by the fact that the questioners considered the AI to be increasingly efficient. This increases the likelihood that they will incorrectly identify humans as machines.

In order to gain qualitative insights into the factors that influenced the interrogators' decisions, the team classified the strategies they used and the reasons they gave for their judgment. 36 percent asked the witnesses about personal details or their daily activities. The second and third most common categories were social and emotional (25 percent) - for example, questions about opinions, experiences and humor.

The most common reasons given by interrogators for their decisions (43 percent) related to language style based on spelling, grammar, capitalization and tone of voice. 24 percent focused on socio-emotional factors such as sense of humor or personality. The researchers warn that the results indicate "that deception by current AI systems may go undetected". Bots that can successfully imitate humans are likely to have "far-reaching economic and social consequences".

(ll)