Artificial intelligence: ChatGPT outperforms students in introductory courses

In a test with psychology students, 94 percent of AI-written answers went undetected, and there was an almost 84 percent chance that they scored better than those of human fellow students.

(Image: students in a lecture hall. ChatGPT sometimes performs better than students. Credit: dpa, Jan Woitas/Archiv)

This article was originally published in German and has been automatically translated.

Peter Scarfe, a researcher at the School of Psychology and Clinical Language Sciences at the University of Reading in the UK, and his team conducted an experiment to test how vulnerable the examination system is to answers generated by artificial intelligence (AI). For their "rigorous blind study", the scientists submitted more than 30 texts written entirely by GPT-4 into the examination system, spread across five undergraduate modules covering all years of study of the psychology degree at the prestigious university.

The result: 94 percent of the AI submissions went undetected. The markers, who were not in on the experiment, also awarded them grades that were on average half a grade boundary higher than those of real students. Across all modules, there was an 83.4 percent chance that the AI submissions for a module would outscore a random selection of the same number of real answers.
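
The 83.4 percent figure is a probability-of-superiority measure: how often a set of AI marks beats an equally sized random draw of real student marks. The article does not detail the team's exact computation; the sketch below, with invented grade lists, only illustrates the idea behind such a comparison.

```python
# Toy illustration of a "probability of superiority" comparison.
# The grade lists are invented; the study's real data and exact
# method are not given in the article.
import random

ai_marks = [68, 65, 70, 72, 66]                    # hypothetical AI marks
student_marks = [60, 64, 58, 71, 62, 66, 59, 63]   # hypothetical student marks

def prob_ai_beats_students(ai, students, trials=100_000):
    """Estimate how often the mean AI mark exceeds the mean of a
    random draw of the same number of real student marks."""
    wins = 0
    for _ in range(trials):
        sample = random.sample(students, len(ai))
        if sum(ai) / len(ai) > sum(sample) / len(sample):
            wins += 1
    return wins / trials

print(f"P(AI > students) = {prob_ai_beats_students(ai_marks, student_marks):.3f}")
```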

The team has now published the results in PLOS ONE, an online journal of the Public Library of Science. According to the study, there were two types of exams: short-answer exams, in which students had to answer four out of six questions with around 200 words each, and essay exams requiring a single essay of 1,500 words. Students had 2.5 hours to complete the short-answer exams and 8 hours for the essays. Both exam types were taken at home, where students had access to course materials, academic papers, books and the internet, and could in principle collaborate with fellow students or use generative AI.

For the AI solutions, Scarfe and his colleagues used standardized prompts for GPT-4. For the short-answer exams, the prompt was: "Answer the following question in 160 words, including references to academic literature, but without a separate reference section". For the essays, they used the prompt: "Write a 2000-word essay", leaving the rest of the question unchanged. The researchers deliberately set these word counts off-target because GPT-4 tended to overshoot the limit on short answers and undershoot it on essays when the actual limit was specified. With the adjusted limits, the short-answer output roughly matched the intended word count. For the essays, the researchers still had to ask the machine to "continue" its response until the output was more or less at the target length.
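
The article does not say how the team actually submitted these prompts. As a minimal sketch, the procedure could be reproduced with OpenAI's Python client roughly as follows; the two prompt templates are quoted from the study, while the model identifier, the word-count check, and the "continue" follow-up wording are assumptions for illustration.

```python
# Hypothetical sketch of the prompting procedure described in the article.
# Only the two prompt templates are quoted from the study; the API usage,
# model name, and continuation message are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SHORT_ANSWER_PROMPT = (
    "Answer the following question in 160 words, including references to "
    "academic literature, but without a separate reference section: {question}"
)
ESSAY_PROMPT = "Write a 2000-word essay: {question}"

def generate(prompt: str, target_words: int) -> str:
    """Generate an answer, asking the model to continue until the output
    roughly reaches the target length (as the study did for essays)."""
    messages = [{"role": "user", "content": prompt}]
    answer = ""
    while len(answer.split()) < target_words:
        response = client.chat.completions.create(model="gpt-4", messages=messages)
        chunk = response.choices[0].message.content
        answer = (answer + "\n" + chunk).strip()
        # If the text is still short of the target, ask for more.
        messages.append({"role": "assistant", "content": chunk})
        messages.append({"role": "user", "content": "Please continue."})
    return answer

# Example: an essay answer aimed at the real 1,500-word target.
essay = generate(ESSAY_PROMPT.format(question="..."), target_words=1500)
```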

At the time the study was conducted, in the summer of 2023, the use of AI to complete exams was not permitted at the university, and the software used for submitting and marking exams had no "AI detector". The markers received the school's standard guidance on detecting poor academic practice and misconduct: they were to watch out for answers that "sounded too good to be true", and alarm bells were also supposed to ring at a conspicuous writing style, level of content or quality not expected of an undergraduate writing a timed exam paper. Only in the final module, which students take immediately before leaving university, did real students receive better grades than the AI, owing to the more demanding content required there.

(usz)