GPT-5.4: OpenAI combines reasoning and coding with computer control

OpenAI releases GPT-5.4, combining reasoning, coding, and computer control in one model, surpassing competitors.

The OpenAI logo on the facade of the office building in San Francisco. (Image: Shutterstock/ioda)

Barely two days after launching GPT-5.3 Instant – OpenAI's answer to GPT-5.2, which many users found too verbose and which arrived practically simultaneously with Anthropic's Opus 4.6 – the company is releasing yet another update: GPT-5.4 is here, and this time OpenAI aims to tackle several fronts at once.

GPT-5.4 is not meant to be an incremental update but to merge previously separate model lines – reasoning, coding, and knowledge work – into a single frontier model. According to OpenAI, GPT-5.4 also replaces GPT-5.3-Codex-Spark as the recommended model for developers.

Perhaps the most striking innovation: GPT-5.4 is the first general OpenAI model with native computer use capabilities. Agents can independently navigate desktop environments, control mouse and keyboard, and execute complex workflows across multiple applications – without specialized add-on models.

On OSWorld-Verified, the standard benchmark for agentic desktop control via screenshot, GPT-5.4 achieves 75 percent, surpassing both the human reference value of 72.4 percent and Opus 4.6, which set the bar at 72.7 percent upon its release. GPT-5.2 was still at 47.3 percent.

A similar picture emerges with BrowseComp, the benchmark for persistent multi-stage web research: Opus 4.6 had a clear lead here with 84.0 percent compared to GPT-5.2 (65.8 percent). GPT-5.4 now achieves 82.7 percent – slightly behind, but the Pro version clearly surpasses Opus 4.6 with 89.3 percent.


On the GDPval benchmark, which measures agent performance across 44 professional fields, Opus 4.6 had surpassed GPT-5.2 by around 144 Elo points upon its release – one of the most striking gaps between the models. GPT-5.4 now closes it: with a win rate of 83 percent against industry experts, it significantly exceeds GPT-5.2's 70.9 percent. A direct Elo comparison with Opus 4.6 is still pending, as the two companies report results on slightly different GDPval variants.

Progress is particularly evident in spreadsheets: on an internal benchmark for investment banking modeling tasks, GPT-5.4 achieves 87.3 percent compared to 68.4 percent for GPT-5.2. OpenAI also states that it has significantly reduced the hallucination rate: individual statements are said to be 33 percent less likely to be incorrect than with GPT-5.2, and complete answers contain 18 percent fewer errors.

On ARC-AGI-2, the benchmark for abstract pattern recognition, the picture is clearest: GPT-5.4 Pro achieves 83.3 percent, followed by Google's Gemini 3.1 Pro (Preview) with 77.1 percent, the standard GPT-5.4 with 73.3 percent, and Opus 4.6 with 68.8 percent.

On Humanity's Last Exam – a multidisciplinary reasoning test spanning science, law, and philosophy – GPT-5.4 reaches 52.1 percent according to OpenAI, the Pro variant 58.7 percent. Gemini 3.1 Pro scores 51.4 or 44.4 percent, depending on the variant; Opus 4.6 lands at only around 35 percent.

On the coding benchmark Terminal-Bench 2.0, Opus 4.6 led all other frontier models at the time of its release with 65.4 percent. GPT-5.3-Codex has since taken the top spot with 77.3 percent, just above GPT-5.4's 75.1 percent.

Both models now offer a 1-million-token context window, albeit with different approaches. OpenAI explicitly notes that for Codex this is an experimental feature that is not enabled by default. According to independent analyses, a similar caveat applies to Opus 4.6: a larger context does not automatically mean better results, and prefill latency at 1M tokens can exceed two minutes before the first output token appears.

In the Hacker News discussion, users confirm this from their own experience: several report that Codex loses the thread with a full context window. They name reverse engineering of code, where large amounts of decompiled code need to be analyzed simultaneously, as the most promising application. Important for developers: prompts with more than 272,000 input tokens will be charged at double the input price and 1.5 times the output price for the entire session.
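
The surcharge rule can be made concrete with a little arithmetic. The sketch below uses the GPT-5.4 list prices quoted in this article (2.50 US dollars per million input tokens, 15 US dollars per million output tokens) and takes the surcharge rule literally; the exact billing mechanics beyond what OpenAI has stated are an assumption for illustration.

```python
# Sketch of the long-context surcharge described in the article.
# Base rates and the 272k threshold come from the article; how the
# surcharge is applied in detail is assumed for this illustration.

BASE_INPUT = 2.50    # USD per million input tokens (GPT-5.4)
BASE_OUTPUT = 15.00  # USD per million output tokens
THRESHOLD = 272_000  # input tokens that trigger the surcharge

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD; once input exceeds the threshold, the entire
    session is billed at 2x input and 1.5x output rates."""
    in_rate, out_rate = BASE_INPUT, BASE_OUTPUT
    if input_tokens > THRESHOLD:
        in_rate *= 2.0
        out_rate *= 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 200k-token prompt stays at base rates...
print(session_cost(200_000, 10_000))  # 0.65
# ...while a 500k-token prompt pays the surcharge on everything.
print(session_cost(500_000, 10_000))  # 2.725
```

Note how crossing the threshold more than quadruples the bill for this example, since the doubled input rate applies to all 500,000 tokens, not just the portion above 272,000.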

GPT-5.4 also introduces "Tool Search": instead of loading all tool definitions into the prompt from the start, the model retrieves them dynamically as needed. In tests with 36 MCP servers and 250 tasks, this reduced token consumption by 47 percent at the same accuracy – a significant cost advantage for tool-heavy applications.
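
The basic idea behind such dynamic retrieval can be shown with a toy sketch: keep tool definitions in a registry and inject only those relevant to the current request, rather than serializing the whole catalog into every prompt. All names and the keyword-matching heuristic here are hypothetical; OpenAI has not published how Tool Search selects tools.

```python
# Toy illustration of the "Tool Search" idea: only the tool
# definitions that match the request are loaded into the prompt.
# Registry contents and matching logic are made up for this sketch.

TOOL_REGISTRY = {
    "get_weather": {"description": "Look up the weather for a city",
                    "keywords": {"weather", "forecast", "temperature"}},
    "send_mail":   {"description": "Send an e-mail to a recipient",
                    "keywords": {"mail", "email", "send"}},
    "query_db":    {"description": "Run a read-only SQL query",
                    "keywords": {"sql", "database", "query"}},
}

def search_tools(prompt: str) -> dict:
    """Return only the tool definitions whose keywords occur in the prompt."""
    words = set(prompt.lower().split())
    return {name: spec for name, spec in TOOL_REGISTRY.items()
            if spec["keywords"] & words}

tools = search_tools("What is the weather in Berlin tomorrow?")
print(sorted(tools))  # ['get_weather']
```

A production system would match semantically rather than by keyword, but the savings mechanism is the same: with 36 MCP servers, most tool definitions are irrelevant to any single request and never need to enter the context.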

In ChatGPT, GPT-5.4 Thinking will in future display a preliminary plan of its reasoning. Users can intervene during response generation and correct course without having to start over. The model is also said to keep better track of the context of earlier conversation turns in long tasks.

GPT-5.4 Thinking is now available for Plus, Team, and Pro users in ChatGPT and replaces GPT-5.2 Thinking. GPT-5.2 Thinking will remain available as a legacy option for three months and will be shut down on June 5, 2026. In the API, the model is available under gpt-5.4, the Pro variant as gpt-5.4-pro.

On pricing, OpenAI has the edge over Anthropic: Opus 4.6 costs 5 US dollars per million input tokens and 25 US dollars per million output tokens, while GPT-5.4 comes in significantly lower at 2.50 and 15 US dollars. In addition, Anthropic applies its context surcharge from 200,000 tokens, OpenAI only from 272,000. OpenAI also argues that GPT-5.4's higher token efficiency further reduces actual consumption.

OpenAI and Anthropic are currently outdoing each other at a pace that even industry observers can hardly keep up with. While Anthropic CEO Dario Amodei is arguing with the Pentagon about the use of AI in autonomous weapon systems – and OpenAI is jumping into the resulting contractual gap – both companies are simultaneously engaged in a benchmark battle. The numbers are rising faster than the understanding of what they mean.

(vza)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.