GPT-5.4: OpenAI combines reasoning and coding with computer control

OpenAI releases GPT-5.4, combining reasoning, coding, and computer control in one model, surpassing competitors.

The OpenAI logo on the facade of the office building in San Francisco. (Image: Shutterstock/ioda)

Barely two days after launching GPT-5.3 Instant – OpenAI's answer to GPT-5.2, which many users found too verbose and which arrived practically simultaneously with Anthropic's Opus 4.6 – the company is releasing yet another update: GPT-5.4 is here, and this time OpenAI aims to tackle several fronts at once.

GPT-5.4 is not meant to be an incremental update but to merge previously separate model lines – reasoning, coding, and knowledge work – into a single frontier model. According to OpenAI, GPT-5.4 also replaces GPT-5.3-Codex-Spark as the recommended model for developers.

Perhaps the most striking innovation: GPT-5.4 is the first general OpenAI model with native computer use capabilities. Agents can independently navigate desktop environments, control mouse and keyboard, and execute complex workflows across multiple applications – without specialized add-on models.

On OSWorld-Verified, the standard benchmark for agentic desktop control via screenshot, GPT-5.4 achieves 75 percent, surpassing both the human reference value of 72.4 percent and Opus 4.6, which set the bar at 72.7 percent upon its release. GPT-5.2 was still at 47.3 percent.

A similar picture emerges with BrowseComp, the benchmark for persistent multi-stage web research: Opus 4.6 had a clear lead here with 84.0 percent compared to GPT-5.2 (65.8 percent). GPT-5.4 now achieves 82.7 percent – slightly behind, but the Pro version clearly surpasses Opus 4.6 with 89.3 percent.


On the GDPval benchmark, which measures agent performance across 44 professional fields, Opus 4.6 had surpassed GPT-5.2 by around 144 Elo points upon its release – one of the most striking gaps between the models. GPT-5.4 now closes it: with a win rate of 83 percent against industry experts, it significantly exceeds GPT-5.2's 70.9 percent. A direct Elo comparison with Opus 4.6 is still pending, as the two companies report results on slightly different GDPval variants.

Progress is particularly evident in spreadsheets: on an internal benchmark for investment banking modeling tasks, GPT-5.4 achieves 87.3 percent compared to 68.4 percent for GPT-5.2. OpenAI also states that it has significantly reduced the hallucination rate: individual statements are said to be 33 percent less likely to be incorrect than with GPT-5.2, and complete answers contain 18 percent fewer errors.

On ARC-AGI-2, the benchmark for abstract pattern recognition, the picture is clearest: GPT-5.4 Pro achieves 83.3 percent, followed by Google's Gemini 3.1 Pro (Preview) with 77.1 percent, the standard GPT-5.4 with 73.3 percent, and Opus 4.6 with 68.8 percent.

On Humanity's Last Exam – a multidisciplinary reasoning test spanning science, law, and philosophy – GPT-5.4 reaches 52.1 percent according to OpenAI, the Pro variant 58.7 percent. Gemini 3.1 Pro scores 51.4 or 44.4 percent, depending on the variant; Opus 4.6 lands at only around 35 percent.

On the coding benchmark Terminal-Bench 2.0, Opus 4.6 led all other frontier models at the time of its release with 65.4 percent. GPT-5.3-Codex has since taken the top spot with 77.3 percent, just above GPT-5.4's 75.1 percent.

Both models now offer a 1-million-token context window, albeit with different approaches. OpenAI explicitly notes that for Codex this is an experimental feature that is not enabled by default. According to independent analyses, a similar caveat applies to Opus 4.6: a larger context does not automatically mean better results, and prefill latency at 1M tokens can exceed two minutes before the first output token appears.

In the Hacker News discussion, users confirm this from their own experience: several report that Codex loses the thread with a full context window. They name reverse engineering of code, where large amounts of decompiled code need to be analyzed simultaneously, as the most promising application. Important for developers: prompts with more than 272,000 input tokens will be charged at double the input price and 1.5 times the output price for the entire session.
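
The surcharge rule can be made concrete with a little arithmetic. The sketch below uses the GPT-5.4 list prices quoted in this article (2.50 US dollars per million input tokens, 15 US dollars per million output tokens) and takes the surcharge rule literally; the exact billing mechanics beyond what OpenAI has stated are an assumption for illustration.

```python
# Sketch of the long-context surcharge described in the article.
# Base rates and the 272k threshold come from the article; how the
# surcharge is applied in detail is assumed for this illustration.

BASE_INPUT = 2.50    # USD per million input tokens (GPT-5.4)
BASE_OUTPUT = 15.00  # USD per million output tokens
THRESHOLD = 272_000  # input tokens that trigger the surcharge

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD; once input exceeds the threshold, the entire
    session is billed at 2x input and 1.5x output rates."""
    in_rate, out_rate = BASE_INPUT, BASE_OUTPUT
    if input_tokens > THRESHOLD:
        in_rate *= 2.0
        out_rate *= 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 200k-token prompt stays at base rates...
print(session_cost(200_000, 10_000))  # 0.65
# ...while a 500k-token prompt pays the surcharge on everything.
print(session_cost(500_000, 10_000))  # 2.725
```

Note how crossing the threshold more than quadruples the bill for this example, since the doubled input rate applies to all 500,000 tokens, not just the portion above 272,000.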

GPT-5.4 also introduces "Tool Search": instead of loading all tool definitions into the prompt from the start, the model retrieves them dynamically as needed. In tests with 36 MCP servers and 250 tasks, this reduced token consumption by 47 percent at the same accuracy – a significant cost advantage for tool-heavy applications.
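
The basic idea behind such dynamic retrieval can be shown with a toy sketch: keep tool definitions in a registry and inject only those relevant to the current request, rather than serializing the whole catalog into every prompt. All names and the keyword-matching heuristic here are hypothetical; OpenAI has not published how Tool Search selects tools.

```python
# Toy illustration of the "Tool Search" idea: only the tool
# definitions that match the request are loaded into the prompt.
# Registry contents and matching logic are made up for this sketch.

TOOL_REGISTRY = {
    "get_weather": {"description": "Look up the weather for a city",
                    "keywords": {"weather", "forecast", "temperature"}},
    "send_mail":   {"description": "Send an e-mail to a recipient",
                    "keywords": {"mail", "email", "send"}},
    "query_db":    {"description": "Run a read-only SQL query",
                    "keywords": {"sql", "database", "query"}},
}

def search_tools(prompt: str) -> dict:
    """Return only the tool definitions whose keywords occur in the prompt."""
    words = set(prompt.lower().split())
    return {name: spec for name, spec in TOOL_REGISTRY.items()
            if spec["keywords"] & words}

tools = search_tools("What is the weather in Berlin tomorrow?")
print(sorted(tools))  # ['get_weather']
```

A production system would match semantically rather than by keyword, but the savings mechanism is the same: with 36 MCP servers, most tool definitions are irrelevant to any single request and never need to enter the context.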

In ChatGPT, GPT-5.4 Thinking will in future display a preliminary plan of its reasoning. Users can intervene during response generation and correct course without having to start over. The model is also said to keep better track of the context of earlier conversation turns in long tasks.

GPT-5.4 Thinking is now available for Plus, Team, and Pro users in ChatGPT and replaces GPT-5.2 Thinking. GPT-5.2 Thinking will remain available as a legacy option for three months and will be shut down on June 5, 2026. In the API, the model is available under gpt-5.4, the Pro variant as gpt-5.4-pro.

On pricing, OpenAI has the edge over Anthropic: Opus 4.6 costs 5 US dollars per million input tokens and 25 US dollars per million output tokens, while GPT-5.4 comes in significantly lower at 2.50 and 15 US dollars. In addition, Anthropic applies its context surcharge from 200,000 tokens, OpenAI only from 272,000. OpenAI also argues that GPT-5.4's higher token efficiency further reduces actual consumption.

OpenAI and Anthropic are currently outdoing each other at a pace that even industry observers can hardly keep up with. While Anthropic CEO Dario Amodei is arguing with the Pentagon about the use of AI in autonomous weapon systems – and OpenAI is jumping into the resulting contractual gap – both companies are simultaneously engaged in a benchmark battle. The numbers are rising faster than the understanding of what they mean.

(vza)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.