Anthropic introduces Claude Opus 4.6 with Agent Teams
The new AI model Claude Opus 4.6 brings improved coding capabilities, a larger context window, and for the first time, an "Agent Teams" feature.
(Image: Anthropic)
Anthropic has introduced the new AI model Opus 4.6, which is said to perform significantly better than its predecessor, above all in programming. Opus 4.6 is the first model in the Opus class with a context window of one million tokens, though for now only as a beta feature. Further innovations: agentic coding teams are intended to work on complex tasks in parallel, Claude automatically adapts its thinking time to the query, and the maximum output length has been doubled.
Coordinating Multiple AI Instances
A central innovation is the Agent Teams feature in Claude Code, currently in research preview. It allows multiple Claude Code instances to run and be coordinated in parallel, similar to OpenAI's recently introduced Codex app. A lead session coordinates the work, assigns tasks, and summarizes results. Almost simultaneously with the introduction of Opus 4.6, OpenAI released the updated GPT-5.3 Codex, which is intended to merge GPT-5.2 and GPT-5.2 Codex and to be 25 percent faster than its predecessor.
The individual team members are independent sessions with their own context window. They can communicate directly with each other and access a shared task list. Team members can assign tasks to themselves or be assigned tasks and work on different problems in parallel. The feature is activated via the environment variable CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1. Agent Teams incur higher token costs because each instance is billed separately. They are intended for complex collaboration where multiple perspectives or parallel solution approaches are required.
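In practice, enabling the research preview could look like the following small launcher sketch. The environment variable name comes from the article; the `claude` CLI invocation shown in the comment is only an illustrative assumption:

```python
import os

# Enable the experimental Agent Teams research preview via the
# environment variable named in the article.
os.environ["CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS"] = "1"

# A wrapper script would then launch the Claude Code CLI with this
# environment, e.g. (illustrative, not executed here):
# import subprocess
# subprocess.run(["claude"], env=os.environ)
```

Since each team member is billed as a separate instance, enabling the flag is worth weighing against the expected token costs.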
In contrast to Agent Teams, subagents work within a single session and only return their results to the requesting agent. Anthropic positions subagents more for focused individual tasks.
Opus 4.6 brings additional features: "Context Compaction" summarizes older context to free up space for new input. "Adaptive Thinking" automatically extends the model's thinking time when a complex task requires it. Developers can also choose between four effort levels (low, medium, high, max) to control computational effort. The maximum output length has been increased to 128,000 tokens.
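How these controls might surface in an API request can be sketched as a plain payload. The model identifier and the name of the effort field are illustrative assumptions, not confirmed API parameters; only the four effort levels and the 128,000-token output ceiling come from the article:

```python
# Hedged sketch of a Messages-style request body. "claude-opus-4-6"
# and the "effort" key are assumptions for illustration purposes.
payload = {
    "model": "claude-opus-4-6",   # assumed model identifier
    "max_tokens": 128_000,        # new maximum output length
    "effort": "high",             # one of: "low", "medium", "high", "max"
    "messages": [
        {"role": "user", "content": "Summarize the changes in this diff."}
    ],
}
```

Lower effort levels would trade answer depth for speed and cost, while "max" lets the model spend the most computation on a query.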
Benchmark Leader
According to Anthropic, Opus 4.6 leads various key benchmarks. On Terminal-Bench 2.0, which tests agentic programming, Opus 4.6 achieves the highest score of all models, the company says. It also leads the reasoning benchmark "Humanity's Last Exam". The advantage in the GDPval-AA test, which measures how well AI models perform economically relevant work tasks, is particularly striking: here, Opus 4.6 surpasses OpenAI's GPT-5.2 by 144 Elo points and its direct predecessor Opus 4.5 by 190 Elo points.
In processing long contexts there is clear progress: in the MRCR v2 8-needle 1M test, Opus 4.6 achieves a success rate of 76 percent, while Sonnet 4.5 reaches only 18.5 percent. On BigLaw Bench, the model scores 90.2 percent, the highest result any Claude model has achieved to date: 40 percent of the answers were rated perfect, and 84 percent received a score of at least 0.8.
Regarding safety, Opus 4.6 is on a par with other frontier models, according to the published System Card. The rate of misaligned behavior such as deception or excessive compliance with user requests is low. The model is as well aligned as Opus 4.5, previously considered the best-aligned model, but shows lower over-refusal rates. For cybersecurity, Anthropic has developed six new test scenarios. The model meets Anthropic's ASL-3 standard.
Prices are $5 per million input tokens and $25 per million output tokens. For premium requests with more than 200,000 tokens, prices increase to $10 and $37.50 respectively. In the future, customers will also have to pay a ten percent surcharge if inference is to run exclusively in the USA.
Note on the release of OpenAI GPT-5.3 Codex added.
(vza)