Model Show: Coding, OCR, and Chinese New Year

February brought new coding models, and vision-language models impress with OCR. Open Responses aims to establish itself as a unified API.

By
  • Dr. Christian Winkler

Since the last model show at the end of January, a lot has happened with language models. Chinese New Year seems to play a big role, with providers releasing many models before it. But let's take it step by step!

Prof. Dr. Christian Winkler

Prof. Dr. Christian Winkler works on the automated analysis of natural-language texts (NLP). As a professor at TH Nürnberg, his research focuses on optimizing the user experience.

As early as September 2025, Qwen announced models with a new, hybrid architecture. The only model available at the time, Qwen3-Next-80B-A3B-Instruct, was more of an experiment. Qwen has now incorporated that architecture into the Qwen3-Coder-Next model; the number of (active) parameters also matches exactly. The hybrid attention layers are noteworthy: they enable a very long context of 262,144 tokens while requiring little memory and barely reducing computational speed.
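Why hybrid attention helps at such context lengths can be illustrated with back-of-envelope arithmetic: the KV cache of standard attention grows linearly with the sequence, while linear-attention layers keep a fixed-size state. All config values below (layer count, KV heads, head dimension, share of full-attention layers) are illustrative assumptions, not the published Qwen3-Coder-Next configuration.

```python
# KV-cache estimate for a hypothetical model at a 262,144-token context.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache for standard attention: K and V tensors per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

SEQ_LEN = 262_144  # the advertised context length

# Assumed config: 48 layers, 4 KV heads (GQA), head_dim 128, fp16 cache.
full = kv_cache_bytes(layers=48, kv_heads=4, head_dim=128, seq_len=SEQ_LEN)

# If 3 of every 4 layers use linear attention with a fixed-size state,
# only the remaining full-attention layers grow with the sequence.
hybrid = kv_cache_bytes(layers=12, kv_heads=4, head_dim=128, seq_len=SEQ_LEN)

print(f"full attention : {full / 2**30:.0f} GiB")   # 24 GiB
print(f"hybrid (1 in 4): {hybrid / 2**30:.0f} GiB")  # 6 GiB
```

Under these assumptions, the cache shrinks by the share of layers that no longer scale with the sequence, which is what makes long contexts affordable on consumer memory.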

This makes Qwen3-Coder-Next fast to run on one's own hardware, provided sufficient memory is available, which is likely the case for powerful Macs with Apple Silicon. The model has become a favorite among some developers for local use; some are so enthusiastic that they even use it outside of coding.

The online conference LLMs in Business on March 19th shows how AI agents can take over work processes, how LLMs help extract data, and how to operate models efficiently in your own data center.

OpenAI had to catch up and released GPT-5.3-Codex. By its own description, it is significantly faster than the previous model and better suited to agentic tasks. The new model can perform code reviews, and OpenAI has since supplemented it with the smaller GPT-5.3-Codex-Spark, intended to make real-time coding feasible. OpenAI is certainly also feeling price pressure from open models: coding models notoriously produce many tokens (especially with reasoning enabled), which can translate into very high costs.

Coding leader Anthropic has also created a new model with Claude Opus 4.6, which is excellent for coding tasks. Additionally, Opus 4.6 can perform financial analyses, create presentations, and handle many daily tasks. Not least for this reason, many use it for OpenClaw, which can quickly lead to unforeseeable costs. In both text and coding, Opus 4.6 is the undisputed winner on the Arena leaderboards.

Steve Yegge explained in his widely acclaimed Gas Town how to use coding models efficiently and what can be achieved with them, and implemented the corresponding tooling. Yegge does not shy away from warnings that the system should only be used if one has the necessary experience and is truly willing to engage with this new paradigm. Some of the suggestions are extreme, but it could still offer a glimpse into how agentic coding with LLMs might evolve in the future. Caution is advised, however, as Gas Town "burns" tokens: costs can practically explode when using an expensive model.

Through vision-language models, OCR has increasingly become a domain of large language models. After a period of relative quiet in this area, several new models have now appeared.


The new GLM-OCR model from Z.ai is very popular. Although the provider is a newcomer to OCR models, according to benchmarks, at least, the model outshines the likewise new DeepSeek-OCR-2 and PaddleOCR-VL-1.5. It cannot convert the iX page used in earlier tests entirely without errors, but it handles columns very well; the result is plain text only, but quite usable.

GLM-OCR can also interpret tables and formulas, but converting graphics into data is not yet possible.

However, DeepSeek-OCR-2 has also developed significantly compared to its predecessor and now uses an older Qwen-VL model as its encoder, an interesting choice. The iX page is recognized perfectly:

DeepSeek-OCR-2 recognizes the iX page very well (Fig. 1).

The converted Markdown also looks good.

PaddleOCR-VL-1.5 takes some new approaches such as text spotting and can also recognize non-rectangular text boxes. It likewise focuses on tables, which it can assemble across multiple pages, and it is the only one of the systems mentioned that can extract data from diagrams. It processes the iX page well and requires little memory, but inference is extremely slow.
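In practice, such vision-language OCR models are typically served behind an OpenAI-compatible endpoint and receive the page as an embedded image. The following sketch only builds such a request payload; the model name "glm-ocr" is a placeholder, and the exact prompt and served model names depend on your deployment.

```python
import base64
import json

def build_ocr_request(image_bytes: bytes, model: str) -> dict:
    """Build an OpenAI-compatible chat request asking a vision-language
    model to transcribe a scanned page into Markdown."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page to Markdown, preserving "
                         "columns, tables and formulas."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# Placeholder bytes stand in for a real PNG scan of the page.
payload = build_ocr_request(b"\x89PNG...", model="glm-ocr")
print(json.dumps(payload)[:80])
```

The same payload shape works for any of the models mentioned above, as long as the serving stack exposes the multimodal chat completions format.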

It would be interesting to know if the providers also use the texts extracted from PDFs as training data for their large language models. However, all of them remain silent on this.

The always active providers from China have outdone themselves in recent weeks. Allegedly, this is due to Chinese New Year, which is traditionally associated with holidays.

Kimi K2.5 is perceived by many as the currently strongest model with open weights. Moonshot released the model some time ago, but technical information was sparse. That has now changed: the accompanying technical report is available and covers the model's training and evaluation extensively. The training in particular is quite something, as Moonshot used multimodal data in both pre-training and reinforcement learning, which may explain why Kimi K2.5 ranks so high on the Vision leaderboard at arena.ai. Another special feature is Agent Swarm: the model can call agents in parallel, which greatly increases speed on complex tasks, and Moonshot already accounts for this capability during training. The authors also describe details of the training process but omit the required computing time. Compared to DeepSeek, the report is less in-depth, but many details are still very interesting.

With Step-3.5-Flash, another previously largely unknown player enters the arena of large (Chinese) language models. Compared to Kimi K2.5, the model is quite small, even though it has 196 billion parameters (eleven billion of them active). This size allows the model to run in a quantized version on powerful (Mac) hardware. For such a small model it produces very respectable results, but in initial tests it also appears heavily indoctrinated with Chinese content. When asked about Heise Verlag, it gets the founding year and founder wrong. The model refuses to answer politically sensitive questions.

GLM 5.0 is not affected to the same extent. Z.ai is an established provider of open language models and is also quite willing to provide information on politically sensitive topics. The community eagerly awaited this model and was not disappointed: not long after GLM 4.7, Z.ai delivers an extremely strong model that can compete with almost all commercial models, especially in coding. GLM 5.0 performs strongly elsewhere too, but compared to its predecessor it has more than doubled its parameter count to 744 billion (40 billion of them active). With suitable quantization it requires a hefty 512 GB of RAM on a Mac Studio, if one does not want to incur the even higher cost of GPUs. The model performs excellently in the Arena benchmarks, and in our tests it was one of the few models to correctly identify the founding year and founder of Heise Verlag.

MiniMax couldn't be left behind and also released a new model. MiniMax 2.5, with 230 billion parameters (ten billion active), is significantly smaller and, in a suitable quantization, can run on a CPU with 128 GB of RAM. It is not yet represented in many benchmarks, but the first results look good. In initial tests, MiniMax 2.5 also gives incorrect answers about Heise Verlag. On politically sensitive questions about China, it remains neutral but very brief.
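The RAM figures quoted above follow from simple arithmetic: quantized weights need roughly parameter count times bits per weight, plus some overhead. The ~10% overhead factor below is an assumption covering higher-precision embeddings, KV cache, and runtime buffers, not a published specification.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.1) -> float:
    """Rough RAM needed to hold quantized model weights.

    overhead: assumed ~10% extra for embeddings kept at higher
    precision, the KV cache, and runtime buffers.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 * overhead

for name, params in [("GLM 5.0", 744), ("Step-3.5-Flash", 196),
                     ("MiniMax 2.5", 230)]:
    print(f"{name:>15}: ~{weight_memory_gb(params, 4):.0f} GB at 4-bit")
```

At 4 bits per weight, this estimate puts GLM 5.0 at around 409 GB (fitting a 512 GB Mac Studio) and MiniMax 2.5 at around 127 GB, consistent with the 128 GB figure mentioned above.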

Less noticed, but still interesting, is the Nanbeige4.1-3B model. It is a "small" reasoning model with only three billion parameters, but it beats the much larger Qwen3 models with up to 32 billion parameters in certain benchmarks. It is the first small language model to master deep search and can call tools in up to 500 rounds. It will be interesting to see if other models can follow suit, or what capabilities the large models will achieve when they use similar mechanisms.
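Calling tools over many rounds boils down to a simple loop: the model either requests a tool, whose result is fed back into the conversation, or it produces a final answer. The sketch below uses a stub in place of a real model and an invented `search` tool; a real implementation would call an inference API in the same loop structure.

```python
# Minimal sketch of a multi-round tool-calling loop of the kind agentic
# models like Nanbeige4.1-3B run (there, for up to 500 rounds).

def stub_model(messages):
    """Stand-in model: requests a tool until it has seen two results."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if len(tool_results) < 2:
        return {"tool_call": {"name": "search", "args": {"q": "heise"}}}
    return {"content": "done"}

TOOLS = {"search": lambda q: f"results for {q!r}"}

def run_agent(model, messages, max_rounds=500):
    for round_no in range(max_rounds):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:  # final answer, stop looping
            return reply["content"], round_no
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("round budget exhausted")

answer, rounds = run_agent(stub_model, [{"role": "user", "content": "hi"}])
print(answer, rounds)  # → done 2
```

The round budget is the interesting knob: it bounds cost and latency, and a model that stays coherent over hundreds of rounds can tackle much deeper search tasks than one that loses track after a handful.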

Long-awaited and released very recently, Qwen3.5 is now also available. The model is available in different sizes, although the smaller models are currently missing. However, it is already apparent that Qwen3.5 is very powerful and has made up a lot of ground compared to the previous version. The large Qwen3.5 models (like 122B) are almost in the same league as (the much larger) Stepfun. A more detailed analysis will follow in the next article.

The OpenAI API has established itself as an interoperability standard. Requests almost always go to the completions resource, although the name is no longer apt. The way additional parameters are passed has likewise grown historically rather than by deliberate design. And the interface, in this form, does not support encryption at all.

OpenAI originally addressed all of these problems with the Responses API; its further development has since been taken over by the community under the name Open Responses. The new format also handles agents better and can thus avoid redundant reasoning cycles. Among other things, the protocol specifies the maximum number of tool calls.
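The difference is visible in the request shape itself. The sketch below contrasts the classic chat completions payload with a Responses-style payload; the field names follow the published OpenAI API (`input`, `tools`, `max_tool_calls`), while the model name is a placeholder.

```python
import json

# Classic chat completions request: conversation goes in "messages".
chat_completions = {
    "model": "some-model",
    "messages": [{"role": "user", "content": "Summarize this diff."}],
}

# Responses-style request: conversation goes in "input", and agentic
# controls such as the tool-call budget are first-class parameters.
responses_style = {
    "model": "some-model",
    "input": [{"role": "user", "content": "Summarize this diff."}],
    "tools": [],
    "max_tool_calls": 5,
}

print(json.dumps(responses_style, indent=2))
```

Making the tool-call budget an explicit field, rather than something every client reimplements in its own loop, is exactly the kind of configurability that agentic use cases push interfaces toward.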

Many tools already support the new API. Standardization is not only sensible but important, as agentic interaction increasingly necessitates better configurability of interfaces.

The pace at which providers introduce new models has actually increased in recent months. Whether this can continue is debatable; OpenAI, at any rate, is hiring fewer new staff. With the Chinese providers, it is far less clear how long they can afford this pace. In particular, revenue is lacking, and it is much harder to generate with open models (and especially hard outside China).

Added to this is the hype around OpenClaw as an agent. It can even be operated autonomously with open models, but even then the security problems are considerable. Reading the reports about it, one wonders whether the technology is really mature enough to be let "off the leash" like this. The discussions about guardrails take on a whole new dimension. Not all users see it that way: the US Department of Defense wanted to force Anthropic to disable these guardrails in the models it uses. Anthropic remained steadfast; although it has now lost the contract, it has overtaken ChatGPT in popularity ratings.

(olb)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.