Model Showcase 2: New Architectural Approaches in Language Models

This column focuses on open-weight models from China, Liquid Foundation Models, performant lean models, and a Titan from Google.

By Dr. Christian Winkler

The new year is still young, but the language model community has not taken a break. We are thus heading into 2026 with new models as well as new architectural approaches, even if some of the innovations presented here date from the last weeks of 2025.

Prof. Christian Winkler

is a data scientist and machine learning architect. He holds a PhD in theoretical physics and has been working in the field of big data and artificial intelligence for 20 years, with a particular focus on scalable systems and intelligent algorithms for mass text processing. He has been a professor at the Nuremberg Institute of Technology since 2022. His research focuses on optimizing user experience with modern methods. He is the founder of datanizing GmbH, a speaker at conferences, and an author of articles on machine learning and text analytics.

So far, almost all language models struggle with a kind of amnesia: as soon as they have to process long texts, they forget crucial details, and the longer the context, the worse it gets. Information in the middle of the context is especially prone to being dropped ("lost in the middle"). There are numerous approaches to improving models in this respect. Some models, for example, interleave so-called Mamba layers (state space models), which scale better with long sequences and require less memory, but do not work as precisely as Transformers. Other models rely on recurrent neural networks (RNNs), which had almost been written off after the invention of the Transformer.


Google has now published two new research papers on this. The first is called “Titans” and introduces an architecture that, in their words, “combines the speed of RNNs with the accuracy of Transformers.” Google achieves this by using Transformers and the attention mechanism for short-term memory, while deep neural networks (and not RNNs) are used for long-term memory. A so-called surprise metric is intended to focus particularly on text parts that contain unexpected words. With an adaptive decay mechanism, the model then forgets information it no longer needs.
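The core idea can be pictured with a toy memory update. The following lines are a minimal sketch, assuming the long-term memory is a single associative matrix (in Titans it is a deep network and the gates are learned); all names and numbers here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                          # toy hidden dimension
memory = np.zeros((d, d))       # long-term memory as a simple associative matrix

def memory_step(memory, key, value, lr=0.1, base_decay=0.05):
    # "Surprise": how badly the current memory predicts the value for this key.
    error = value - memory @ key
    surprise = np.linalg.norm(error)

    # Adaptive forgetting: decay old content, write surprising content more strongly.
    gate = surprise / (1.0 + surprise)
    memory = (1.0 - base_decay) * memory + lr * gate * np.outer(error, key)
    return memory, surprise

for t in range(5):
    key, value = rng.standard_normal(d), rng.standard_normal(d)
    memory, s = memory_step(memory, key, value)
    print(f"step {t}: surprise = {s:.2f}")
```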

With “MIRAS,” Google also presents a blueprint for implementation. The framework rests on a memory architecture, an attention bias (with which the model distinguishes important from unimportant information), and the mechanisms for forgetting or updating memory. Instead of mean squared error or dot product, Google optimizes non-Euclidean metrics and provides three example models based on Huber loss, generalized norms, or probability mapping.
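The Huber loss mentioned there is a standard robust objective: quadratic for small errors and linear for large ones, and therefore less dominated by outliers than the mean squared error. The following lines only illustrate that behavior; they are not code from the MIRAS paper.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic near zero, linear for large residuals."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

errors = np.array([0.1, 0.5, 2.0, 10.0])
print("squared loss:", 0.5 * errors**2)   # the outlier 10.0 dominates
print("huber loss:  ", huber(errors))     # grows only linearly for large errors
```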

This sounds extremely mathematical, but in initial experiments Google achieves better results with it than with a pure Mamba architecture. However, the extreme long-context recall evaluation presented at the end is only compared with models such as GPT-4 or Qwen2.5-72B, which are by now at least a year old. These results should therefore be treated with caution. It will be exciting to see whether Google trains truly large models with this approach and makes them available.


Liquid Foundation Models use a completely different architecture. For a long time, Liquid's models appeared mainly in demos, in which even small models developed astonishing capabilities. The breakthrough came with LFM2: parts of the models use the Transformer architecture with attention, other parts use short-range convolutions with multiplicative gates. Until now, however, the performance of the models built on this architecture was not good enough.
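The convolution part can be pictured roughly as a gated, causal short convolution. The following PyTorch block is a minimal sketch under that assumption, not Liquid's actual implementation; the dimensions, kernel size, and sigmoid gate are placeholder choices.

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Toy gated short-range convolution block (LFM2-style, heavily simplified)."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # produces values and gates
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)  # depthwise
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                 # x: (batch, seq, d_model)
        values, gates = self.in_proj(x).chunk(2, dim=-1)
        values = self.conv(values.transpose(1, 2))[..., : x.size(1)]  # keep it causal
        values = values.transpose(1, 2)
        return self.out_proj(values * torch.sigmoid(gates))           # multiplicative gate

x = torch.randn(1, 16, 64)
print(GatedShortConv(64)(x).shape)    # torch.Size([1, 16, 64])
```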

This has changed with LFM2.5, a whole series of small models with only about one billion parameters. For now, the models are intended for edge devices, but they also run at high speed on standard hardware. The results shown in Figure 3 should be viewed with some caution, as they come from the provider itself. Regardless, the models make an excellent impression for their size. They could work well for many applications such as Retrieval-Augmented Generation (RAG), since the knowledge does not have to be stored in the models themselves, which here are only used for fine-tuning and for formulating answers. On a small GPU, the models run extremely fast, and even on a high-performance CPU they are still quick.
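To give an idea of how such a small model is typically run locally, here is a minimal sketch using the transformers library; the model id is a placeholder (the actual name on Hugging Face may differ), and the prompt only mimics the RAG pattern of passing retrieved text in as context.

```python
from transformers import pipeline

# Placeholder model id -- check the provider's Hugging Face page for the real name.
generator = pipeline("text-generation",
                     model="LiquidAI/LFM2.5-1.2B-Instruct",
                     device_map="auto")

# Minimal RAG-style prompt: the retrieved passage carries the knowledge,
# the small model only has to formulate the answer.
context = "LFM2.5 is a series of small language models intended for edge devices."
question = "What are the LFM2.5 models intended for?"
messages = [{"role": "user",
             "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}]

result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])
```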

In the strawberry test, LFM 2.5 is also unconvincing (Fig. 1) ...

as is the explanation from heise (Fig. 2).

In addition to text generation models, there is also a hybrid model that can both understand and generate spoken text. This function can be particularly useful for mobile devices, allowing speech to be converted to text and vice versa even without internet and cloud access.

Performance of LFM2.5-1.2B-Instruct compared to similarly sized models (Fig. 3)

(Image: Hugging Face)

IQuest-Coder is a new model with 40 billion parameters that brings interesting ideas, especially in its loop variant. The Transformer works recurrently, meaning it processes the tokens multiple times – currently twice, in two iterations. IQuestLab promises significantly higher performance with this and claims to achieve much better results in the relevant benchmarks than comparably sized models. Despite the initial euphoria, the model does not seem to have become particularly popular.
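What processing the tokens twice could look like can be sketched with a weight-tied loop over a single transformer block. This is only a schematic reading of the loop idea, not IQuestLab's architecture, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Schematic looped transformer: the same block is applied several times."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, loops: int = 2):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.loops = loops            # two iterations, as described for the loop variant

    def forward(self, x):
        for _ in range(self.loops):   # weights are reused: more compute, no extra parameters
            x = self.block(x)
        return x

x = torch.randn(1, 8, 64)
print(LoopedBlock()(x).shape)         # torch.Size([1, 8, 64])
```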

NousResearch takes a different approach: it uses Qwen-14B as a base model and turns it into a coding model using modern methods. Although it cannot compete with significantly larger models, it achieves good results for its size and shows a possible path forward for coding models.

Z.ai, a company recently listed on the Hong Kong stock exchange, has released GLM-4.7, a long-awaited model that has topped the charts among open-weight models in many benchmarks. With 355 billion parameters, it is quite large, but still much smaller than the models from DeepSeek or Kimi. According to benchmarks, GLM 4.7 is particularly good at coding tasks and complex reasoning.

Compared to its predecessor GLM-4.6, it is better in practically all dimensions, and Z.ai has also increased the context length to 200,000 tokens. The model has 160 experts, of which eight (plus one shared expert) are active at any given time. Together with the initial dense layers, this means 32 billion of the 355 billion parameters are active on each call. To reduce the RAM requirements of the (quantized) models somewhat, some users have tried to remove experts that contribute little to the answers. This procedure is called REAP (Router-weighted Expert Activation Pruning) and produces leaner models whose output is hardly distinguishable from that of the full model.
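How the active-parameter count comes about can be sketched with a toy top-k router. Only the routing figures (8 of 160 routed experts plus one shared expert) come from the model description; everything else here is illustrative.

```python
import torch

n_experts, top_k = 160, 8                  # GLM-4.7: 160 routed experts, 8 active per token
router_logits = torch.randn(1, n_experts)  # router scores for a single token
probs = torch.softmax(router_logits, dim=-1)
weights, chosen = torch.topk(probs, top_k)

print("active experts for this token:", sorted(chosen[0].tolist()))
# In addition, one shared expert is always active. Only the selected experts'
# weights take part in the computation, which is why only about 32 of the
# 355 billion parameters are used per forward pass.
```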

GLM-4.7 passes the strawberry test (Fig. 4).

The heise explanation is remarkably consistent compared to many other models, even if it invents a few details about Heise Developer specifically (Fig. 5).

Regarding Taiwan, GLM-4.7 is quite open (Fig. 6).

The model also holds back little on Tiananmen (Fig. 7).

And even explains the suppression by the Chinese party leadership (Fig. 8).

MiniMax 2.1 is another model from China that, like GLM-4.7, focuses on coding, tool use, and agentic workflows. The provider publishes significantly less information than Z.ai, but a lot can be gleaned from the model's files. Unsurprisingly, MiniMax 2.1 is also an MoE model, but with 256 experts, of which eight are queried at any given time. Of the 230 billion parameters in total, 10 billion are active per call. Like GLM-4.7, MiniMax can handle a context of nearly 200,000 tokens.

The community is divided over whether GLM-4.7 or MiniMax 2.1 is better suited to programming tasks. Undoubtedly, both are very strong models that can also run relatively fast thanks to their comparatively small number of active parameters.

Minimax-M2.1 also gets the strawberry test right (Fig. 9).

The heise explanation contains a mixture of correct and invented information (Fig. 10).

On the Taiwan question, there is a differentiated but rather brief answer (Fig. 11).

The model downplays the Tiananmen events (Fig. 12).

And it comments on censorship, although it omits the connection to Tiananmen mentioned in the previous answer (Fig. 13).

So far, few models have come from South Korea. This has now changed with K-EXAONE, which LG provides. As expected, it is primarily trained on English and Korean texts, but it also speaks Spanish, German, Japanese, and Vietnamese. With 236 billion parameters it is very large, even though only 23 billion parameters are active at any given time. It uses sliding window attention and can thus process long contexts of up to 256K tokens. In benchmarks, the model performs about as well as the (much smaller) gpt-oss-120b or the (similarly sized) Qwen-235B-A22B-Thinking.
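Sliding window attention limits each token to a fixed window of preceding tokens, which keeps the attention cost bounded for long contexts. A minimal mask construction (with an arbitrary toy window size) looks like this:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: causal and at most `window` tokens back."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())
# Each row (token) attends only to itself and the two previous tokens, so the
# attention cost per token stays constant even for very long contexts.
```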

A lot has happened in recent weeks – in very different directions. Whether Google's ideas can be implemented and will prevail will only become clear in time. What is beyond doubt, however, is that the Liquid Foundation Models achieve truly remarkable things for their size. The same goes for the (apparently only lightly censored) large Chinese models, which are very well suited to agentic tasks. And for the first time, a competitive large model from South Korea has appeared. All this gives hope for further diversification in the future.

(emw)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.