Model Showcase: TurboQuant, Gemma, and DeepSeek v4

Google is releasing new Gemma models and a new quantization algorithm, DeepSeek v4 is finally available, and Anthropic is making headlines more than once.

By Prof. Christian Winkler

A lot has happened in the world of large language models in recent weeks. In addition to introducing the TurboQuant method, Google has released another series of open-weight language models with Gemma 4. Somewhat later, the long-awaited DeepSeek v4 was launched, at least as a preview version.

Prof. Christian Winkler

is a data scientist and machine learning architect. He holds a PhD in theoretical physics and has been working in the field of big data and artificial intelligence for 20 years, with a particular focus on scalable systems and intelligent algorithms for mass text processing. He has been a professor at Nuremberg Institute of Technology since 2022, where his research focuses on optimizing user experience with modern methods. He is the founder of datanizing GmbH, a conference speaker, and an author of articles on machine learning and text analytics.

Meanwhile, Anthropic is keeping its Mythos language model a big secret, and Qwen is releasing an incremental update to its successful Qwen3.5 series.

Many hailed it as a revolution that would finally allow large models to run on smaller and more affordable graphics cards: Google has presented a new quantization method called TurboQuant, which causes almost no quality loss.

What is easily overlooked: Google has aligned the algorithm with the key-value cache, which requires a lot of memory during the inference phase of LLMs – the cached data grows linearly with the context length, while the attention computation over it grows quadratically. And the algorithm itself is not entirely new either.
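
To get a feel for the numbers, a quick back-of-the-envelope calculation helps; the layer and head counts below are invented for illustration and do not describe any particular model:

```python
# Back-of-the-envelope KV cache size: the cache grows linearly with
# context length. The model dimensions here are illustrative only.
n_layers, n_kv_heads, head_dim = 32, 8, 128

def kv_cache_gb(context_len: int, bytes_per_value: float) -> float:
    # keys and values are both cached, hence the factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

for ctx in (8_192, 131_072):
    print(f"{ctx:>7} tokens: "
          f"fp16 {kv_cache_gb(ctx, 2.0):5.1f} GB, "
          f"4-bit {kv_cache_gb(ctx, 0.5):5.1f} GB")
```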

TurboQuant can compress the key-value cache almost losslessly: where sixteen bits were previously needed, only four bits – or in exceptional cases, even just three bits – are now necessary. This saves a lot of space and allows for longer context lengths. The algorithm uses a clever rotation of vectors to compress them with less loss.
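
The following sketch illustrates the general rotate-then-quantize idea with a random orthogonal rotation and plain 4-bit uniform quantization. It is a minimal toy version of the principle, not Google's actual TurboQuant algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # A random orthogonal matrix via QR decomposition. Rotating first
    # spreads each vector's energy evenly across coordinates, which
    # makes low-bit uniform quantization much less lossy.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def quantize(x: np.ndarray, bits: int = 4):
    # Per-vector symmetric uniform quantization to `bits` bits.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / levels
    q = np.clip(np.round(x / scale), -levels - 1, levels)
    return q.astype(np.int8), scale

d = 128                                 # head dimension of a cache entry
R = random_rotation(d)
keys = rng.standard_normal((1000, d))   # stand-in for cached keys

q, s = quantize(keys @ R, bits=4)       # rotate, then quantize
restored = (q * s) @ R.T                # dequantize, rotate back

err = np.linalg.norm(restored - keys) / np.linalg.norm(keys)
print(f"relative reconstruction error: {err:.3f}")
```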

In the blog post on TurboQuant, Google shows that quantizing the KV cache barely increases perplexity, i.e., the model's uncertainty about the text. In daily use of the language models, there should therefore be hardly any noticeable differences. In fact, the first paper on TurboQuant was published back in April 2025, but usable implementations were lacking until now, as were test cases.
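
For readers unfamiliar with the metric: perplexity follows directly from the log-probabilities a model assigns to the tokens of a text, as this toy calculation with invented values shows:

```python
import numpy as np

# Perplexity is the exponential of the mean negative log-likelihood
# of the tokens: the lower, the less "surprised" the model is.
token_logprobs = np.array([-1.2, -0.3, -2.1, -0.7])  # made-up values
print(f"perplexity: {np.exp(-token_logprobs.mean()):.2f}")
```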


Meanwhile, many software packages, such as llama.cpp or transformers from Hugging Face, support the TurboQuant cache. In some cases, you still have to tinker a bit and install additional packages, but the memory savings are clearly measurable. The first implementations are also available for Apple's MLX framework.
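
How such a cache is switched on can be seen in the existing quantized-cache API of transformers, shown here with the quanto backend (it requires the optimum-quanto package). Whether TurboQuant will be exposed through the same cache_config interface is an assumption on my part:

```python
# Minimal example of a quantized KV cache in Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("Why do long contexts need so much memory?",
             return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=128,
    cache_implementation="quantized",                # quantize the KV cache
    cache_config={"backend": "quanto", "nbits": 4},  # keys/values in 4 bits
)
print(tok.decode(out[0], skip_special_tokens=True))
```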

It will be interesting to see what TurboQuant is suitable for beyond the KV cache. Google mentions vector databases, but there are already other effective quantization methods for that. Whether the weights of the language models can be quantized with TurboQuant in such a way that they also function effectively in inference remains to be seen.

Google has been active not only methodologically but has also released new open-weight models. The long-awaited Gemma-4 series contains many innovations and comes in sizes of effectively 2 and 4 billion parameters, an actual 31 billion parameters, plus an MoE (mixture of experts) model with 26 billion parameters, of which 4 billion are active at a time. Google is being a little tricky here: the number of effective parameters is significantly lower than the number that actually has to fit in memory. The smaller model has about five billion parameters instead of two, the larger one eight billion instead of four.

All Gemma-4 models are multimodal, optimized for agentic tasks, and can also interpret images. What now seems standard is by no means a given, as the look at DeepSeek v4 below will show. Gemma-4 models are also capable of reasoning; to keep responses from growing too long, the reasoning intensity can be adjusted.

Like many other providers, Google is also tweaking attention to save memory and computation time. The hybrid attention mechanism in Gemma 4 alternates between layers with full attention and layers with sliding window attention, in which each token attends only to a limited window of preceding tokens.
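
A small sketch shows what the two mask types look like; the strict 1:1 alternation of layers below is an assumption for illustration, not Gemma 4's actual layer schedule:

```python
import numpy as np

def attention_mask(seq_len: int, window: int | None) -> np.ndarray:
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                  # causal: tokens never look ahead
    if window is not None:
        mask &= j > i - window     # sliding window: limited lookback
    return mask

seq_len, window = 12, 4
for layer in range(4):             # assumed alternating layer schedule
    w = None if layer % 2 == 0 else window
    visible = attention_mask(seq_len, w).sum(axis=1)
    print(f"layer {layer} ({'full' if w is None else f'window={w}'}): "
          f"last token sees {visible[-1]} tokens")
```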

With Gemma 4, Google has undoubtedly achieved a major success. Among open models in the 30-to-40-billion-parameter range, Alibaba's Qwen3.5 was the undisputed leader until now; the situation is no longer so clear-cut, as Gemma 4 definitely scores points here.

Over a year has passed since the release of DeepSeek-R1 in January 2025, and the community has been waiting for DeepSeek v4 ever since. After some premature false reports about the release, the model is finally available as a preview. The wait was worth it: DeepSeek v4 comprises two models, created through an elaborate training process (pre-training on 32 trillion tokens, complex multi-stage post-training). One is very large at 1.6 trillion parameters; the size has more than doubled compared to its predecessor, even though only 49 billion parameters are active at a time. This Pro model is flanked by a Flash model with only 284 billion parameters, of which 13 billion are active. That is still very large, but it can run on powerful PCs with enough RAM; Mac Studio computers (currently unavailable) are particularly suitable.

DeepSeek has once again tinkered with attention. After the predecessor model introduced Multi-Head Latent Attention, DeepSeek v4 combines two new attention mechanisms. The details of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) remain somewhat unclear for now, but DeepSeek is said to require only 27 percent of the floating-point operations of the predecessor model for single-token inference. The KV cache even shrinks by 90 percent. It would be interesting to investigate whether this can also be combined with TurboQuant.

DeepSeek has further tweaked the architecture and introduced so-called Manifold-Constrained Hyper-Connections (mHC), in which certain connections between layers are weighted more strongly to increase stability in forward propagation. Like Moonshot with Kimi, DeepSeek now also uses the Muon optimizer instead of the usual AdamW to optimize the weights faster during training. Storing the weights in a mix of FP4 and FP8 also saves a lot of memory: the large model requires only slightly more memory than the v3.2 model, and the Flash model even fits in 158 GB, a figure that a 284-billion-parameter model could otherwise only reach through aggressive post-training quantization.
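
A quick plausibility check supports the mixed-precision claim: the Flash model's 158 GB for 284 billion parameters corresponds to roughly 4.5 bits per parameter, i.e., between FP4 and FP8:

```python
params = 284e9
size_bits = 158e9 * 8  # 158 GB expressed in bits
print(f"{size_bits / params:.2f} bits per parameter")  # ≈ 4.45
```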

Surprisingly, DeepSeek v4 is not a multimodal model but can only handle text. Perhaps DeepSeek will add this by the final release, but unfortunately, there are no forecasts yet as to when that might be. DeepSeek is unusually reticent with further information anyway; details about training and architecture are not yet publicly available.

The comparison between the small and the large model (or rather, between the huge and the gigantic one) is exciting. The smaller Flash model probably produces similarly good results if you allow it more reasoning: on logic questions, you trade parameter count for answer length. How well this works will have to be shown by a detailed test. On knowledge questions, the smaller model is, as expected, inferior.

DeepSeek v4 supports three different thinking modes, from reasoning switched off to very intensive. Compared to v3.2, the performance gains are moderate. It is surprising, however, that the Flash model performs almost as well as v3.2, which has twice as many parameters at much higher precision (DeepSeek v3 still used FP8 throughout).

DeepSeek v4 is an interesting model, but it is disappointing that the manufacturer has published significantly fewer technical details than usual. Hopefully, a paper will be published on this for the final release.

Anthropic is also in the headlines, though not only for positive reasons. In March, the company lost the source code for Claude. Independently of this, Anthropic is researching increasingly powerful models – so powerful that the security vulnerabilities they (automatically) discover can become dangerous. Mythos was therefore not made publicly available, yet it was apparently still accessible to unauthorized individuals. Just how powerful the model is can be seen, among other things, from the fact that it uncovered over 270 vulnerabilities in Firefox. It is quite possible that this will set a completely new direction in cybersecurity.

The Chinese providers beyond DeepSeek are not resting either. Qwen has released a minor update to some Qwen3.5 models. Qwen3.6 is available as a Max preview only via API, while smaller models – an MoE with 35 billion parameters, of which three billion are active, and a dense model with 27 billion parameters – are openly available. The dense model in particular proves extremely powerful and outperforms significantly larger models; whether that holds up will have to be shown by more detailed analyses.

Qwen3.5 uses so-called Mamba layers in its hybrid attention. For state-space models of this kind, there is now a new architecture, Mamba-3, which may allow further improvements in the corresponding layers and enable even longer contexts. So far, however, there are no powerful models built on this architecture.
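
To make the idea concrete: a state-space layer replaces attention over the whole context with a fixed-size recurrent state. The following minimal recurrence shows the principle; real Mamba layers make the matrices input-dependent and use a parallel scan, which this sketch omits:

```python
import numpy as np

# Plain linear state-space recurrence: the layer carries a fixed-size
# state, so memory per token stays constant in sequence length.
rng = np.random.default_rng(1)
d_state, d_model, T = 16, 8, 100
A = 0.95 * np.eye(d_state)                      # state decay
B = 0.1 * rng.standard_normal((d_state, d_model))
C = 0.1 * rng.standard_normal((d_model, d_state))

x = rng.standard_normal((T, d_model))
h = np.zeros(d_state)
ys = []
for t in range(T):
    h = A @ h + B @ x[t]                        # constant-size update
    ys.append(C @ h)
print(np.stack(ys).shape)                       # (100, 8): one output per step
```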

The company Tesslate has fine-tuned a model based on Qwen3.5-9B with coding data generated by Claude Opus. The result is OmniCoder, a model extremely well suited to coding tasks that outperforms its base model in every dimension. It is also small enough to run locally in a quantized version.

Moonshot has followed up with Kimi K2.6. The model focuses particularly on coding and agentic tasks, which it can perform in swarms of up to 300 agents. As usual, the model is very large at one trillion parameters and is hard to run on affordable hardware.

MiniMax has released version 2.7 of its MoE model. What is remarkable is that the model itself was used for the update: it updated its own memory, “invented” dozens of skills for reinforcement learning, and autonomously improved its optimization process. MiniMax AI claims performance gains of up to 30 percent. It will be exciting to see whether other providers take a similar route and improve their models this way.

As models continue to grow, providers are exploring new quantization methods alongside their tweaks to the ubiquitous attention mechanism. An interesting candidate is JANG quantization, which dynamically selects the optimal quantization level and thus saves enormous amounts of space. Currently, the software is only available on Macs, where it makes it possible to run a model with 397 billion parameters in 128 GB of RAM. My own tests have shown that this works surprisingly well and is relatively easy to set up with MLX Studio. The corresponding Python tools are, however, not always completely up to date.
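
The claim can be sanity-checked with the same bits-per-parameter arithmetic as above: 397 billion parameters in 128 GB of RAM leave an average of barely 2.6 bits per weight, even before activations and operating system overhead, so a dynamic scheme has to push many weights below 3 bits:

```python
params = 397e9
ram_bits = 128e9 * 8  # 128 GB of RAM expressed in bits
print(f"{ram_bits / params:.2f} bits per parameter")  # ≈ 2.58
```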

OpenAI has publicly released GPT-5.5. It achieves new best scores particularly in coding and also handles multi-stage tasks well. Unlike GPT-5.4, it also seems to cope better with foreign languages; the first sentence of the previous release's blog post sounded at least strange in German: “Today we are releasing GPT-5.4 in ChatGPT (as GPT-5.4 Thinking), the API and Codex.”

Even though most of the work on language models happens in China and the USA, there is news from the rest of the world, including Europe: Cohere from Canada and the German company Aleph Alpha are planning a merger. Together, they want to create a language model champion. A while ago, Cohere's models were at the forefront, but they have since become almost insignificant. Perhaps a restart is what is needed, and the merger might deliver it.

What a spring! Great progress can be observed in language models. Combining the ideas of recent weeks – clever attention mechanisms, improved quantization, and optimized training – promises significantly better models and further innovations in the near future.

Particularly exciting is that the growth curve for memory requirements is flattening. This gives hope that at least some of these models can soon be run confidently on powerful hardware of one's own.

(mki)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.