GPT-OSS: A look inside OpenAI's open models

For the first time in years, OpenAI has released the weights of a new model. GPT-OSS focuses primarily on efficiency and brings a new approach to prompting.

By Prof. Christian Winkler

The long wait for an open-weight model from OpenAI has come to an end: the company released GPT-OSS on August 5. A closer look reveals that the wait was worth it: the model works excellently and contains many innovations. It is also available under the very liberal Apache 2.0 license.

Prof. Christian Winkler

is a data scientist and machine learning architect. He holds a PhD in theoretical physics and has been working in the field of big data and artificial intelligence for 20 years, with a particular focus on scalable systems and intelligent algorithms for mass text processing. He has been a professor at Nuremberg Institute of Technology since 2022, where his research focuses on optimizing user experience with modern methods. He is the founder of datanizing GmbH, a conference speaker, and the author of articles on machine learning and text analytics.

OpenAI has actually published not one, but two models. In addition to the large 120B model with 117 billion parameters, there is also a small 20B model with 21 billion parameters.

Both models use a mixture-of-experts architecture, so only a fraction of their parameters is active in the computation during inference. This is particularly pronounced in the large model, which uses only four of its 128 experts per token. As a result, there is no major difference in the number of active parameters between the two models. The smaller model is therefore not much faster, but it requires significantly less RAM (more on this later).

Model                GPT-OSS-120B   GPT-OSS-20B
Parameters           117 billion    21 billion
Active parameters    5.1 billion    3.6 billion
Layers               36             24
Experts              128            32
Active experts       4              4
Attention heads      64             64
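
To make the routing concrete, here is a minimal NumPy sketch of top-4 expert selection; the dimensions, gating, and weighting are illustrative assumptions, not OpenAI's actual implementation:

```python
import numpy as np

def moe_layer(hidden, gate_weights, experts, k=4):
    """Toy top-k mixture-of-experts routing.

    hidden:       (seq_len, d_model) token representations
    gate_weights: (d_model, n_experts) router projection
    experts:      list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = hidden @ gate_weights                 # (seq_len, n_experts)
    top_k = np.argsort(logits, axis=-1)[:, -k:]    # k best experts per token
    out = np.zeros_like(hidden)
    for t, token in enumerate(hidden):
        scores = np.exp(logits[t, top_k[t]])
        weights = scores / scores.sum()            # softmax over chosen experts
        for w, e in zip(weights, top_k[t]):
            # only k of n_experts run per token -- this is why the
            # "active" parameter count stays far below the total
            out[t] += w * experts[e](token)
    return out

# toy usage: 32 experts, of which only 4 run per token
rng = np.random.default_rng(0)
d, n_exp = 64, 32
experts = [lambda x, W=rng.normal(size=(d, d)) / d: x @ W for _ in range(n_exp)]
out = moe_layer(rng.normal(size=(10, d)), rng.normal(size=(d, n_exp)), experts)
```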


The layer architecture is interesting: OpenAI alternates between full attention, in which every token can look at the entire context, and so-called sliding-window attention, in which each token only sees a small, overlapping segment of the context. The sliding-window variant requires significantly less memory and computing time, but copes less well with long contexts; the full-attention layers in between compensate for that.
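
A small sketch of the two mask types; the window size of 128 is the value reported for GPT-OSS, but treat it here as an assumption for illustration:

```python
import numpy as np

def causal_mask(seq_len):
    # full attention: every token may attend to all earlier tokens
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window=128):
    # windowed attention: every token sees at most `window` earlier
    # tokens, so cost grows linearly instead of quadratically
    idx = np.arange(seq_len)
    return causal_mask(seq_len) & (idx[None, :] > idx[:, None] - window)

# alternating layers as in GPT-OSS: cheap windowed attention in every
# other layer, full attention in between to preserve long-range context
masks = [sliding_window_mask(1024) if i % 2 == 0 else causal_mask(1024)
         for i in range(4)]
```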

The model card on Hugging Face states that the large model can be run on a single H100 GPU. This is surprising at first, as 117 billion parameters would be too large even in the economical FP8 format (8-bit floating point) that DeepSeek uses. However, OpenAI has economized even further and published the weights in the still more compact MXFP4 format (Microscaling 4-bit Floating Point), which again halves the memory requirement. The weights of the large model thus fit into roughly 60 GB of RAM. The disadvantage: only recent Nvidia GPUs can compute efficiently with this format, such as the Hopper-based H100 or Blackwell cards like the RTX 5090.
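
The back-of-the-envelope arithmetic behind these numbers, as a quick plausibility check:

```python
# rough weight-memory estimate for GPT-OSS-120B; ignores the KV cache
# and activations, and the small per-block scale overhead of MXFP4
# (an 8-bit scale shared by 32 values adds about 0.25 bits per weight)
params = 117e9

for name, bits in [("bfloat16", 16), ("FP8", 8), ("MXFP4", 4)]:
    print(f"{name:>8}: {params * bits / 8 / 1e9:6.1f} GB")

# bfloat16:  234.0 GB
#      FP8:  117.0 GB
#    MXFP4:   58.5 GB   <- roughly the 60 GB quoted above
```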

The models do run on older GPU generations, but there they require four times as much memory. One could be forgiven for suspecting cross-marketing with Nvidia here. Nevertheless, it is remarkable that within just one year the established bfloat16 format has been pushed down to four bits (at least for these models), so that only a quarter of the memory is required.

OpenAI also allows the reasoning effort of the GPT-OSS models to be configured, i.e. one can specify how extensively the models should lay out their thoughts. This is extremely useful, because some models are too talkative in reasoning mode and generate a great many tokens. Not only does one then have to read long explanations and wait for them to be generated, one also has to pay for all those tokens. How well the setting really works remains to be seen in practice.
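
In practice, the effort level can be passed per request when GPT-OSS runs behind an OpenAI-compatible server (vLLM and Ollama, for example, serve it this way). The local URL is a placeholder, and that the server honors `reasoning_effort` for GPT-OSS is an assumption based on its documentation:

```python
from openai import OpenAI

# hypothetical local endpoint serving the small model
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain MXFP4 in two sentences."}],
    reasoning_effort="low",  # "low", "medium" or "high"
)
print(response.choices[0].message.content)
```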

With Alibaba's hybrid Qwen3 models, reasoning can only be switched off by adding /no_think to the prompt, which is not very flexible. OpenAI has put more thought into this and defined a new chat format: the Harmony response format is much more flexible than all previous chat templates and allows many ways of interacting with the models.

On closer inspection, it is almost astonishing that the chat templates, which now seem outdated, were retained for so long. Trying out the Harmony code reveals that the knowledge cut-off of GPT-OSS is June 2024, so the most recent training data of the model is over a year old. The fact that Harmony also ships with Rust code could be an indication that OpenAI uses the language internally to make its software more efficient.

Harmony is a much more flexible format than the previous chat templates. It allows more meta-instructions and so-called channels, which the model also takes into account when responding. For all its advantages, Harmony has one disadvantage: processing the additional areas such as rules and channels produces many tokens. Even reduced reasoning effort cannot compensate for the resulting loss of efficiency.
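
Schematically, a Harmony conversation looks roughly like this. The special tokens follow the published openai/harmony documentation; the message contents are invented for illustration:

```python
# schematic Harmony transcript (contents invented): the system message
# sets the reasoning level and the valid channels; the model then
# separates its chain of thought ("analysis") from the user-facing
# answer ("final")
harmony_transcript = """\
<|start|>system<|message|>You are a helpful assistant.
Reasoning: medium
# Valid channels: analysis, commentary, final<|end|>
<|start|>user<|message|>What is MXFP4?<|end|>
<|start|>assistant<|channel|>analysis<|message|>The user asks about a
number format. Recall: microscaling, 4-bit floats, shared scales...<|end|>
<|start|>assistant<|channel|>final<|message|>MXFP4 is a compact 4-bit
floating-point format with scaling factors shared per block.<|end|>"""
```

Every extra rule and channel that the model has to read and write costs tokens, which is the efficiency penalty mentioned above.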

GPT-OSS is an agentic model that can call functions. OpenAI goes one step further and has recently also enabled web browsing. However, providers such as Anthropic have been letting users control the browser with their models for some time, and Perplexity even offers a browser of its own. GPT-OSS can also execute Python code; how trustworthy the generated code is cannot be judged immediately.
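
A minimal sketch of such a function call over the same OpenAI-compatible API; the weather tool and its schema are invented for illustration:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# invented example tool -- the model only ever sees this JSON schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "How warm is it in Nuremberg?"}],
    tools=tools,
)

# if the model decides to call the tool, name and JSON arguments
# arrive in a structured tool_calls field instead of the text
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```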

OpenAI is just as silent about the details of the training process as it is about the data used for it. Presumably everyone is cooking their own soup here, and the Chinese model providers keep equally quiet about this. Only Olmo from the Allen Institute for AI and SmolLM from Hugging Face have truly published all the details.


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.