Apertus tested: How the multilingual AI model performs
With Apertus, Swiss researchers have released an open-source and transparent large language model, although it cannot catch up with the frontrunners.
Researchers from Switzerland have published the AI model Apertus (Image: EPFL, ETH Zurich, CSCS / Molinari Design)
- Dr. Christian Winkler
Most major language models focus on English as their main language, but also understand and generate text in other widely spoken languages. Lesser-spoken languages, however, are often neglected. And even with so-called open models, which make their weights publicly available, the underlying training data is usually not disclosed. Apertus, a large language model developed in Switzerland, aims to change all this. Its developers are ETH Zurich, EPFL and the Swiss National Supercomputing Centre (CSCS).
Apertus had been announced for some time; at the beginning of September, the institutions published the model in variants with eight billion and 70 billion parameters. Apertus is available on Hugging Face under the Apache 2.0 licence, although before downloading it you have to agree not to hold the institutions liable for any damages. Apertus is also intended to handle less common languages such as Swiss German or Romansh, something that has not been available so far.
Model architecture
Both models support a context size of 64,000 tokens. The smaller model has 32 attention heads and the same number of layers, while the larger one has 64 attention heads and 80 layers. The models use a custom architecture and require the latest version of the transformers library, or the vLLM or SGLang frameworks, to work correctly. llama.cpp does not yet support the models, but Apple's MLX framework does, which means they can also run quantized if the GPU lacks the necessary RAM.
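To illustrate why quantized execution matters for the 70B model, here is a back-of-the-envelope estimate of the weight-only memory footprint (a rough sketch, not official figures; it ignores the KV cache and framework overhead):

```python
def model_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory footprint in GiB; ignores KV cache,
    activations and framework overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# The 70B model in bf16 (16 bits per weight) versus 4-bit quantization:
print(round(model_memory_gib(70, 16)))  # about 130 GiB
print(round(model_memory_gib(70, 4)))   # about 33 GiB
```

Even at 4 bits per weight, the 70B variant still needs a GPU (or unified memory, as on Apple silicon) in the tens of gigabytes.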
The model is documented in detail. Special features of the architecture are the xIELU activation function and QK-Norm in the attention layers; the latter is currently also the reason llama.cpp does not support the models. Otherwise, like many other models, Apertus uses grouped-query attention and RoPE scaling. The chat template takes some getting used to and has been much discussed.
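The idea behind QK-Norm can be sketched briefly: query and key vectors are normalized (commonly with an RMS norm) before the attention dot product, which bounds the attention logits and stabilizes training. A minimal illustration of the principle, not Apertus's actual implementation:

```python
import math

def rms_norm(v, eps=1e-6):
    """Scale a vector so its root-mean-square is 1."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]

def attention_logit(q, k):
    """QK-Norm: normalize query and key before the scaled dot product,
    so the logit magnitude stays bounded regardless of vector scale."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

Because the normalized vectors have unit RMS, the logits cannot blow up as training pushes query/key magnitudes around, a known source of instability in large models.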
(Image: https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509/blob/main/chat_template.jinja)
Model training
The group put a lot of effort into the training and tried out several new techniques. These include the AdEMAMix optimizer, used instead of the more common AdamW; new ideas have recently emerged in this area, such as Moonshot training its Kimi-K2 model with the Muon optimizer. As its loss function, Apertus uses Goldfish Loss instead of plain cross-entropy. Goldfish Loss is designed to prevent models from memorizing token sequences and thus from repeating training material verbatim. The group also used a new learning-rate scheduler, Warmup-Stable-Decay, which enables continuous training, so the total length of the training does not have to be fixed in advance.
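The core idea of Goldfish Loss can be sketched roughly: a pseudorandom subset of token positions (about 1 in k, chosen by hashing the local context so the choice is reproducible) is excluded from the loss, so the model never receives a complete gradient signal for any verbatim passage. A simplified illustration of the masking step, not the Apertus training code:

```python
import hashlib

def goldfish_mask(token_ids, k=4, h=13):
    """Return a boolean mask: True = position contributes to the loss.
    Each position is dropped (False) with probability ~1/k, decided by a
    hash of the h preceding tokens, so identical passages in different
    documents always drop the same positions."""
    mask = []
    for i in range(len(token_ids)):
        ctx = tuple(token_ids[max(0, i - h):i + 1])
        digest = hashlib.sha256(repr(ctx).encode()).digest()
        mask.append(int.from_bytes(digest[:4], "big") % k != 0)
    return mask

tokens = list(range(200))
mask = goldfish_mask(tokens)
print(sum(mask) / len(mask))  # roughly 3/4 of positions keep their loss
```

Because the drop decision is a deterministic function of the local context rather than a random draw, a memorized sequence is always missing the same tokens, which is what breaks verbatim reproduction.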
The researchers explain their design decisions in detail in their paper and show that the chosen alternatives lead to faster convergence in a small model, i.e. they need fewer training steps. However, another experiment, in which the researchers trained the model with FP8 weights, failed.
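The Warmup-Stable-Decay schedule can be sketched as a simple piecewise function; because the stable plateau is open-ended, the decay phase can simply be postponed to extend training (a generic sketch with made-up step counts, not the exact Apertus hyperparameters):

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, total_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay.
    Training can be prolonged by moving decay_start/total_steps outward,
    since the plateau has no inherent end."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # linear warmup
    if step < decay_start:
        return peak_lr                            # stable plateau
    frac = min((step - decay_start) / (total_steps - decay_start), 1.0)
    return peak_lr + (min_lr - peak_lr) * frac    # linear decay to min_lr

# Example: peak 1.0, 100 warmup steps, decay over the last 200 of 1000 steps
for s in (50, 400, 900, 1000):
    print(s, wsd_lr(s, 1.0, 100, 800, 1000))
```

In contrast to a cosine schedule, which bakes the total training length into the curve from step one, the plateau here is what makes "continuous training" possible.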
Not only is the training procedure well documented, the selected training data is also disclosed. It mainly consists of FineWeb in several variants, StarCoder and FineMath, as well as the free part of Common Pile. The developers even take the current version of the source websites' robots.txt into account retrospectively and have cleaned up the data accordingly. Apertus thus joins the small group of models, such as OLMo or SmolLM, that really publish everything and are therefore as transparent as possible.
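The retroactive robots.txt cleanup can be sketched with Python's standard library: re-check each source site's current rules and drop documents a crawler would no longer be allowed to fetch. A minimal illustration with made-up URLs and inline rules, not the team's actual pipeline:

```python
from urllib import robotparser

# Rules are supplied inline here instead of being fetched over the network;
# in a real pipeline one would download each site's current robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

urls = ["https://example.org/article", "https://example.org/private/page"]
kept = [u for u in urls if rp.can_fetch("*", u)]
print(kept)  # only the /article URL survives the filter
```

Applying today's robots.txt to data crawled earlier goes beyond the usual practice of honoring the rules only at crawl time.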
Trying out Apertus
The tech report already indicates that Apertus cannot compete with the current state-of-the-art models. To assess its capabilities, you have to try it out for yourself. These tests only look at the 70B model; the smaller eight-billion-parameter variant consistently fell well short in comparison. The first test probes the model's general knowledge of the Heise publishing house:
Prompt:
Explain the Heise Zeitschriftenverlag!
Answer:
Heise Zeitschriftenverlag is a German publishing house specializing in the publication of trade journals and online content in the field of information technology (IT). The publishing house was founded in 1996 and is based in Hanover, Germany.
Heise Zeitschriftenverlag's best-known publications include:
- c't (Computer Technik): One of the best-known and most widely read IT trade journals in Germany. It is published weekly and covers topics such as hardware, software, networks, security, smart home, mobility and much more.
- iX (Information from the world of IT): Another weekly trade magazine aimed at IT professionals and developers. It offers in-depth articles, interviews and reports on current developments in IT.
- Telepolis: An online magazine that deals with the social, political and ethical aspects of information technology. It is published daily and offers background reports, analyses and commentaries.
- heise online: The publisher's online platform that offers the latest news, background reports, test reports and interviews on IT topics. It is one of the leading IT news sources in the German-speaking world.
- heise Developer: An online portal aimed at developers, covering topics such as programming, software development, cloud technologies, DevOps and more.
- heise Security: An online portal that specializes in IT security. It offers the latest news, background reports, test reports and tips on security issues.
- heise online store: An online store where readers and interested parties can buy books, magazine subscriptions and other IT products.
Heise Zeitschriftenverlag is known for the high quality and depth of its reporting, as well as its independent and critical approach to IT topics. It is an important player in German IT journalism and enjoys a high reputation in the industry.
Besides the founding date, which many models get wrong, the information is a colorful mix of correct and incorrect. The model clearly fails the question about the number of 'e's in 'strawberry':
Prompt:
How many 'e's are in 'strawberry'?
Answer:
2
The answer is 2 because the word 'strawberry' contains two 'e's. The word 'strawberry' contains two 'e's. The correct answer is 2.
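For reference, the correct count is quickly verified:

```python
# Count occurrences of the letter 'e' in 'strawberry'
print("strawberry".count("e"))  # prints 1, not 2
```

Letter-counting famously trips up language models because they operate on tokens rather than characters.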
Apertus readily accepts a question posed in French and answers it in French (the exchange is reproduced here in English translation):
Prompt:
Explain the algorithm for searching a list?
Answer:
The algorithm for searching in a list is a method for finding a specific element in a list. To explain this algorithm, we'll use a simple example. Let's assume that we have a list of integers, for example: [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 7]. We want to find the position of element 5 in this list.
Here are the steps in the algorithm for searching a list:
- Start by initializing a variable, such as `index`, to 0. This variable will enable us to track our progress through the list.
- Go through the list element by element, starting with the first element (index 0).
- For each element in the list, compare the current element with the element we're looking for (in our example, 5).
- If the current element is equal to the element we're looking for, we've found the element and can stop the algorithm.
- If the current element is not equal to the element we're looking for, increment `index` to move on to the next element.
- If we've gone through the whole list without finding the element, this means that the element isn't present in the list. In this case, the algorithm may return `None` (or a value indicating that the element has not been found) or raise an exception, depending on how you wish to handle the situation.
- If the element has been found, return the element's index (i.e. its position in the list).
Here's how this translates into code, using a programming language like Python :
def chercher_element(liste, element):
    # Set index to 0
    index = 0
    # Browse the list
    while index < len(liste):
        # Check if the current element is equal to the searched element
        if liste[index] == element:
            # Element found, return index
            return index
        # If element not found, increment index
        index += 1
    # If element not found after going through entire list
    return None  # Or raise an exception, according to convention

# Example of use
liste = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 7]
element = 5
resultat = chercher_element(liste, element)
print(resultat)  # Should display 3, as 5 is at position 3
Detailed explanation:
- Initialization: Initialize `index` to 0. This is the current position in the list.
- Browse: Each element of the list is browsed, incrementing `index` at each iteration.
- Comparison: For each element, the current element is compared with the element being searched for. If it's the same, the index is returned.
- Non-discovery handling: If the element is not found after going through the whole list, return `None` (or a value indicating that the element has not been found).
- Return: If the element is found, return the index. Otherwise, return `None`.
Complexity:
- Time: O(n), where n is the length of the list. The list is scanned once in the worst case (when the element is not found).
- Space: O(1), as only one variable is used to track the index.
This algorithm is simple and efficient for searching for an element in a list, but it can be optimized in certain cases (for example, by using a dichotomous search if the list is sorted).
However, for an unsorted list, this method is generally the simplest and most straightforward.
The answer is complete, and the generated Python program works. The French explanation is also excellent. What stands out, however, is the punctuation: spaces around punctuation marks are not always placed correctly (note "Python :" above). This happens frequently with the model; the cause is unclear.
If you give the model input in Friulian, the corresponding output also appears in this language:
Prompt:
No pues vignî fûr usgnot, o ai di studiâ. (Friulian: "I can't come out tonight, I have to study.")
Answer:
Yes, no pues vignî fûr usgnot, ma o ai di studiâ. O ai di studiâ par podê jessi plui utile e informatîf par te. Cun la mia conoscence fin a la date di 2024, no sai ce che al sucêt dopo. Par informazions plui agiornadis, ti consili di cjoli une font di informazion plui agiornade.
For other, less meaningful questions, answers are given in English:
Prompt:
lavorio jo?
Answer:
It seems like you've entered a phrase or word that doesn't make sense in English. Could you please provide more context or clarify what you mean by "lavorio"? If you're looking for information or assistance on a specific topic, feel free to ask!
The model also speaks Ladin, but there were many repetitions in the test:
Prompt:
Les alghes é n alimënt popolar tla gastronomia orientala, che davagna tres de plö lerch sön nostes mëses. (Ladin: "Algae are a popular food in oriental cuisine that is gaining ever more space on our tables.")
Answer:
I alghes, or algae, son alimënts popolar tla gastronomia orientala, en special tla cuschina giapunesa, chinaisa e coreana. I alghes son ricch de minerals, vitamins e proteines e i g'ha inportanza nutritiva e culinaria.
I alghes i g'ha un gust umami, che i g'ha rendüü populars tla cuschina moderna, en special tla cuschina vegana e vegetariana. I g'ha inportanza culturala e i g'ha stata utilisads per millenni per la su nutriziun e per la su savor.
I alghes i g'ha divers tip, cuma la wakame, la nori, la kombu, la dulse e la kelp, ognün cun caracteristiches e uss differents.
I alghes i g'ha inportanza ecologica, per via che i cresch in manera sostenibla e i g'ha besogn de pöc risors per crescher.
I alghes i g'ha inportanza culturala e i g'ha stata utilisads per millenni per la su nutriziun e per la su savor.
I alghes i g'ha inportanza ecologica, per via che i cresch in manera sostenibla e i g'ha besogn de pöc risors per crescher.
I alghes i g'ha inportanza culturala e i g'ha stata utilisads per millenni per la su nutriziun e per la su savor.
I alghes i g'ha inportanza ecologica, per via che i cresch in manera sostenibla e i g'ha besogn de pöc risors per crescher.
I alghes i g'ha inportanza culturala e i g'ha stata utilisads per millenni per la su nutriziun e per la su savor.
I alghes i g'ha inportanza ecologica, per via che i cresch in manera sostenibla e i g'ha besogn de pöc risors per crescher.
Las alghes èn anc utilisadas en la cuschina vegana e vegetariana, per via da lur valur nutritiva e da lur textura interessanta.
Per mangiar, las alghes èn spesso salsadas u seciadas. I pò esser utilisadas en salats, salsas, sòps u sco spessura.
Las alghes èn anc utilisadas en la cuschina vegana e vegetariana, per via da lur valur nutritiva e da lur textura interessanta.
En Asia, las alghes èn part integranta da la cuschina tradiziunala, sco en la cuschina giapunaisa, chinaisa u tailandaisa.
En Europa, las alghes èn plüssa utilisadas en la cuschina mediterrana, sco en la cuschina italiana, spagnola u franzosa.
Las alghes èn anc utilisadas en la cuschina vegana e vegetariana, per via da lur valur nutritiva e da lur textura interessanta.
[...]
The model was unable to answer questions in Romansh well in this test:
Prompt:
Co haveis vus num? (Romansh: "What is your name?")
Answer:
I'm sorry, but I don't understand your question. It seems like it might be a mix of languages or a typo. Could you please rephrase or clarify what you're asking?
However, translations into Romansh succeed well:
Prompt:
Translate the following text into Rhaeto-Romanic: Heise Zeitschriftenverlag is a German publishing house specializing in the publication of trade journals and online content in the field of information technology (IT). The publishing house was founded in 1996 and is based in Hanover, Germany.
Answer:
Il Heise Zeitschriftenverlag è in editur tudestg ch'è spezialisà sin la publicaziun da revistas spezialisadas e cuntegns online en il sectur da la tecnologia d'infurmaziun (IT). Il editur è stà fundà en 1996 e ha sedia a Hannover, en Germania.
Some errors, especially the repetitions in the less common languages, are surprising; they are presumably related to the small amount of training data. Perhaps tuning the temperature and other sampling settings could help.