AI model BERT learns European languages: EuroBERT presented

Thanks to multilingual training, the EuroBERT model offers better performance in European languages and is also suited to reasoning about mathematics and code.

(Image: Andrey Suslov/Shutterstock.com)

By Dr. Christian Winkler

A consortium of research institutions and industry partners such as the AI platform Hugging Face has presented the multilingual encoder model EuroBERT, which aims to improve performance in European and global languages. EuroBERT is said to be optimized for document-level tasks, supports context sequences with 8192 tokens and provides capabilities for multilingual retrieval, classification, regression as well as math and code understanding.

Prof. Christian Winkler

is a data scientist and machine learning architect. He holds a PhD in theoretical physics and has been working in the field of big data and artificial intelligence for 20 years, with a particular focus on scalable systems and intelligent algorithms for mass text processing. He has been a professor at Nuremberg Institute of Technology since 2022, where his research focuses on optimizing user experience with modern methods. He is the founder of datanizing GmbH, a conference speaker, and the author of articles on machine learning and text analytics.

The model is suitable for classification tasks via fine-tuning, but also serves as a basis for embedding models. According to the consortium's own benchmarks, EuroBERT is ahead of the competition in many areas.
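
What fine-tuning for a classification task can look like is shown in the following sketch. The repository name, the trust_remote_code flag and the example dataset (gnad10, a German news corpus with nine categories) are assumptions made for illustration, not details taken from the paper:

    # Minimal fine-tuning sketch for text classification with EuroBERT.
    # Repository name, trust_remote_code flag and dataset are assumptions.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL_ID = "EuroBERT/EuroBERT-210m"          # assumed repository name
    dataset = load_dataset("gnad10")             # German news, 9 categories (assumed)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=9, trust_remote_code=True)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="eurobert-gnad10",
                               num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,                     # enables dynamic padding
    )
    trainer.train()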

The EuroBERT models are available in three sizes (210 million, 610 million, and 2.1 billion parameters). A great deal of computing capacity went into the training: the largest model required over 12 GPU years, and training the smaller models also took several GPU years. Fine-tuning is of course much faster, although the memory it requires should not be underestimated: even the medium-sized model needs around 14 GB of RAM because the weights are stored as float32.
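
Anyone who wants to reduce the memory footprint can load the weights in half precision instead. The following sketch is only an illustration; the repository name and the trust_remote_code flag are assumptions:

    # Sketch: load a EuroBERT checkpoint in bfloat16 to roughly halve the
    # memory needed compared to float32. Model id is an assumed repository name.
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_ID = "EuroBERT/EuroBERT-610m"  # assumed repository name

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,   # half-precision weights instead of float32
        trust_remote_code=True,       # custom model code on the hub (assumption)
    )

    inputs = tokenizer("EuroBERT versteht auch deutsche Texte.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)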

The open training approach behind EuroBERT is interesting. According to the paper, around six percent of the training data was German, drawn from the CulturaX corpus. Although that is little compared with the 41 percent drawn from the FineWeb corpus, it is still significantly more than in the earlier ModernBERT models. This is also reflected in the vocabulary: EuroBERT uses 128,000 tokens, while ModernBERT has to make do with around 50,000, since covering many languages requires more tokens.
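
The effect of the larger vocabulary can be checked directly with the tokenizers. In the following sketch, the repository names are assumptions:

    # Sketch: compare how many tokens each tokenizer needs for a German sentence.
    # Both repository names are assumptions.
    from transformers import AutoTokenizer

    text = "Die Vertragsbedingungen gelten ab dem ersten Januar des Folgejahres."

    euro = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", trust_remote_code=True)
    modern = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

    print("EuroBERT tokens:  ", len(euro.tokenize(text)))
    print("ModernBERT tokens:", len(modern.tokenize(text)))
    # A larger multilingual vocabulary typically splits German words into fewer pieces.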

With all the fuss about generative language models, encoder models such as BERT are often forgotten, even though they play a major role in many business applications. BERT stands for Bidirectional Encoder Representations from Transformers, an open-source natural language processing method originally introduced by Google in 2018, around which an ecosystem has long since formed. Such models can be used, for example, to assign texts to categories (classification), detect sentiment (sentiment analysis), or implement semantic search (information retrieval, a precursor to retrieval-augmented generation).
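
For the semantic-search use case, a sketch might look like this: texts are mean-pooled into vectors and compared by cosine similarity. The EuroBERT repository name and the trust_remote_code flag are again assumptions:

    # Sketch: sentence embeddings via mean pooling over the last hidden state,
    # compared with cosine similarity. Repository name is an assumption.
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_ID = "EuroBERT/EuroBERT-210m"          # assumed repository name
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state        # (batch, tokens, dim)
        mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

    docs = ["Wie kündige ich meinen Vertrag?", "Rezept für Apfelkuchen"]
    query = embed(["Vertragskündigung"])
    scores = torch.cosine_similarity(query, embed(docs))     # higher = more similar
    print(scores)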

As with generative models, training is very time-consuming here too, but the models can be adapted to individual requirements relatively easily. This works particularly well if the base model is already well pre-trained for the relevant domain. Although powerful multilingual embedding models exist, many base models are unfortunately trained primarily on English texts, and in that case fine-tuning with German texts does not always produce good results. Models trained specifically for German are available from Google or the Bavarian State Library, for example, but they are many years old and no longer state of the art.

At the end of 2024, Answer.AI and Hugging Face refreshed encoder models with the ModernBERT architecture. Many optimizations known from generative language models (such as Flash Attention) were built into the encoder architecture, and the training process was improved with these findings as well, resulting in base models that are both very solid and fast. However, these too are trained primarily on English-language texts. EuroBERT uses the ModernBERT architecture as its basis.


EuroBERT can be used wherever BERT has been used so far. This does not automatically lead to better results, which is why it makes sense to define a performance metric (such as accuracy or F1 score) and compare models against it. In our tests with German-language texts, we achieved results at least as good as with the (old) German-language models, and significantly better than with ModernBERT, while reaching our goal faster thanks to the modern architecture. At the same time, there is the advantage of being able to work with much longer texts (a context length of 8192 tokens) and to switch easily to other languages.
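
Such a comparison boils down to evaluating all candidate models on the same held-out data. A minimal sketch with made-up toy labels and predictions, purely for illustration:

    # Sketch: compare two fine-tuned classifiers on the same held-out set
    # using accuracy and macro F1. All labels/predictions below are invented.
    from sklearn.metrics import accuracy_score, f1_score

    y_true       = ["Sport", "Wirtschaft", "Kultur", "Sport"]        # gold labels (toy data)
    y_pred_euro  = ["Sport", "Wirtschaft", "Kultur", "Wirtschaft"]   # EuroBERT fine-tune (hypothetical)
    y_pred_gbert = ["Sport", "Kultur",     "Kultur", "Sport"]        # older German BERT (hypothetical)

    for name, y_pred in [("EuroBERT", y_pred_euro), ("GBERT", y_pred_gbert)]:
        print(f"{name}: accuracy={accuracy_score(y_true, y_pred):.2f}, "
              f"macro F1={f1_score(y_true, y_pred, average='macro'):.2f}")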

(olb)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.