Make AI models open source, but do it right!

An open source license is not enough: the makers of AI models should open source them fully, including code and training data, demands Holger Voormann.

Broken chains (Image: Romolo Tavani/Shutterstock.com)

By Holger Voormann

Thanks to DeepSeek, Meta, Mistral, Microsoft, Alibaba, Google and all the others who publish AI models and their parameters under an open source license. But open weights models are not yet open source AI. Be brave, take a leaf out of Ai2's book and publish all the data and code you used to create the models! In my opinion, there are four good reasons to do so.

Holger Voormann

Holger Voormann is a freelance computer scientist. He regularly reports on new releases of the Eclipse development environment for heise online. He is a contributor to Eclipse, llama.cpp and other open source projects.

Firstly, you would be on the winning team. You surely have a few secret tricks up your sleeve for training your models, but you have tried many, though by no means all, of the possibilities in the search for the best approach. Get together and pool your innovative power. It took only four months and eight days from OpenAI's o1-preview, the first reasoning model, non-downloadable and with hidden reasoning, to the freely available DeepSeek R1, which spurred the development of further reasoning models. In the worst case, you lose a head start only briefly.

Secondly, it saves resources: human resources, time and computing power. The latter is a real problem. Despite efficiency gains in training and execution, models today typically require more computing power than in the past: they are larger, more synthetic data is used in training, and current reasoning models often generate more tokens for their reasoning than for the answer itself. Everyone going it alone like this causes unnecessary climate damage.

Thirdly, it would only be fair to publish the training data, because it is not your data. The code on which you train your models, for example, which is essential for their reasoning abilities, comes from open source projects. For much of the data, it is also legally unclear whether it may be used for training at all. And because you do not openly state exactly what data you are using, the legal uncertainty will persist for a long time to come.

Fourthly, it would be helpful for users. A look at the training data would make prompt engineering less of a blind exercise and allow things to be tried out in a more targeted way: What is the best way to format a table in a query, in Markdown, HTML or LaTeX? Are line wrapping and indentation helpful, harmful or just a waste of tokens? Google recently published its Gemma 3 models as open weights. The models are supposed to be able to handle function calling; what is missing are instructions on exactly how to do this. Just as in traditional software development it is said that the truth lies in the code, with AI models it lies in the training data and the training code.
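To make the table question concrete, here is a minimal sketch in Python. It counts how many tokens the same small table costs in Markdown versus HTML, using OpenAI's tiktoken tokenizer as a stand-in; the tokenizers of most open weights models differ, and the table contents are made up. What such a count cannot tell you, without access to the training data, is which format a given model actually saw most often, and that is exactly the point.

import tiktoken

# The same small table, once in Markdown and once in HTML.
markdown_table = (
    "| Model | Params |\n"
    "|-------|--------|\n"
    "| Gemma 3 | 27B |\n"
    "| QwQ | 32B |\n"
)
html_table = (
    "<table>\n"
    "<tr><th>Model</th><th>Params</th></tr>\n"
    "<tr><td>Gemma 3</td><td>27B</td></tr>\n"
    "<tr><td>QwQ</td><td>32B</td></tr>\n"
    "</table>\n"
)

# cl100k_base is a stand-in; open weights models ship their own tokenizers.
enc = tiktoken.get_encoding("cl100k_base")
for name, text in [("Markdown", markdown_table), ("HTML", html_table)]:
    print(f"{name}: {len(enc.encode(text))} tokens")

Token cost is only half the answer, though: whether the cheaper format also yields better responses depends on what the model was trained on, which is currently guesswork.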


Regarding its new QwQ-32B reasoning model, Alibaba reveals that the code generated during reinforcement learning was checked for correctness with software tests. I could well imagine an open source community making a valuable contribution here, covering more programming languages or refining the evaluation with quality metrics. Those of us who are older in particular, for whom AI is still new and Python is not the programming language of first choice, could support the next generation here.
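To illustrate the principle, here is a minimal sketch of such test-based checking, the kind of signal that can be fed back as a reward during reinforcement learning. All names and details are illustrative; Alibaba has not published its actual training code, which is precisely the complaint.

import os
import subprocess
import sys
import tempfile

def test_reward(generated_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Score generated code by whether it passes its tests (1.0 or 0.0)."""
    # Write the solution and its assertion-based tests into one throwaway script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # treat hangs as failures
    finally:
        os.unlink(path)

# Example: a model-generated solution plus its tests.
generated = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(test_reward(generated, tests))  # prints 1.0

Covering another programming language would then mean little more than swapping the interpreter invocation and the test harness, exactly the kind of incremental work an open source community is good at.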

And one more thing: please don't just publish your data and code on Hugging Face and GitHub, but do it as an open source project in a vendor-independent location, for example at the Apache Foundation or the Eclipse Foundation. This makes it more attractive for others to get involved, and nothing disappears if a company is bought out or changes its mind. Open source is so much more than freeware, for all of us. I hope you will join us!

(mho)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.