How AI is changing open source development
AI affects the development of open source software in many areas. It offers opportunities, but also confronts the community with new challenges.
- Holger Voormann
AI is shaking up the communities of open source projects and confronting developers with important questions. May open source code be used without restriction to train AI models? Where does AI help open source projects and where does it harm them – in answering questions about the code, in programming, and in project management? What is open washing, and what role does open source play in the EU AI Act? And finally, the crucial question: will AI replace open source? The community itself provides answers. The videos linked in the text – presentations from open source conferences of the last six months: the Open Community Experience (OCX) with its sub-conferences EclipseCon and OSGi Summit, 38C3, and FOSDEM 2025 – offer deeper insights. Representatives of the Eclipse Foundation, the organizer of OCX, also answer these questions.
Open source code as AI training data: (il)legal?
Code makes up a large proportion of the data used to train today's AI models. It is needed not only to make the models fit for programming tasks, but also to improve their reasoning capabilities. Instead of answering directly, some chatbot models can also generate Python code that computes part of the answer and runs in a sandbox – so that it cannot cause any damage. The model then produces the actual answer from the output of the program run. Such a code interpreter is particularly useful for questions whose answers require more complex calculations.
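The following Python sketch illustrates the code-interpreter pattern in deliberately simplified form. The function names are made up for this example, and a real system would let the model write the program and execute it in a genuinely isolated sandbox:

```python
# Simplified sketch of the code-interpreter pattern; names are illustrative.
import io
import contextlib

def generate_code(question: str) -> str:
    # In a real system, the language model writes this program;
    # here it is hard-coded for the example question.
    return "print(sum(i * i for i in range(1, 101)))"

def run_in_sandbox(code: str) -> str:
    # Stand-in for an isolated runtime: exec() is NOT a real sandbox
    # and must never be used on untrusted code.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()

question = "What is the sum of the squares of the numbers from 1 to 100?"
output = run_in_sandbox(generate_code(question))  # "338350"
# The model would now phrase its final answer around this output:
print(f"The sum of the squares from 1 to 100 is {output}.")
```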
The code used to train the models is – how could it be otherwise – open source: freely available in large quantities on the net. Meta's Llama 3 base models, for example, were trained on 15 trillion tokens, corresponding to more than 10 trillion words, 17 percent of which is code. But how is this legal? Is a large language model a work derived from the code on which it was trained? And if so, is it subject to that code's license conditions?
The legal situation is unclear. Hugging Face's BigCode (see OCX talk), an extensive collection of source code gathered from public GitHub repositories, contains no code under the Eclipse Public License (EPL), only code under the Apache, MIT, and similar licenses. Unlike the EPL, these licenses do not require derived works to be published under the same license. Mike Milinkovich, Executive Director of the Eclipse Foundation, does not expect the EPL to be violated, but cannot rule it out with certainty either. It could take a few years before the AI question is settled: copyright law is largely uniform from country to country but differs in detail, and there is still no strong consensus on whether a trained model constitutes a derivative work or fair use. An EPL version 3.0 that explicitly permits or prohibits use as training data is not planned and would not apply retroactively to existing code anyway.
Regardless of whether use as training data is legally covered by fair use, the question arises whether it is morally acceptable to train AI models on open source code without asking. Those who publish their code as open source usually do so in the hope of receiving contributions from others: bug reports, suggestions for improvement, and occasional code contributions through to active participation in the project. A purpose limitation or a ban on commercial use tends to get in the way here: restrictive terms of use shrink the circle of potential users, from whom the contributors are in turn recruited. Moving an open source project to a vendor-neutral home, as the Eclipse Foundation and other open source organizations offer, means giving up even more control, but makes participation more attractive for companies and individuals. As long as not even everyone who uses open source directly contributes, it is difficult to ask those who use the code indirectly as training data to give something back.
Are chatbots making work easier or adding to the workload?
But it is not just about taking training data, it is also about giving: chatbots relieve developers by answering questions about open source projects – questions that would otherwise have been asked in project forums or on websites such as Stack Overflow and answered there by the developers themselves or by other users. Smaller models can even run on your own computer without an expensive graphics card if you accept a loss of speed: for example with one of the two well-known open source command-line tools, Apple's MLX for Apple computers or llama.cpp for macOS, Linux, and Windows; with Ollama, which simplifies the management of models; or with llamafile, which packs the model and the code to run it into a single file and, like Ollama, builds on llama.cpp.
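What this looks like in practice can be sketched with Ollama, which serves local models via a REST API on port 11434 by default. The following Python snippet assumes that Ollama is running and that a model – the name "llama3" is only an example – has already been pulled:

```python
# Query a locally running Ollama server; assumes a pulled model named "llama3".
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",  # example model name
    "prompt": "What does the Eclipse Public License require for derived works?",
    "stream": False,    # return one complete answer instead of a token stream
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.load(response)["response"])
```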
Chatbots are outpacing Stack Overflow: since the release of ChatGPT at the end of November 2022, the number of new questions and answers on Stack Overflow has halved every year. And even the questions still being asked sometimes refer to chatbot solutions that do not work and to things that do not exist and were probably hallucinated by an AI. Even if chatbots are sometimes wrong, they usually answer faster and in a friendlier tone than humans on Stack Overflow and elsewhere, or than a web search. "Let me ask ChatGPT for you" is the new "Let me google that for you".
The downside of these AI helpers is that they also generate Stack Overflow answers and contributions to open source projects – for example to Curl – that turn out to be useless only on closer inspection and thus cause unnecessary effort. Another disadvantage of AI-generated solutions is that they tend to favor older rather than current frameworks and tools. Newer information is taken into account if it is included in a query. But for the AI to apply it in a meaningful way, that is, to generalize from it, it must first have been trained on the relevant data in sufficient quantity and quality.
If questions are put to chatbots rather than asked publicly, project maintainers also no longer learn what problems others are having with their software. Not only the missing feedback could prove problematic for the projects, but also the shrinking pool of publicly available questions and answers for training future models. In the long term, it is therefore in the interest of both sides to find a way to repair this broken feedback loop.
AI support for programming
In addition to general chatbots, there is AI support specifically for programming: for generating code, code comments, and tests, as well as for finding errors in and improving existing code. Integrated directly into the code editor – as a chatbot or as code completion that derives context-dependent suggestions from the surrounding code, whether specifically trained for this or offered alongside the chat interaction – these tools also help when working with open source frameworks. Because of the added context, the queries are usually longer and the computing effort correspondingly higher. Freely usable offerings are rare, and even they require registration and are limited to a certain number of code completion suggestions and chat requests per month.
There are open source tools that position themselves as alternatives to the market leader GitHub Copilot, and others that build on Copilot to offer better tooling. Eclipse Theia is an alternative to Visual Studio Code with GitHub Copilot (more on this in a blog post and OCX presentation by EclipseSource): requests sent to external services are logged and can be inspected, and agents can be defined that specify exactly which additional information is included. Besides GitHub Copilot, models installed locally on the computer can also be used – something that has only recently become possible with GitHub Copilot itself.
One example of tooling improved with the help of GitHub Copilot is the Visual Studio Code extension Spring Tools, which supports programming in Java with the Spring web framework (see OCX presentation). "Explain ... with Copilot" links appear at certain points in the code. If you click, for example, on the "Explain Query with Copilot" link that Spring Tools displays for Spring-specific annotations containing an SQL query, you receive the explanation generated by Copilot without having to formulate a question yourself. For manually written questions, Spring Tools can add important information to the prompt before it is sent to Copilot – for example, that a project with Spring Boot 3 uses Jakarta EE and no longer Java EE.
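Spring Tools' actual implementation is not shown here, but the general pattern of such prompt enrichment can be sketched in a few lines of Python; the function name and the example facts are purely illustrative:

```python
# Illustrative sketch of prompt enrichment, not Spring Tools' implementation.

def enrich_prompt(user_question: str, project_facts: list[str]) -> str:
    """Prepend project-specific context so the assistant answers correctly."""
    context = "\n".join(f"- {fact}" for fact in project_facts)
    return (
        "Answer for a project with the following constraints:\n"
        f"{context}\n\n"
        f"Question: {user_question}"
    )

facts = [
    "The project uses Spring Boot 3.",
    "Use Jakarta EE (jakarta.*) imports, not Java EE (javax.*).",
]
prompt = enrich_prompt("How do I declare a servlet filter?", facts)
# `prompt` would now be sent to the assistant, e.g. GitHub Copilot.
print(prompt)
```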
Spring Tools can also add buttons to the response sent back by Copilot, for example "Apply Changes" at the end of a response with several code fragments, which transfers all of them to the right places in the project with a single click. One difficulty of Copilot-based tooling is that Copilot is constantly being developed further: the same "Explain ... with Copilot" link that produces a good explanation today may no longer work tomorrow. Another difficulty arises when Copilot does not know what is being asked about because the framework in use is too new or Copilot's knowledge is too old. The length limit on prompts can also become a problem when enriching them.
Following the great success of the first betterCode() GenAI, the online conference on AI-supported software development will take place again on June 26.
The organizers, iX and dpunkt.verlag, have updated the conference program and refined it based on feedback. It offers the following presentations:
- Software development with Copilot, ChatGPT and Co
- What's new in AI coding tools?
- Testing software with AI support
- Defeating dinosaurs with ChatGPT – LLMs for analyzing old systems
- Strengths and weaknesses of AI-supported, secure software development
- Legal aspects of AI-supported software development