How well LLMs program Java and Go

The fact that AI can program is nothing new – but how well it can do it remains unclear. Symflower is now attempting a systematic comparison.

Laptop with chatbot and speech bubbles (Image: iX)

By Prof. Christian Winkler

There are new insights into how well LLMs can program – until now, it has been difficult to measure how reliably language models work. They often deliver good results, sometimes none at all, and in rare cases even wrong ones – and these hallucinations are often formulated very convincingly. For generic language models, benchmarks involving humans have therefore been established, for example the Chatbot Arena from LMSYS. In parallel, individual quality criteria are measured and then aggregated into leaderboards.

For more specialized language models, this can be done much more systematically. LLMs that generate program code are an obvious candidate, because the code they produce can be checked both syntactically and semantically. Symflower, a provider of software for automatic test generation, has investigated precisely this and written an entire blog series on the subject. The findings provide interesting insights into the performance of LLMs.

However, there are a few limitations: code generation is only examined for Java and Go. Other widely used programming languages such as Python and JavaScript have not yet been considered, so it is unclear whether the results carry over to them. It is conceivable that even better results could be achieved there due to the larger volume of available code.

In earlier parts of the blog series, tests were only generated for simple, "empty" classes. This has now been expanded significantly and the scenario made more complicated: the LLMs had to generate tests for 23 real-world programming examples. The results are revealing:

  • Only 58 percent of the results compiled at all (only ten models exceeded 80 percent). Manual rework is therefore required. This metric is easy to measure for compiled languages, but would be difficult to capture for Python and JavaScript (see the sketch after this list).
  • Some models did not produce any compilable code at all – comparable to a human programmer who only writes syntactically incorrect code.
  • Most syntax errors were rather trivial and can be fixed quickly with IDE support.
  • For Java, three models (gpt-4o, deepseek-coder, claude-3-opus) always generated compilable code. No model managed this for Go, which is certainly due to the smaller training set.
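
To illustrate why compilability is such a convenient metric for compiled languages, the following minimal sketch checks whether a generated Java test file compiles at all. The file name is made up, and Symflower's actual benchmark harness is more elaborate; this merely shows the principle of the measurement.

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;
    import java.nio.file.Path;

    // Minimal compilability check for a single generated test file.
    // Illustration only: Symflower's benchmark works differently in detail.
    public class CompileCheck {

        static boolean compiles(Path generatedTest) {
            // Returns the compiler bundled with the JDK (null if only a JRE is installed).
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system Java compiler available");
            }
            // run() returns 0 on successful compilation, non-zero otherwise.
            return compiler.run(null, null, null, generatedTest.toString()) == 0;
        }

        public static void main(String[] args) {
            // Hypothetical path to an LLM-generated test class.
            Path file = Path.of("GeneratedExampleTest.java");
            System.out.println(compiles(file) ? "compilable" : "not compilable");
        }
    }

Counting compilable files across all generated tests then yields exactly the kind of percentage reported above.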

So programmers are still necessary in any case. It is astonishing how much better code generation works for Java than for Go. The larger amount of training data gives hope that it could also work well for Python and JavaScript. However, the metrics are harder to determine there because the code does not have to be compiled, and dynamic typing can lead to further errors that have to be checked manually.

The models handle exceptions differently: a generated test can either catch them explicitly or simply let the test fail when one is thrown. Human programmers use both strategies, so the models have evidently learned both.
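
In a JUnit 5 test, the two strategies look roughly like this; the Parser class and both test methods are invented for illustration and are not taken from the benchmark.

    import static org.junit.jupiter.api.Assertions.assertThrows;

    import org.junit.jupiter.api.Test;

    // Tiny made-up class under test.
    class Parser {
        static int parse(String s) {
            return Integer.parseInt(s); // throws NumberFormatException on invalid input
        }
    }

    class ParserTest {

        // Strategy 1: the test explicitly expects and catches the exception.
        @Test
        void rejectsInvalidInput() {
            assertThrows(NumberFormatException.class, () -> Parser.parse("not a number"));
        }

        // Strategy 2: no handling at all; if parse() throws unexpectedly, the test simply fails.
        @Test
        void parsesValidInput() {
            Parser.parse("42");
        }
    }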

The ranking of the models is interesting. Compared with the last test, Symflower has changed and optimized the scoring somewhat. This process is not yet complete: some models achieve higher coverage, while others produce more compilable tests. The score will therefore be revised again for the next iteration.

Finally, the article looks at how efficient the generated tests are. In some cases, the LLMs generate synchronous instead of asynchronous methods, which leads to considerably longer runtimes. Unfavorable handling of permutations combined with the corresponding logging makes some tests produce huge log files. Both can easily cause entire test suites to stop running, jeopardizing the stability of the whole software.
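
A minimal sketch of the runtime effect, assuming a made-up slowOperation() that stands in for any call taking about a second: invoked synchronously in a loop, ten calls take around ten seconds; dispatched asynchronously, they overlap and finish far sooner.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.IntStream;

    public class RuntimeSketch {

        // Stand-in for any slow call, e.g. an I/O operation taking ~1 second.
        static int slowOperation(int i) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return i;
        }

        public static void main(String[] args) {
            // Synchronous variant: ten sequential calls take roughly ten seconds.
            for (int i = 0; i < 10; i++) {
                slowOperation(i);
            }

            // Asynchronous variant: the same ten calls run concurrently and
            // finish much faster, limited only by the common pool's parallelism.
            List<CompletableFuture<Integer>> futures = IntStream.range(0, 10)
                    .mapToObj(i -> CompletableFuture.supplyAsync(() -> slowOperation(i)))
                    .toList();
            futures.forEach(CompletableFuture::join);
        }
    }

Multiplied across an entire generated suite, such accidental serialization is what makes test runs grind to a halt.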

The article documents exactly how the models were selected, which sandboxes were used, and much more – useful for anyone who wants to try the tests themselves. The technical description provides an easy introduction to the framework, which is published on GitHub.

(olb)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.