AI accelerates code, but delays tests
Generative AI speeds up programming, but verifying the code becomes more complex. This shifts the bottleneck.
- Harald Weiss
Generative AI has significantly increased productivity in programming. A study by GitHub Research shows that, in controlled experiments, developers complete programming tasks around 55 percent faster with AI assistance than without it. However, faster code writing does not automatically mean that software projects progress faster overall. An investigation by the research institute METR (Model Evaluation and Threat Research) shows that experienced developers working with AI tools in familiar code environments take on average 19 percent longer – primarily due to additional checking and correction steps. One reason: when errors occur, developers first have to familiarize themselves with the AI-generated code before they can fix it.
Since testing, debugging, and verification already account for about half of the time spent in conventional software development, delays in these tasks have a particularly strong effect on project duration. With AI tools in programming, the bottleneck shifts: generating code becomes easier and faster, while proving that it works correctly and can be released remains complex. This weighs all the more heavily because the cost of errors rises with each later project phase. A frequently cited analysis by the IBM Systems Sciences Institute quantifies the difference: fixing an error during implementation costs six times as much as fixing it in the design phase; during testing the factor rises to 15, and in production to as much as 100.

This becomes a problem especially in complex enterprise systems. Modern applications consist of many services, application programming interfaces (APIs), and data sources, so a change in one place can trigger unexpected side effects in many others. The faster AI tools produce new code, the more frequently such interactions occur – and with them additional sources of error.
When tests become probabilistic
To mitigate this bottleneck, AI systems are increasingly coming onto the market that are intended to test new or modified programs faster. However, this leads to a fundamental change in testing methodology. Classic software testing follows a deterministic model: same input, same program, identical output. This is the basis for test runs in which functions are called with defined parameters and must deliver exactly the expected results. With AI systems, however, this principle applies only to a limited extent. Large language models and other generative processes work on the basis of statistical probabilities and deliver results within a range of possible answers. The quality of such a system can therefore no longer be checked solely with yes-no tests; the crucial factor is whether its behavior remains within acceptable limits. This also shifts the focus of quality assurance (QA). Instead of complete test coverage, a risk-based approach comes to the fore: teams test critical functions and interfaces more intensively and less relevant parts with less depth. The goal is not mathematical completeness but a reliable assessment of the residual risk.
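The difference between yes-no tests and limit-based tests can be sketched in a few lines. The following example is illustrative, not taken from any particular product: instead of asserting one exact output, it samples a nondeterministic component repeatedly and checks that aggregate behavior stays within agreed thresholds. The function `relevance_score` is a hypothetical stand-in for a model call, simulated here with a seeded random generator.

```python
import random
import statistics

def relevance_score(prompt: str, rng: random.Random) -> float:
    # Hypothetical stand-in for a nondeterministic model call:
    # returns a quality score in roughly [0.75, 0.95].
    return 0.85 + rng.uniform(-0.1, 0.1)

def test_within_acceptable_limits() -> bool:
    """Limit-based test: sample the system repeatedly and check
    that the behavior stays within agreed bounds, rather than
    asserting a single exact output."""
    rng = random.Random(42)  # fixed seed keeps the test reproducible
    scores = [relevance_score("reset password", rng) for _ in range(50)]
    mean = statistics.mean(scores)
    worst = min(scores)
    # Acceptance criteria: project-specific thresholds, not exact values.
    return mean >= 0.8 and worst >= 0.7

print(test_within_acceptable_limits())
```

The thresholds themselves (here 0.8 and 0.7) encode the risk-based judgment the text describes: what counts as "acceptable" is a team decision, not a mathematical given.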
Videos by heise
Among the providers of AI-supported tools are Keysight Eggplant, SmartBear, OpenText, and Tricentis. The latter recently introduced an "Agentic Quality Engineering Platform," in which autonomously acting AI agents take over quality assurance tasks. It supports, among other things, the SAP GUI and web applications. The platform uses generative AI to create test cases, prioritize existing tests, and summarize the results of large test runs. Technically, it is less about "AI testing software" and more about supporting typical QA work steps: analyzing code changes, selecting relevant tests, grouping error messages, or condensing extensive log files. The approach targets one of the most time-consuming steps in the testing process: evaluating large amounts of test results. In continuous integration environments, several thousand tests often run per commit, and developers then have to interpret the results. AI tools can help to recognize patterns faster and narrow down the causes of errors. The benefit thus shifts from generating individual tests to organizing and evaluating entire test landscapes.
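Grouping error messages and condensing log files, as described above, can be approximated without any AI at all by normalizing failure messages into signatures and counting them. The sketch below is a minimal illustration of the idea, not the method any of the named products actually uses; the log lines are invented.

```python
import re
from collections import Counter

def failure_signature(message: str) -> str:
    """Normalize a raw failure message into a signature so that
    failures with the same root cause group together: replace
    memory addresses and numbers (timings, test IDs) with tokens."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<addr>", message)
    sig = re.sub(r"\d+", "<n>", sig)
    return sig.strip()

def summarize_failures(messages):
    """Condense many raw failures into a ranked list of distinct
    signatures, most frequent first."""
    counts = Counter(failure_signature(m) for m in messages)
    return counts.most_common()

# Invented example log lines: two timeouts share one root cause.
logs = [
    "TimeoutError after 5000 ms in checkout test 17",
    "TimeoutError after 5003 ms in checkout test 42",
    "AssertionError: expected 200 got 500 in api test 3",
]
for signature, count in summarize_failures(logs):
    print(count, signature)
```

Where generative AI goes beyond such rule-based grouping is in clustering failures whose messages differ in wording but share a cause, and in summarizing the resulting groups in natural language.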
What AI doesn't solve in testing
Despite these developments, the reach of AI-based testing remains limited. Many aspects of software quality cannot be derived from test runs. These include security problems, structural code weaknesses, and the avoidance of technical debt – that is, postponed maintenance and modernization work that increases development effort in the long term. Such issues concern the architecture of an application, not just its runtime behavior. Static code analysis, which examines source code for errors and vulnerabilities without executing it, security audits, and classic code reviews therefore remain necessary. Generative AI can at best provide hints here; it does not replace systematic analysis. Especially for security-critical applications, automated AI testing alone is not sufficient quality assurance.
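What it means to examine source code "without executing it" can be shown in a few lines. The following minimal sketch (illustrative only, the analyzed snippet is invented) parses Python source into a syntax tree and flags bare `except:` handlers – a structural weakness that runtime tests rarely expose, because the code may pass every test while silently swallowing errors.

```python
import ast

SOURCE = """
def load_config(path):
    try:
        return open(path).read()
    except:
        pass
"""

def find_bare_excepts(source: str):
    """Static check: walk the syntax tree without running the code
    and report every bare `except:` handler."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # An ExceptHandler with no exception type is a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"line {node.lineno}: bare except hides errors")
    return findings

for finding in find_bare_excepts(SOURCE):
    print(finding)
```

Real static analyzers apply hundreds of such rules plus data-flow analysis, but the principle is the same: the finding comes from the code's structure, not from observing its behavior.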
These limitations lead to a fundamental principle of modern development processes: the human remains part of the decision chain. Automated tests can provide hints and evaluate large amounts of data, but the release of a version remains a risk assessment. In many companies, therefore, developers or QA teams still decide whether a new software version can go into production. AI can accelerate this process by condensing information and taking over routine tasks. However, the responsibility for the release remains with humans. The risks of insufficient control were recently demonstrated by outages at Amazon caused by AI tools, after which the company introduced stricter testing mechanisms.
Conclusion
Generative AI significantly speeds up code writing but simultaneously increases the effort required for its verification. More generated code means more variants, more integration points, and thus more potential error risks. AI can help in generating test cases and analyzing large test runs, but it does not solve all quality problems. Issues of security, architecture, and technical debt remain the responsibility of developers and review processes. The key remains: whether a software version can go into production is a human decision.
(mki)