Briefly explained: What's behind the buzzword AI agents
As agentic AI, large language models are supposed to handle tasks without precise specifications. So far, model providers have promised more than their systems deliver.
(Image: Anggalih Prasetya/Shutterstock.com)
While the reasoning capabilities of large language models (LLMs) were the big theme of 2024, model providers are now emphasizing the agentic nature of their systems: the models are meant to solve complex tasks autonomously, consulting other LLMs or external tools on their own. Those tools range from the browser and the calculator app to the document repository and the development environment.
According to the marketing promises, autonomous agent LLMs are set to revolutionize today's working world, complete the digitalization of business and public administration, and ultimately replace people in many activities, which saves costs. Agents are meant not only for tedious, repetitive tasks but also for complex business processes, software development, and research. While the economy, from SMEs to megacorporations, is supposed to experience a growth spurt as a result, the systems would free employees up for the tasks that have so far constantly fallen by the wayside. The bottom line, say the idea's proponents: the technology should make everyone more efficient and offset the shortage of skilled workers.
Orientation guide: the levels of autonomous driving
The capabilities of agent-based LLMs are best approximated by comparing them with the levels of autonomous driving (see box). Advertising and release notes promise fully automated AI systems (level 4). If you believe the hype, Artificial General Intelligence (AGI) will make humans obsolete by GPT-5 at the latest (level 5). Field reports sound more like automation somewhere between levels 2 and 3: users must be able to intervene at any time when the language models lose their way, or the models ask the users for certain decisions and then wait until the humans have, in good faith, entered their credit card details and passwords.
Level 0: Manual driving.
Level 1: Assisted. The driver steers; the vehicle system takes over activities such as braking, turning or accelerating.
Level 2: Partially automated. A human must monitor the system constantly and intervene when in doubt.
Level 3: Highly automated. A human must be present but need not monitor the system permanently.
Level 4: Fully automated. In specific situations, the system handles all driving tasks on its own.
Whether the language models actually deliver boils down to a question of faith. Fans of the technology point to their productivity gains, show off successful prototypes, or post impressive AI results on social networks. Critics, on the other hand, see LLMs as stochastic word-dice machines that perform better in some areas than in others but usually disappoint at the end of the day – unless you keep rolling the dice until the result is presentable enough, with every roll incurring costs.
Quality usually demonstrated anecdotally rather than measured
The common benchmarks for large language models (such as GPQA, AIME, SWE-bench or MMLU) offer more structured findings. Their test fields cover programming, research, and expert knowledge in the natural sciences. Closed models hit new highs with every release, while open models come close to their proprietary competitors; in both cases the margin is two to three percentage points. While the benchmarks generally attest to the LLMs' good to very good general capabilities, the exact figures should be treated with caution: it has long been suspected that providers train their models specifically for the tests ("benchmaxing"), and the training data is not disclosed for any flagship model. Then there is LMArena, where people blind-test the style and quality of models on the same prompt; a leaderboard expresses the results as Elo ratings. Here, too, providers have recently gamed the system with particularly pleasing model variants, but trends can still be quantified.
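The Elo system behind such leaderboards comes from chess: each blind comparison shifts the two models' ratings by an amount that depends on how surprising the outcome was. A minimal sketch (the K-factor of 32 is illustrative; LMArena's actual methodology is more involved):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    # Expected score of A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    # A zero-sum update: whatever A gains, B loses.
    return r_a + delta, r_b - delta

# Two models start level at 1000; model A wins one blind vote.
a, b = elo_update(1000, 1000, 1.0)  # -> (1016.0, 984.0)
```

An upset against a much higher-rated model moves the ratings far more than a win that was expected anyway, which is why trends stabilize after enough votes.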
So far, the quality of the models in production is backed only by anecdotal evidence: no company is forthcoming with measurements of efficiency gains, and for programming those gains are currently disputed. The latest quarterly figures from Meta and Microsoft, while not representative of German SMEs, are a good indicator of the state of the industry. Meta earns its money almost exclusively from advertising, while Microsoft is growing particularly strongly in the Azure cloud segment, which includes AI workloads but is not broken down further. It is safe to assume that Microsoft, in particular, would rub high profits from LLMs and other AI products in its competitors' and shareholders' faces.
Buy now and get semi-automated workflows
Meanwhile, there are technologies that raise the output quality and the business value of large language models. With Retrieval Augmented Generation (RAG), the language models are grounded in the right problem domain with a company's own documents, which can reduce hallucinations. Agent frameworks and, most recently, the Model Context Protocol (MCP) provide structured ways to link language models with each other or with all manner of tools. Here, however, the models are handed a selection; they cannot pick arbitrary tools autonomously. These constructs have to prove themselves in production: there are positive field reports, but no benchmarks or figures. Whether scaling such applications pays off also only becomes apparent in production.
Anyone who buys an agentic AI product for corporate use today will most likely get semi-automated processes; in the best case, the corporate structure fits and employees only have to monitor the automation. This does not always require LLMs: given the current level of digitalization in administration and SMEs, there is still much to be gained with traditional means. Even the analysts at Gartner, who like to push and sell hypes themselves, warn that of 1,000 examined AI-agent products, only 130 were more than hot air.
Conclusion
The strengths of the large language models lie in text work, document search and summarizing content. While the pure language skills of LLMs are beyond question, very good results in programming range between 30 and 90 percent, depending on benchmark and model, and between 50 and 85 percent in research and the natural sciences. Of course, not every model achieves the same score on these tasks. The LLMs stumble particularly on exotic edge cases – precisely the problems for which skilled workers and domain experts are employed.
LLMs contain a huge range of knowledge on a wide variety of topics; after all, the top models from the major providers are trained on practically all digitally available human knowledge and can be queried accordingly. Whether LLMs, working with the imprecise tool of language, can reach a clear result via statistical approximation remains to be seen, however. Is a system that speaks convincingly but is correct in only half or three quarters of cases sufficient for autonomous problem-solving? Or are these systems, like self-driving cars, stuck at level 3 of autonomy?
(pst)