AI agent development is bypassing the job market

A study analyzed 43 benchmarks: AI agents are tested almost exclusively on programming tasks, leaving out many other potential areas of application.

AI agents (Illustration). (Image: heise medien)


The development of AI agents is heavily focused on programming tasks and inadequately reflects the demands of the real job market. This is the central finding of a study by researchers from Stanford University and Carnegie Mellon University.

For the study, published on arXiv, the team led by Zora Z. Wang analyzed 43 common benchmarks comprising a total of 72,342 tasks and mapped them to 1,016 occupations in the US labor market. The occupations are drawn from the US government's O*NET occupational taxonomy, which classifies professional activities by factors such as field of work and required skills.

The result is sobering: The benchmarks predominantly test AI agents in the "Computer and Mathematical" work field – an occupational category that accounts for only 7.6 percent of US employment. In contrast, the requirements of highly digitized and economically significant fields such as management, law, architecture, and engineering are hardly covered.

A comparable pattern emerges for the tested skills: Narrow activities like "Getting Information" and "Working with Computers" are overrepresented, although they constitute only a small portion of employment. The category "Interacting with Others," which is central to many professions, is almost entirely missing from the benchmarks.

Overall, the 43 investigated benchmarks cover 56.5 percent of the work field taxonomy and 85.4 percent of the skills taxonomy. The benchmark GDPval is the most broadly positioned, with 47.8 percent domain coverage and 58.5 percent skills coverage.


The analysis also shows that the performance of AI agents declines markedly as task complexity increases, especially on tasks involving information processing and interpersonal interaction. This aligns with other recent findings: the LiveAgentBench benchmark, for instance, found that agents with tool access could solve only 24 percent of 104 practical tasks, while humans achieved 69 percent.

Based on their findings, the researchers derive three principles for future benchmarks: They should offer broader coverage of real-world occupational domains and skills, include more realistic and complex task assignments, and utilize fine-grained evaluation criteria. Without such a reorientation, there is a risk that AI agent development will bypass economically and socially relevant areas of application.

(odi)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.