Confronting Reality: New AI Benchmark OfficeQA
Databricks introduces OfficeQA: An open-source benchmark that tests AI agents in realistic business scenarios.
(Image: heise medien)
- Prof. Jonas Härtfelder
With OfficeQA, Databricks introduces a new open-source benchmark designed to fill a gap in the evaluation of large language models and AI agents. Unlike popular tests such as ARC-AGI-2, Humanity’s Last Exam, or GDPval, OfficeQA does not focus on abstract reasoning tasks but on realistic scenarios from everyday business operations – areas where errors can be costly.
The focus is on so-called grounded reasoning: AI systems must answer questions based on large, heterogeneous, and sometimes unstructured document collections. For this, Databricks draws on the U.S. Treasury Bulletins – nearly 89,000 pages of tables, revisions, and historical data spanning over eight decades. The benchmark includes 246 questions with clearly verifiable answers, divided into “easy” and “hard” depending on how current frontier models perform.
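The article does not specify the benchmark's data format or scoring harness, but "clearly verifiable answers" usually means automatic exact-match scoring of short, normalized answers. The following is a minimal sketch of such an evaluation, assuming a hypothetical JSONL file in which each record carries a question ID, a reference answer, and an "easy"/"hard" difficulty label; the field names and normalization rules are illustrative, not taken from OfficeQA.

```python
import json
from collections import defaultdict

def normalize(answer: str) -> str:
    """Illustrative normalization: trim whitespace, lower-case,
    and strip thousands separators so '1,234' matches '1234'."""
    return answer.strip().lower().replace(",", "")

def score(benchmark_path: str, predictions: dict[str, str]) -> dict[str, float]:
    """Exact-match accuracy per difficulty tier.

    Assumes a hypothetical JSONL layout with the fields
    'id', 'answer' and 'difficulty' ('easy'/'hard');
    OfficeQA's real schema may differ.
    """
    correct, total = defaultdict(int), defaultdict(int)
    with open(benchmark_path, encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            tier = item["difficulty"]
            total[tier] += 1
            predicted = predictions.get(item["id"], "")
            if normalize(predicted) == normalize(item["answer"]):
                correct[tier] += 1
    return {tier: correct[tier] / total[tier] for tier in total}

# Example: predictions produced by some agent, keyed by question ID.
# print(score("officeqa.jsonl", {"q-001": "4.2 billion"}))
```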
(Image: Databricks)
The results are sobering. Without access to the document corpus, the tested agents – including a GPT-5.1 agent and a Claude Opus 4.5 agent – answer only about two percent of the questions correctly. Even with the PDFs provided, accuracy stays below 45 percent, and on the particularly difficult questions it drops below 25 percent. The results suggest that strong performance on academic benchmarks says little about readiness for enterprise deployment.
“Almost right” is not good enough in business
The error analysis reveals familiar but unresolved problems: parsing failures on complex tables, poor handling of repeatedly revised financial figures, and weaknesses in visually interpreting charts. What matters here is less the existence of these shortcomings than their impact: in a business context, "almost right" is not good enough – an incorrect key figure or an outdated value can have serious consequences.
(Image: Databricks)
OfficeQA is thus intended less as a showcase of performance than as a diagnostic tool. Its consistent focus on realistic documents and clearly, automatically verifiable answers stands out. At the same time, it remains an open question how representative a single, albeit extensive, data corpus is of the diversity of internal company information sources. The new benchmark still has to prove itself in further scenarios. Precisely for this reason, Databricks is launching the Grounded Reasoning Cup 2026: researchers and industry partners are invited to test OfficeQA beyond the Treasury example and thereby contribute to broader acceptance and further development of the approach.
Databricks provides the OfficeQA benchmark to the research community free of charge as an open-source project; it is available via a public GitHub repository.
(vbr)