Confronting Reality: New AI Benchmark OfficeQA

Databricks introduces OfficeQA: An open-source benchmark that tests AI agents in realistic business scenarios.

By Prof. Jonas Härtfelder

With OfficeQA, Databricks introduces a new open-source benchmark designed to fill a gap in the evaluation of large language models and AI agents. Unlike popular tests such as ARC-AGI-2, Humanity’s Last Exam, or GDPval, OfficeQA does not focus on abstract reasoning tasks but on realistic scenarios from everyday business operations – areas where errors can be costly.

The focus is on so-called grounded reasoning: AI systems must answer questions based on large, heterogeneous, and sometimes unstructured document collections. For this, Databricks draws on the U.S. Treasury Bulletins – nearly 89,000 pages of tables, revisions, and historical data spanning over eight decades. The benchmark includes 246 questions with clearly verifiable answers, divided into “easy” and “hard” depending on how current frontier models perform.
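What “clearly verifiable” means in practice: each question comes with a short reference answer that can be checked automatically against a model’s output. A minimal sketch of such a check could look like the following Python snippet; the JSONL layout and field names are assumptions for illustration, not the published OfficeQA schema.

```python
import json


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so equivalent answers compare equal."""
    return " ".join(answer.strip().lower().split())


def score(predictions: dict[str, str], benchmark_path: str) -> float:
    """Return the fraction of questions answered with an exact (normalized) match.

    Assumes a JSONL file with "id", "question", "answer" and "difficulty"
    fields per line -- a hypothetical layout used here for illustration only.
    """
    correct = total = 0
    with open(benchmark_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            if normalize(predictions.get(item["id"], "")) == normalize(item["answer"]):
                correct += 1
    return correct / total if total else 0.0
```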

Anthropic’s Claude Opus 4.5 agent solved 37.4 percent of the questions, while OpenAI’s GPT-5.1 agent reached 43.1 percent on the full dataset. On OfficeQA-Hard, a subset of 113 particularly difficult examples, the Claude Opus 4.5 agent scored 21.1 percent and the GPT-5.1 agent 24.8 percent.

(Image: Databricks)

The results are sobering. Without access to the document corpus, the tested agents – including a GPT-5.1 agent and a Claude Opus 4.5 agent – answer only about two percent of the questions correctly. Even with the PDFs provided, accuracy stays below 45 percent, and on the particularly difficult questions it drops below 25 percent. The figures suggest that strong performance on academic benchmarks says little about readiness for enterprise deployment.


The error analysis reveals familiar but unresolved problems: parsing errors in complex tables, inadequate handling of repeatedly revised financial data, and deficits in the visual understanding of charts. What matters here is less the existence of these weaknesses than their impact: in a business context, “almost right” is not good enough – incorrect key figures or outdated values can have serious consequences.
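The table problem in particular is easy to reproduce outside the benchmark. The following sketch uses the pdfplumber library (not part of OfficeQA; the file name and page index are placeholders) to pull tables from a Bulletin-style PDF; with multi-level headers and spanned cells, the extracted rows routinely contain empty or shifted values, and any agent reasoning over them inherits those errors.

```python
import pdfplumber

# Placeholder file name; any Treasury Bulletin PDF with dense tables will do.
PDF_PATH = "treasury_bulletin_1990_09.pdf"

with pdfplumber.open(PDF_PATH) as pdf:
    page = pdf.pages[4]  # page 5, zero-indexed
    for table in page.extract_tables():
        for row in table:
            # Spanned headers and nested columns often surface here as None
            # cells or values shifted into the wrong column -- exactly the
            # kind of parsing error the benchmark exposes.
            print(row)
```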

Test question on visual interpretation: AI agents fail to correctly determine the number of local maxima in the line graphs on page 5 of the US Treasury Monthly Bulletin (September 1990).

(Image: Databricks)
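For comparison: once the data points have been read off the chart, counting local maxima is a trivial programming exercise; the hurdle the benchmark exposes is the visual step, not the arithmetic. A sketch with made-up values:

```python
def count_local_maxima(values: list[float]) -> int:
    """Count interior points strictly greater than both of their neighbours."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    )


# Made-up series standing in for values read off a line graph.
series = [2.1, 3.4, 2.8, 3.9, 4.2, 3.7, 3.7, 4.0, 3.1]
print(count_local_maxima(series))  # -> 3
```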

OfficeQA is thus intended less as a showcase of performance than as a diagnostic tool. The consistent focus on realistic documents and on clearly, automatically verifiable answers is striking. At the same time, it remains an open question how representative a single – albeit extensive – document corpus is of the diversity of internal company information sources. The benchmark still has to prove itself in further scenarios. Precisely for this reason, Databricks is launching the Grounded Reasoning Cup 2026: researchers and industry partners are invited to test OfficeQA beyond the Treasury example and thereby contribute to broader acceptance and further development of the approach.

Databricks provides the OfficeQA benchmark to the research community free of charge as an open-source project; it is available via the public GitHub repository.

(vbr)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.