OpenAI's new o3 model aims to outperform humans in reasoning benchmarks
The new o3 model is designed to outperform humans in math and programming benchmarks, while o3-mini promises efficient reasoning at a strong price-performance ratio.
OpenAI CEO Sam Altman presented a preview of the two new models o3 and o3-mini in a video. They are the successors to the o1 reasoning model, which OpenAI released almost two weeks ago. According to Altman, there will be no o2 model. This is "out of respect for our friends at Anthropic", who recently presented their Claude series language models. He did not give any details. Altman also justified the name o3 with OpenAI's "great tradition of being really, really bad at naming things".
Increase in reasoning benchmarks
According to OpenAI, the o3 model sets new standards in demanding technical benchmarks in programming and mathematics. It achieved a score of 71.7 percent on the software engineering benchmark "SWE-Bench Verified", an improvement of over 20 percentage points compared to o1. On the competitive programming benchmark "Codeforces", o3 achieved an Elo rating of 2727, a performance that surpasses most human competitive programmers. The same applies to scientific benchmarks: on "GPQA Diamond", a benchmark of PhD-level science questions, o3 achieves an accuracy of 87.7 percent and thus outperforms typical experts holding a doctorate.
To further demonstrate the reasoning potential of o3, OpenAI presented results from the demanding "FrontierMath" benchmark (PDF) by Epoch AI. Here, o3 achieved an accuracy of over 25 percent, while previous models remained below 2 percent.
Better than humans in the ARC-AGI reasoning benchmark
o3 celebrated a particular success on the reasoning benchmark "ARC-AGI". In a high-compute configuration, o3 achieved an accuracy of 87.5 percent, surpassing the human baseline of around 85 percent for the first time. While this is an important step towards Artificial General Intelligence (AGI), passing ARC-AGI does not mean that AGI has been achieved. In fact, o3 still fails at some very simple tasks, which points to fundamental differences from human intelligence, according to an Arc Prize article. OpenAI and Arc Prize plan to expand their collaboration in the future.
o3-mini promises performance at low cost
OpenAI also presented the o3-mini model, which it says defines a "new frontier of cost-effective reasoning performance": with performance similar to o1, it is an order of magnitude faster and cheaper.
With o3-mini, users will be able to choose between three modes with different "reasoning effort". In a demo, the OpenAI researchers showed how o3-mini can evaluate itself in real time – by writing and executing an evaluation routine for itself. "Next time we should ask the model to improve itself," joked CEO Altman.
Security testing by the public
Altman announced that o3 and o3-mini will soon be released for testing by selected security researchers. The aim is to have the models examined for possible vulnerabilities and potential for abuse before they are made available to the public.
A new "deliberative alignment" process is intended to help align the models more closely with security guidelines. Reasoning should enable them to better recognize and reject unwanted requests.
According to Altman, o3-mini will be released at the end of January and o3 will be released to the public shortly afterwards. Interested researchers can apply for pre-release access until January 10.
Google announces "thinking" Gemini 2.0 model
Meanwhile, Google has announced its own reasoning-enabled language model: according to a blog post, the system, dubbed "Gemini 2.0 Flash", will have a "thinking mode" that checks and improves answers before they are output. Users will have the option of gaining insight into the system's "thoughts".
However, Google's reasoning model will initially only be available in an experimental, limited version. Researcher Noam Shazeer, who became known for his work on the influential "Transformer" paper, played a key role in its development. Shazeer had left Google, but returned following a deal between Google and his start-up Character AI.
(vza)