Gemini 2.5 Computer Use – Google's AI uses the browser
Google presents an AI model that can use the browser like a human. Gemini 2.5 Computer Use draws on visual and reasoning capabilities.
With Gemini 2.5 Computer Use, Google is presenting an AI model that specializes in operating the web through a browser the way humans do. The model builds primarily on the visual and reasoning capabilities of Gemini 2.5 Pro. These allow it to mimic human behavior closely and therefore to complete tasks well.
Like the offerings of other AI providers, Gemini 2.5 Computer Use can fill out forms, scroll, and click through websites. This requires agentic capabilities, which were already available through the Gemini API, but only in a non-specialized version of Gemini. The new model should handle interfaces much better, Google writes in a blog post. Gemini 2.5 Computer Use will initially be available via the Gemini API in Google AI Studio and Vertex AI.
AI model uses screenshots and agentic capabilities
The model first analyzes the task, together with a screenshot that lets it understand the interface, and generates an initial response. This usually takes the form of a function call that results in an action, such as clicking or typing. The model can also ask the person who gave the task for confirmation before performing an action, for example to avoid buying 100 pairs of socks in the wrong size or ending up in other risky situations. After each action, a new screenshot is taken and a new function call follows, until the original task is completed.
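The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeBrowser`, `scripted_model`, `run_agent`, and the `Action` type are hypothetical stand-ins invented for this example and are not part of the Gemini API. The "model" here simply replays a fixed plan, so the sketch shows the control flow (screenshot, function call, optional confirmation, repeat), not Google's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """A hypothetical function call proposed by the model."""
    name: str                          # e.g. "click", "type", "done"
    args: dict = field(default_factory=dict)
    needs_confirmation: bool = False   # risky actions ask the user first


class FakeBrowser:
    """Stand-in for a real browser: records actions instead of performing them."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> str:
        # A real agent would capture pixels; a short description suffices here.
        return f"screen after {len(self.log)} action(s)"

    def execute(self, action: Action) -> None:
        self.log.append(action.name)


def scripted_model(task: str, screenshot: str, history: List[str]) -> Action:
    """Stand-in for the model: replays a fixed plan, one step per turn."""
    plan = [
        Action("type", {"text": "socks"}),
        Action("click", {"target": "buy"}, needs_confirmation=True),
        Action("done"),
    ]
    return plan[len(history)]


def run_agent(
    task: str,
    browser: FakeBrowser,
    model: Callable[[str, str, List[str]], Action],
    confirm: Callable[[Action], bool],
    max_steps: int = 10,
) -> List[str]:
    """Screenshot -> function call -> action, repeated until the task is done."""
    history: List[str] = []
    for _ in range(max_steps):
        shot = browser.screenshot()           # observe the current interface
        action = model(task, shot, history)   # model proposes the next step
        if action.name == "done":
            break
        if action.needs_confirmation and not confirm(action):
            history.append(f"declined:{action.name}")  # user said no: skip it
            continue
        browser.execute(action)
        history.append(action.name)
    return history


if __name__ == "__main__":
    browser = FakeBrowser()
    steps = run_agent("buy socks", browser, scripted_model, confirm=lambda a: True)
    print(steps)
```

Passing a `confirm` callback that returns `False` shows the safeguard at work: the purchase click is recorded as declined and never reaches the browser. The `max_steps` limit is a design choice of this sketch, added so that a misbehaving model cannot loop forever.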
Gemini 2.5 Computer Use is optimized for web browsing; the model does not perform quite as well with mobile UIs. Google's target group is primarily developers, who can use Computer Use to test their software. Variants of the model also support features in Google's AI Mode, the newly introduced AI search, and in Project Mariner, Google's version of an AI agent.
Anthropic has already introduced a computer-use mode for Claude, which likewise relies on screenshots that the model analyzes. OpenAI's AI agent Operator and ChatGPT Agent also work with screen captures and agentic capabilities that make it possible to fill out forms, for example.
(emw)