Gemini 2.5 Computer Use – Google's AI uses the browser

Google presents an AI model that can use the browser like a human. Gemini 2.5 Computer Use builds on the visual understanding and reasoning capabilities of Gemini 2.5 Pro.



With Gemini 2.5 Computer Use, Google is presenting an AI model that specializes in using the web through a browser the way humans do. The model builds primarily on the visual understanding and reasoning capabilities of Gemini 2.5 Pro. These allow it to mimic human behavior closely and therefore to carry out tasks particularly well.

Like comparable offerings from other AI providers, Gemini 2.5 Computer Use can fill out forms, scroll, and click through websites. This requires agentic capabilities, which were previously available through the Gemini API, but only with a non-specialized version of Gemini. The new model should handle interfaces much better, Google writes in a blog post. Initially, Gemini 2.5 Computer Use will likewise be available through the Gemini API in Google AI Studio and Vertex AI.

The model first analyzes the task together with a screenshot, which it needs in order to understand the interface, and then generates an initial response. This usually takes the form of a function call that results in an action, such as clicking or typing. The model can also check back with the person who gave the task before performing an action. This is meant to prevent risky outcomes, such as buying 100 pairs of socks in the wrong size. After each action, a new screenshot is taken and a new function call follows, until the original task is completed.
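The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's API: the names `Action`, `run_agent_loop`, `propose_action`, `take_screenshot`, `execute`, and `confirm` are hypothetical, and a scripted stand-in replaces the real model so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """One UI action proposed by the model (hypothetical shape)."""
    name: str                          # e.g. "click", "type", "scroll", "done"
    args: dict = field(default_factory=dict)
    needs_confirmation: bool = False   # True for risky steps such as a purchase


def run_agent_loop(
    task: str,
    propose_action: Callable[[str, bytes, list], Action],
    take_screenshot: Callable[[], bytes],
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    max_steps: int = 20,
) -> list:
    """Repeat screenshot -> function call -> action until the task is done."""
    history: list = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                  # current state of the UI
        action = propose_action(task, screenshot, history)
        if action.name == "done":                       # model reports completion
            break
        if action.needs_confirmation and not confirm(action):
            history.append(("declined", action))        # user vetoed a risky step
            break
        execute(action)                                 # client performs the action
        history.append(("executed", action))
    return history


def scripted_model(task: str, screenshot: bytes, history: list) -> Action:
    """Stand-in for the model: replays a fixed plan, one step per call."""
    script = [
        Action("click", {"x": 120, "y": 80}),
        Action("type", {"text": "socks, size 42"}),
        Action("click", {"x": 300, "y": 400}, needs_confirmation=True),
        Action("done"),
    ]
    return script[len(history)]


if __name__ == "__main__":
    steps = run_agent_loop(
        task="order one pair of socks",
        propose_action=scripted_model,
        take_screenshot=lambda: b"",       # a real client would capture the browser
        execute=lambda action: None,       # a real client would drive the browser
        confirm=lambda action: True,       # a real client would ask the user
    )
    for status, action in steps:
        print(status, action.name, action.args)
```

In this sketch the confirmation check sits in the client rather than in the model, so a declined step stops the loop before anything is executed. How the real service signals that a step needs confirmation is not covered by the article.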


Gemini 2.5 Computer Use is optimized for web browsing; the model does not perform quite as well with mobile UIs. Google is primarily targeting developers, who can use Computer Use to test their software. Variants of the model also underpin features in Google's AI Mode, the newly introduced AI search, and in Project Mariner, Google's version of an AI agent.

Anthropic has already introduced a computer-use mode for Claude as well. It also relies on screenshots that the model analyzes. OpenAI's AI agent Operator and ChatGPT Agent likewise work with screen captures and agentic capabilities that make it possible to fill out forms, for example.

(emw)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.