Xiaomi's new AI model: Close to the competition, but significantly cheaper

Xiaomi MiMo-V2 is a family of AI models for agentic systems. The top model competes with industry leaders but is significantly cheaper via API.


Xiaomi introduces the MiMo-V2 model family, combining planning, perception, and language for AI agents.

(Image: Mehaniq / Shutterstock.com)


Xiaomi has introduced three AI models that together are intended to form the basis for autonomous AI agents. Simply put, the top model MiMo-V2-Pro functions as the “brain,” the multimodal model MiMo-V2-Omni as the “senses,” and the speech synthesis model MiMo-V2-TTS as the “voice” of agentic systems.

Fuli Luo, who previously worked on DeepSeek R1, is responsible for developing the models. In a post on X, Luo writes that the agentic orientation was not specifically planned but resulted from the industry's rapid shift from the chat paradigm to the agent paradigm. She holds out the prospect of an open-source release but makes it conditional on the models reaching sufficient stability.

The top model MiMo-V2-Pro is said to be capable of planning multi-stage tasks, integrating tools, and executing complex workflows. Technically, Xiaomi relies on a mixture-of-experts model with over a trillion parameters, of which only 42 billion are active at any given time. This means only a fraction of the model is computed per request, which limits computational effort and thus reduces costs. At the same time, MiMo-V2-Pro supports context windows of up to one million tokens, allowing it to process very extensive inputs.
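The sparse activation behind this can be sketched with a toy top-k router. Only the parameter totals below come from Xiaomi's figures; the expert count, scores, and routing function are hypothetical illustrations, not details of Xiaomi's actual architecture.

```python
# Parameter figures from Xiaomi's announcement
TOTAL_PARAMS = 1_000_000_000_000   # "over a trillion" total parameters
ACTIVE_PARAMS = 42_000_000_000     # 42 billion active per request

# Fraction of the model actually computed per token (about 4.2 %)
ACTIVE_FRACTION = ACTIVE_PARAMS / TOTAL_PARAMS

def route(expert_scores, k):
    """Top-k gating: pick the k highest-scoring experts for one token.
    Expert counts and scores here are illustrative only."""
    ranked = sorted(range(len(expert_scores)),
                    key=lambda i: expert_scores[i], reverse=True)
    return ranked[:k]

# Example: 8 hypothetical experts, only 2 activated per token
print(route([0.1, 0.9, 0.3, 0.8, 0.05, 0.2, 0.6, 0.4], k=2))  # → [1, 3]
```

The point of this routing scheme is that per-token compute scales with the active parameters, not the total, which is how a trillion-parameter model can be served at comparatively low cost.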

In benchmarks, Xiaomi positions MiMo-V2-Pro near the top of the field without clearly surpassing the leading models. According to the company, the model ranks among the global top 10 in the Artificial Analysis Intelligence Index and achieves high scores in agent-oriented tests such as ClawEval and PinchBench.

Here's how Xiaomi MiMo-V2-Pro compares.

(Image: Xiaomi)

Before the official presentation, MiMo-V2-Pro had already appeared under the name “Hunter Alpha” on platforms like OpenRouter, where the anonymously published model quickly established itself among the most used systems. Developers initially speculated that it might be a new model from DeepSeek.


Xiaomi highlights the cost structure of the API as a key argument. For larger context lengths of up to one million tokens, Xiaomi charges around two dollars per million input tokens and six dollars per million output tokens. For comparison: Claude Sonnet 4.6 costs about three and 15 dollars, and Claude Opus 4.6 five and 25 dollars per million input and output tokens, respectively.
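The price gap becomes concrete with a quick back-of-the-envelope calculation. The per-million-token rates below are the figures cited above; the sample workload of 800,000 input and 50,000 output tokens is an arbitrary assumption for illustration.

```python
# Per-million-token prices (input, output) in US dollars, as cited above
PRICES = {
    "MiMo-V2-Pro": (2.0, 6.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.6": (5.0, 25.0),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request at the listed API rates."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical long-context agent run: 800k input tokens, 50k output tokens
for model in PRICES:
    print(f"{model}: ${request_cost(model, 800_000, 50_000):.2f}")
```

For this sample run the listed rates work out to $1.90 for MiMo-V2-Pro versus $3.15 for Claude Sonnet 4.6 and $5.25 for Claude Opus 4.6.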

The multimodal model MiMo-V2-Omni complements MiMo-V2-Pro, which specializes in planning, with a perceptive and executive component. According to Xiaomi, the model processes image, video, and audio data simultaneously to understand situations, derive actions, and execute digital tasks.

Xiaomi illustrates this with several use cases: from analyzing dashcam videos and movie scenes to summarizing a seven-hour podcast, performing browser tasks with OpenClaw, automated shopping, and creating and uploading a short video to TikTok. Xiaomi's goal is for MiMo-V2-Omni to be able to plan tasks not just over minutes but over hours or days, and also to control physical systems, for example in robotics.

While MiMo-V2-Pro plans tasks and MiMo-V2-Omni implements them into actions, MiMo-V2-TTS is intended to handle linguistic communication with users. The speech synthesis model generates spoken responses in real-time and is designed to adapt tone and speaking style to the respective context. Currently, MiMo-V2-TTS only supports English and Chinese, but Xiaomi plans to expand language coverage in the future.

(mack)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.