Xiaomi discloses training details for its 1T-parameter MiMo-V2-Pro model: thousands of GPUs, no job ranks, no deadlines

ME News, April 24 (UTC+8). According to monitoring by Dongcha Beating, Luo Fuli, head of Xiaomi’s large-model team, disclosed in her first in-depth interview that the MiMo-V2-Pro base model has 1T total parameters and was trained on thousands of GPUs. She believes the 1T scale is currently the baseline for reaching a level close to Claude Opus 4.6 and for earning an entry ticket to the next phase of Agent competition.

On the technical side, the Pro version pushes the ratio of sliding-window attention to global attention to an extremely sparse 7:1, which keeps long-text inference cost under control even as the parameter count grows. It also continues to use the MTP (Multi-Token Prediction) architecture, which puts surplus compute to work accelerating inference.
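
To make the 7:1 figure concrete, here is a minimal Python sketch of how such a hybrid layer stack could be laid out and how much attention work it might save on long inputs. It is an illustration, not Xiaomi's implementation. It assumes the ratio means seven sliding-window layers for every global-attention layer, and that the global layer closes each group of eight. The layer count, window size, and sequence length in the usage line are hypothetical values chosen for the example; none of them come from the interview.

```python
def layer_schedule(num_layers, swa_per_global=7):
    """Label each layer 'swa' or 'global'.

    Assumption: one global-attention layer follows every
    `swa_per_global` sliding-window layers.
    """
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa"
            for i in range(num_layers)]


def attended_pairs(seq_len, kind, window):
    """Count query-key pairs in one causal attention layer."""
    if kind == "global":
        # Token t attends to all t + 1 tokens up to and including itself.
        return seq_len * (seq_len + 1) // 2
    # Sliding window: token t attends to at most `window` recent tokens.
    w = min(window, seq_len)
    return w * (w + 1) // 2 + (seq_len - w) * w


def relative_cost(seq_len, num_layers, window, swa_per_global=7):
    """Attention work of the hybrid stack relative to an all-global stack."""
    schedule = layer_schedule(num_layers, swa_per_global)
    hybrid = sum(attended_pairs(seq_len, kind, window) for kind in schedule)
    dense = num_layers * attended_pairs(seq_len, "global", window)
    return hybrid / dense


if __name__ == "__main__":
    # Hypothetical settings: 48 layers, 4096-token window, 131072-token input.
    print(layer_schedule(16))
    print(relative_cost(seq_len=131072, num_layers=48, window=4096))
```

The ratio counts only attention query-key pairs, so it ignores feed-forward layers, KV-cache memory, and MTP. It is a rough way to see why a sparser mix lowers long-text cost, not an estimate of real inference cost.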

On the management side, only 30 to 40 people on the 100-person MiMo team take part directly in core iterations. The team has no job ranks, no clear sub-team divisions, and no delivery deadlines. When numerical instabilities such as training loss spikes appear, the team halts training outright to investigate, even if that means a pause of one to two weeks and millions in compute costs.

(Source: BlockBeats)
