Xiaomi reveals training details of its 1T-parameter model MiMo-V2-Pro: thousands of GPUs, no ranks, no deadlines


ME News Report, April 24 (UTC+8). According to BlockBeats, Luo Fuli, head of Xiaomi's large model team, disclosed in her first in-depth interview that the MiMo-V2-Pro base model has a total parameter count of 1 trillion and was trained on thousands of GPUs. She believes that the 1-trillion-parameter scale is the minimum required to approach the performance of Claude Opus 4.6 and secure a spot in the next stage of the Agent competition.
On the technical side, the Pro version pushes the mix of sliding-window to global attention to an extremely sparse 7:1 ratio, keeping long-context inference costs under control while scaling up parameters, and it retains the MTP (Multi-Token Prediction) architecture, using surplus compute to accelerate inference.
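The cost argument behind that 7:1 mix can be sketched numerically. The layer layout, window size, and layer count below are illustrative assumptions, not Xiaomi's published configuration; the point is only that when most layers attend within a fixed window, per-layer cost grows linearly rather than quadratically in sequence length.

```python
# Hedged sketch: interleaving sliding-window and global attention layers
# at a 7:1 ratio, and the resulting attention cost at long context.
# WINDOW, the 32-layer depth, and the interleaving rule are assumptions.

WINDOW = 4096  # assumed sliding-window size, in tokens

def layer_pattern(num_layers: int, sliding_per_global: int = 7) -> list[str]:
    """One 'global' layer after every `sliding_per_global` 'sliding' layers."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (sliding_per_global + 1) == 0:
            pattern.append("global")
        else:
            pattern.append("sliding")
    return pattern

def attention_cost(pattern: list[str], seq_len: int) -> int:
    """Count of query-key pairs scored (a proxy for attention FLOPs)."""
    cost = 0
    for kind in pattern:
        if kind == "global":
            cost += seq_len * seq_len               # full quadratic attention
        else:
            cost += seq_len * min(seq_len, WINDOW)  # linear once seq_len > WINDOW
    return cost

pattern = layer_pattern(32)
print(pattern.count("sliding"), pattern.count("global"))   # 28 sliding, 4 global
dense = attention_cost(["global"] * 32, 128_000)
sparse = attention_cost(pattern, 128_000)
print(f"cost vs. all-global attention: {sparse / dense:.3f}")  # → 0.153
```

At an assumed 128k-token context, the 7:1 layout scores about 15% of the query-key pairs an all-global stack would, which is the sense in which sparsity "controls inference costs for long texts" while total parameters grow.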
On the management side, only about thirty to forty people within the hundred-person MiMo team are directly involved in core iterations. The team has no hierarchical ranks, no fixed subgroup divisions, and no hard delivery deadlines. When numerical instabilities arise, such as sudden spikes in training loss, the team opts to halt training outright for troubleshooting, even if that means stopping for a week or two and writing off millions in compute costs.
(Source: BlockBeats)
