Xiaomi reveals training details of the 1T-parameter model MiMo-V2-Pro: thousands of GPUs, no job ranks, no deadlines

According to Beating Monitoring, Xiaomi's large-model team lead Luo Fuli disclosed in her first in-depth interview that the MiMo-V2-Pro base model has a total of 1 trillion parameters and was trained on thousands of GPUs. She believes that the 1-trillion scale is the minimum required to approach the level of Claude Opus 4.6 and to secure a spot in the next phase of the agent competition.

On the technical side, the Pro version pushes the mix of sliding-window and global attention to an extremely sparse 7:1 ratio, keeping long-context inference costs under control even as the parameter count grows, and it continues to use the MTP (Multi-Token Prediction) architecture so that surplus compute can be spent on faster inference.
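To make the layer mix concrete, here is a minimal illustrative sketch, not Xiaomi's actual code, assuming the 7:1 ratio means seven sliding-window attention layers for every one global-attention layer, and using a simple causal sliding-window mask; all function names are hypothetical.

```python
def layer_schedule(num_layers: int, ratio: int = 7) -> list:
    """Attention-type schedule: `ratio` sliding-window layers for every
    global layer (the global layer is placed at the end of each block).
    Assumed interpretation of the reported 7:1 sparse ratio."""
    return [
        # every (ratio+1)-th layer uses full global attention
        "global" if (i + 1) % (ratio + 1) == 0 else "sliding"
        for i in range(num_layers)
    ]

def sliding_window_mask(seq_len: int, window: int) -> list:
    """Causal mask where token i attends only to the `window` most
    recent tokens (itself included): True = attention allowed."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

if __name__ == "__main__":
    print(layer_schedule(16))  # 2 global layers out of 16
    for row in sliding_window_mask(6, window=3):
        print([int(x) for x in row])
```

With this schedule, only 1 in 8 layers pays the quadratic cost of global attention over the full context; the rest scale linearly with sequence length, which is what keeps long-text inference affordable.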

On the management side, only 30 to 40 of the roughly 100 people on the MiMo team are directly involved in core iteration. The team has no formal job ranks, no fixed subgroup divisions, and no fixed delivery deadlines. When numerical instability appears, such as a sudden spike in training loss, the team stops training outright to troubleshoot, even if that means halting for a week or two and writing off millions in compute costs.
