Xiaomi MiMo discloses technical details of its full-chain inference system optimization for the first time

Mars Finance News, May 30: Xiaomi has publicly detailed the full-chain optimization of the inference system behind its MiMo-V2.5 series models. Targeting a hybrid architecture that combines sliding window attention (SWA), mixture of experts (MoE) and multimodality, the team systematically rebuilt the entire inference stack, from KVCache management, hierarchical caching and prefix caching to scheduling strategies and the Prefill/Decode pipeline. KVCache storage is compressed to about 1/7 of that of comparable solutions, which significantly lowers inference costs in long-sequence scenarios and is the core technical foundation of the recent price reduction. On May 27, the MiMo-V2.5 series API received a permanent price cut of up to 99%, with pricing no longer differentiated by input length. (Wide-angle observation)
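To make the link between SWA and KVCache size concrete, here is a minimal back-of-the-envelope sketch. It is not Xiaomi's implementation: the function name `kv_cache_bytes` and every number in the example (layer count, KV heads, head dimension, window size, number of SWA layers) are hypothetical, chosen only to illustrate the general idea that a sliding-window layer caches at most `window` tokens while a full-attention layer caches the whole sequence. It does not attempt to reproduce the reported 1/7 figure.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None, swa_layers=0):
    """Estimate the per-sequence KV cache size in bytes.

    Full-attention layers keep K and V for every token in the sequence.
    Sliding-window (SWA) layers keep K and V only for the last `window` tokens.
    All parameters are illustrative assumptions, not MiMo's real configuration.
    """
    if not 0 <= swa_layers <= n_layers:
        raise ValueError("swa_layers must be between 0 and n_layers")

    # One K vector and one V vector per KV head, per token, per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem

    full_layers = n_layers - swa_layers
    # With no window configured, SWA layers behave like full-attention layers.
    swa_tokens = seq_len if window is None else min(seq_len, window)

    return per_token_per_layer * (full_layers * seq_len + swa_layers * swa_tokens)


# Hypothetical 48-layer model, 128K-token context, 16-bit cache entries.
full = kv_cache_bytes(seq_len=131072, n_layers=48, n_kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(seq_len=131072, n_layers=48, n_kv_heads=8, head_dim=128,
                        window=4096, swa_layers=40)

print(f"full attention: {full / 2**30:.2f} GiB")
print(f"hybrid SWA:     {hybrid / 2**30:.2f} GiB")
print(f"ratio:          {hybrid / full:.3f}")
```

In this sketch the saving grows with sequence length, because the SWA layers' share of the cache stays fixed at `window` tokens while the full-attention layers keep growing. That is consistent with the article's point that the benefit shows up mainly in long-sequence scenarios.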
GasFeesForNightRuns
· 7h ago
Xiaomi's latest price cut goes all the way to the bone. Are we really talking about a 99% reduction?
QueuePosition
· 7h ago
From chips to frameworks to API pricing, the entire chain is connected. Xiaomi's approach is very similar to the cost-performance strategy in the mobile phone market back in the day.
PerpColdHands
· 7h ago
Waiting for real-world testing. If the 1/7 KVCache compression ratio holds up, the GPU memory bottleneck can be eased.
TheRedTelephoneBoothInTheRuins
· 7h ago
MoE architecture plus SWA: this configuration ranks among the top tier even in the open-source community. Xiaomi's technical disclosure this time is quite transparent.
BlueLakeOverlooker
· 8h ago
The inference cost structure has changed, and the pricing benchmarks for downstream applications need to be re-evaluated. The entire ecosystem may need to be reshuffled.
ResilientGoldfish
· 8h ago
Not differentiating pricing by input length is absolutely huge. Long-text users are overjoyed, and they no longer have to painstakingly count tokens.
GlassDomeUniverse
· 8h ago
The prefill/decode pipeline has been modified; the design of hierarchical caching plus prefix caching is very detailed, indicating it has been refined through real-world business use.
SecondaryMarketDeserter
· 8h ago
Is Xiaomi planning to make large model inference dirt cheap?
A 99% price drop on the API—how can others compete?
Semi-MeltedIceCream
· 8h ago
On May 27, prices were permanently lowered, with no differentiation by input length. This pricing strategy directly upends the old token-based billing approach.