Luo Fuli reveals MiMo's cost-reduction secret weapon: prefill attention compute is cut down to the 10 global GQA layers

CoinWorld News: After permanently lowering API prices for Xiaomi's self-developed MiMo-v2.5 model series, Luo Fuli explained on X the algorithmic mechanisms behind the cost reduction. She said that even with API prices aligned to DeepSeek's, Xiaomi's inference engine can still break even under high load, with the savings coming mainly from the hybrid attention architecture and hierarchical KV cache optimization. To pursue a 99% reduction in cache-hit cost, Xiaomi's inference framework implemented hierarchical KV cache optimization for sliding window attention (SWA). In production testing, this optimization increased cached-token capacity fivefold and reduced cache costs by 80%. Luo Fuli said that low-cost inference services help stimulate demand for end-device intelligence. Large-model companies, she argued, should avoid blind price wars and instead use low-level co-design of algorithms and inference systems to keep actual operating expenses below the break-even point.
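As a rough illustration of why a hybrid of global and sliding-window layers shrinks the KV cache, and of one way the 5x capacity and 80% cost figures could relate, here is a minimal Python sketch. It is not Xiaomi's implementation: the function names, layer counts, window size, and sequence length are hypothetical placeholders, and the assumption that cache cost scales inversely with token capacity is ours, not the article's.

```python
def kv_entries(seq_len: int, n_global: int, n_swa: int, window: int) -> int:
    """Count cached (layer, token) KV entries for one sequence.

    Global-attention layers keep every token. Sliding-window layers only
    need the most recent `window` tokens.
    """
    global_entries = n_global * seq_len
    swa_entries = n_swa * min(seq_len, window)
    return global_entries + swa_entries


def per_token_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token cache cost when the same memory
    budget holds `capacity_multiplier` times as many tokens."""
    return 1.0 - 1.0 / capacity_multiplier


# Hypothetical configuration, chosen only for illustration.
SEQ_LEN = 32_768   # tokens in the cached prefix (assumed)
N_GLOBAL = 10      # global-attention layers (assumed)
N_SWA = 38         # sliding-window layers (assumed)
WINDOW = 128       # sliding-window size in tokens (assumed)

all_global = kv_entries(SEQ_LEN, N_GLOBAL + N_SWA, 0, WINDOW)
hybrid = kv_entries(SEQ_LEN, N_GLOBAL, N_SWA, WINDOW)

print(f"hybrid cache size relative to all-global: {hybrid / all_global:.1%}")
print(f"5x capacity -> {per_token_cost_reduction(5):.0%} lower per-token cost")
```

Under that inverse-scaling assumption, holding five times as many tokens in the same memory puts per-token cost at one fifth, which lines up with the reported 80% reduction. The article itself does not say how the two numbers were derived.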
Comment
ASolitaryRockBeforeTheVolcano
· 12h ago
MiMo's price cut this time is really aggressive. A 99% cost reduction sounds like science fiction, but the SWA optimization does have real substance.
LendingRateAnxiety
· 12h ago
Hybrid attention plus hierarchical caching: after this combination, smaller companies face even greater pressure on inference cost.
Pragmatists
· 12h ago
How is a 5x increase in cache capacity achieved? Are there papers on hierarchical KV caches? I want to read them in detail.
InstantNoodlesWithContracts
· 12h ago
Cutting costs through collaboration between the algorithm and system layers is the right approach; focusing on price cuts alone has no future. Luo Fuli sees this clearly.
PocketValidator
· 13h ago
Still breaking even after aligning with DeepSeek's prices suggests the original pricing did leave room for margin. This looks like a return to a reasonable level.