Mars Finance News, May 30: Xiaomi has officially published its full-chain optimization plan for the inference system behind the MiMo-V2.5 series models. Targeting a hybrid architecture that combines SWA (sliding window attention), MoE, and multimodality, the team systematically rebuilt the entire inference stack, from KVCache management, hierarchical caching, and prefix caching to scheduling strategies and the Prefill/Decode pipeline. KVCache storage is compressed to roughly 1/7 of that of comparable solutions, which significantly lowers inference costs in long-sequence scenarios and is the core technical basis for the recent price cut. On May 27, the MiMo-V2.5 series API completed a permanent price cut of up to 99%, with pricing that does not distinguish by input length. (Wide-angle observation)
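
To illustrate why sliding window attention can shrink the KVCache on long sequences, here is a rough back-of-the-envelope sketch. It is not Xiaomi's method and does not reproduce the 1/7 figure. The function name `kv_cache_bytes`, the layer counts, head sizes, and window length are all hypothetical values chosen only for the arithmetic.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None, swa_layers=0):
    """Estimate the KV cache size in bytes for one sequence.

    Full-attention layers cache every token. Sliding-window layers
    cache at most `window` tokens. All parameters are illustrative.
    """
    # Each cached token stores one K and one V vector per KV head.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    full_layers = n_layers - swa_layers
    swa_tokens = seq_len if window is None else min(seq_len, window)
    cached_tokens = full_layers * seq_len + swa_layers * swa_tokens
    return per_token_per_layer * cached_tokens


# Hypothetical model: 48 layers, 8 KV heads of dimension 128, 16-bit values.
SEQ_LEN = 131072
full = kv_cache_bytes(SEQ_LEN, n_layers=48, n_kv_heads=8, head_dim=128)

# Same model, but assume 40 of the 48 layers use a 4096-token window.
hybrid = kv_cache_bytes(SEQ_LEN, n_layers=48, n_kv_heads=8, head_dim=128,
                        window=4096, swa_layers=40)

print(f"full attention : {full / 2**30:.2f} GiB")
print(f"hybrid with SWA: {hybrid / 2**30:.2f} GiB")
print(f"ratio          : {hybrid / full:.3f}")
```

Under these assumed numbers, the windowed layers stop growing once the sequence exceeds the window, so the saving widens as the input gets longer. The real ratio depends on the actual layer mix, window size, and any additional compression, none of which are given in this brief.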

GasFeesForNightRuns

· 7h ago

Xiaomi's latest round of price cuts has slashed prices to the bone. Are we really talking about a 99% reduction?

QueuePosition

· 7h ago

From chips to frameworks to API pricing, the entire chain is connected. Xiaomi's approach is very similar to the cost-performance strategy in the mobile phone market back in the day.

PerpColdHands

· 7h ago

Waiting for real-world testing. If the 1/7 KVCache compression ratio holds up, the GPU memory bottleneck could be eased.

TheRedTelephoneBoothInTheRuins

· 7h ago

An MoE architecture plus SWA: this configuration is top tier even by open-source community standards. Xiaomi's technical disclosure this time is quite transparent.

BlueLakeOverlooker

· 8h ago

The inference cost structure has changed, so the pricing benchmarks for downstream applications need to be re-evaluated. The entire ecosystem may be reshuffled.

ResilientGoldfish

· 8h ago

Pricing that ignores input length is absolutely huge: long-text users are overjoyed, and they no longer have to painstakingly count tokens.

GlassDomeUniverse

· 8h ago

The prefill/decode pipeline has been modified; the design of hierarchical caching plus prefix caching is very detailed, indicating it has been refined through real-world business use.

SecondaryMarketDeserter

· 8h ago

Is Xiaomi planning to make large model inference dirt cheap?
A 99% price drop on the API—how can others compete?

Semi-MeltedIceCream

· 8h ago

Prices were permanently lowered on May 27, regardless of input length. This pricing strategy directly upends the old token-based billing approach.
