On May 26, the official Xiaomi MiMo account on X posted an announcement: API prices for the MiMo-V2.5 series will be cut permanently, by up to 99%. All context lengths are priced uniformly, and token packages are enlarged by 5 to 8 times.

The announcement dominated discussion in China’s AI circles for an entire week. The industry’s first reactions split into several camps. The largest group called it “another round of the price war”: over the past two years, Chinese foundation model providers have taken turns cutting prices, among them Zhipu, DeepSeek, ByteDance’s Doubao, and Alibaba’s Tongyi. If you don’t compete, you get left behind.

Another camp was pessimistic: Xiaomi had just announced that this year’s profits were cut in half, yet it is still pouring 60 billion yuan into AI and cutting API prices by 90%. That’s a textbook case of selling at a loss to grab market share. Others saw a continuation of the “DeepSeek effect”: DeepSeek dragged the industry’s pricing benchmark down to the floor, so everyone else must follow or be eliminated.

So last night Luo Fuli, the person in charge of MiMo, published a 5,000-word technical blog that lays out the engineering ledger behind the price cuts for everyone to see.

“Look—this is real engineering capability, not a marketing tactic.”

To understand what Luo Fuli is saying, you first need to know exactly what this 99% actually discounts.

It’s not a discount across the entire model. The 99% discount is specifically applied to a pricing tier called Input (Cache Hit)—the part where “users repeatedly read historical context in long conversations.” For regular new inputs (No Cache Hit), the discount is much smaller, and the reduction is minimal for model output (Output).

If you treat the model like a coffee shop, this is easy to understand.

You order a half-sugar latte. A coffee shop has two ways to serve it. One is to grind the beans from scratch every time, measure the sugar, and add syrup and milk, so you pay for raw materials and labor on every cup. But if the shop knows you’ll drink the same half-sugar latte every day this week, it can brew a big pot in advance, keep it in the fridge, and scoop out a cup each time you come in. MiMo does the latter: it changes the repeated-read portion from “compute in real time” to “retrieve in real time.” The true cost of that portion is therefore close to zero, which is what makes a 99% discount possible.

To make “retrieve in real time” work, the technical blog describes six engineering efforts, none of which can be missing. Let’s break them down one by one.

Engineering 1: Compress the model’s “memory” to 1/7

When the model is in conversation with you, it must compute and store an “intermediate state” for each token, to be used in later steps. This is called the KVCache; think of it as the model’s “short-term memory notebook.” As the conversation proceeds, the model writes notes on each token into the notebook; at the next step, it just flips to the notes instead of rereading the whole conversation from the start.

Traditional models perform “Full Attention” at every layer, meaning every token must look at all earlier tokens in the entire conversation. The longer the conversation, the thicker the notebook grows. MiMo-V2.5-Pro changed the architecture: of its 70 layers, 60 attend only to the most recent 128 tokens (SWA, Sliding Window Attention), while only 10 layers act as “archivists” and see the entire history.

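As a result, for long contexts the KVCache shrinks to about 1/7 of what Full Attention would need (10 of the 70 layers keep the full history, and the 128-token windows of the other 60 are negligible by comparison), and the amount of computation falls to about 1/7 as well.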

This is the first foundation for cost reduction. By analogy: previously, every employee in a company had to remember all meeting notes, so everyone was overloaded and efficiency was low. Under the new rule, 60 of the 70 employees remember only the most recent notes, and just 10 archivists keep the full history. Nothing is lost from the company’s records, but the total note-keeping load falls to about 1/7, so efficiency rises roughly sevenfold.
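The 1/7 figure can be checked with a little arithmetic. The sketch below counts cached token-slots for the hybrid layout described above (10 full-attention layers, 60 SWA layers, a 128-token window) and compares it with 70 full-attention layers. It assumes every layer stores the same number of bytes per cached token, which the blog summary does not state; the function name is illustrative.

```python
def kv_ratio(seq_len, full_layers=10, swa_layers=60, window=128):
    """Hybrid KV cache size divided by the all-full-attention size.

    Assumption: each layer stores the same bytes per cached token, so the
    per-token size cancels and only token-slot counts matter.
    """
    hybrid = full_layers * seq_len + swa_layers * min(seq_len, window)
    full = (full_layers + swa_layers) * seq_len
    return hybrid / full


for n in (1_000, 64_000, 1_000_000):
    print(f"{n:>9} tokens: hybrid/full = {kv_ratio(n):.4f}")
```

For short conversations the SWA windows still carry weight, so the ratio sits above 1/7; it approaches 1/7 (about 0.143) only as the context grows long.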

Engineering 2: Make the space saved by SWA truly usable

Compressing the notebook to 1/7 architecturally is only the first step. To turn the “theoretical 1/7” into an “actual 1/7,” there is another hurdle.

In a traditional KVCache system, memory is allocated uniformly to all layers based on the maximum possible usage. Even if the 60 SWA layers need only a small notebook, the system still gives every layer an “archivist’s big notebook.” The space saved by SWA is reserved but never used, so nothing is actually saved.

Luo Fuli’s team’s approach is to split KVCache into two independent pools. The 10 layers using Full Attention use the “large pool,” allocated by full length. The 60 SWA layers use the “small pool,” allocated only for a 128-token window.

By analogy: the company used to give every employee a “file cabinet that can hold 100 years of documents,” even though 60 of them only need “small cabinets for one week’s files.” In the big cabinets, 99% of the space sat empty. The new method allocates cabinets according to actual need. As a result, the same office fits more than 5 times as many employees: on the same GPU, the number of concurrent users that can be served increases by 5 times.

This step may look simple, but without it, the advantages of the SWA architecture are basically wasted.
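To see why the split matters, here is a toy comparison of the two allocation schemes. It is a sketch under stated assumptions, not Xiaomi’s allocator: memory is measured in abstract token-slots, all sequences have the same length, and the class names, the budget, and the way the budget is split between pools are all hypothetical.

```python
class UniformAllocator:
    """Old scheme: every layer reserves room for the full sequence."""

    def __init__(self, total_slots, layers=70):
        self.free = total_slots
        self.layers = layers

    def admit(self, seq_len):
        need = self.layers * seq_len
        if need > self.free:
            return False
        self.free -= need
        return True


class DualPoolAllocator:
    """Two pools: full-length slots for the full-attention layers,
    window-sized slots for the SWA layers."""

    def __init__(self, full_slots, swa_slots,
                 full_layers=10, swa_layers=60, window=128):
        self.full_free = full_slots
        self.swa_free = swa_slots
        self.full_layers = full_layers
        self.swa_layers = swa_layers
        self.window = window

    def admit(self, seq_len):
        need_full = self.full_layers * seq_len
        need_swa = self.swa_layers * min(seq_len, self.window)
        if need_full > self.full_free or need_swa > self.swa_free:
            return False
        self.full_free -= need_full
        self.swa_free -= need_swa
        return True


def count_admitted(allocator, seq_len):
    """Admit identical sequences until the allocator refuses one."""
    n = 0
    while allocator.admit(seq_len):
        n += 1
    return n


SEQ_LEN = 32_000
BUDGET = 70 * SEQ_LEN * 10   # hypothetical: room for 10 sequences the old way
SWA_SLOTS = 525_000          # hypothetical share of that budget for the small pool

uniform = count_admitted(UniformAllocator(BUDGET), SEQ_LEN)
dual = count_admitted(DualPoolAllocator(BUDGET - SWA_SLOTS, SWA_SLOTS), SEQ_LEN)
print(uniform, dual)
```

The toy count ignores everything else that occupies GPU memory, so it should be read as an idealized upper bound rather than a prediction of the 5x figure reported in the blog.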

Engineering 3: Make “old users repeatedly reading” actually hit the cache

With the notebook compressed to 1/7 and the saved space actually usable, the next step is to solve another old problem: the hit rate of prefix caching.

Many users’ conversations share the same beginnings—identical system prompts, the same codebase, and the same long document. The system stores the computed results, and when the same prefix appears again, it matches and reuses them directly. This mechanism is called a prefix cache.

But under SWA, a pitfall appears: identical tokens in two requests do not guarantee that the cached KV is still valid. The prefix may have been computed before, but the parts outside the SWA window have already been evicted. If the system reuses data under the old rule that “token match means cache hit,” it may read invalid or overwritten data, and the model’s output quality will collapse.

Luo Fuli’s team upgraded the rule to a “window-safe length”: the system promises reuse only for the part of the prefix whose cached data is still complete.

By analogy: in a library with 1,000,000 books, you want to borrow the complete three-volume set of “The Three-Body Problem.” The old system might tell you, “This book is available.” But when you get there, the shelf holds only the cover and the first volume; the other two have been checked out. This kind of “pseudo-hit” sends you on a wasted trip and forces you to borrow again. The new system only promises the portion you can borrow in full: first it gives you volume 1, and then it fetches volumes 2 and 3 for you later.
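One way to picture a window-safe rule is sketched below. It rests on an assumption the blog summary does not spell out: that the cache keeps a complete SWA window only at certain checkpoint positions of a stored prefix, so reuse must stop at the last checkpoint that the new request still matches. The function names and the checkpoint positions are hypothetical.

```python
from bisect import bisect_right


def common_prefix_len(a, b):
    """Number of leading tokens two requests share (the old hit rule)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def window_safe_len(matched_len, swa_checkpoints):
    """Largest checkpoint <= matched_len, or 0 if none.

    swa_checkpoints: sorted positions at which a full SWA window is
    assumed to still be stored for the cached prefix.
    """
    i = bisect_right(swa_checkpoints, matched_len)
    return swa_checkpoints[i - 1] if i else 0


cached_tokens = list(range(1000))
swa_checkpoints = [256, 512, 1000]          # hypothetical positions
request = list(range(700)) + [-1] * 50      # shares the first 700 tokens

matched = common_prefix_len(cached_tokens, request)    # naive "token match"
reusable = window_safe_len(matched, swa_checkpoints)   # safe to reuse
print(matched, reusable)
```

Here the naive rule would claim 700 reusable tokens, while the window-safe rule promises only 512 and recomputes the rest, which is the "borrow only what is complete" behavior in the analogy.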

It sounds stricter, and you might think the hit rate would drop. But in practice it’s the opposite: because SWA reduces the KVCache size to 1/7, within the same storage space, you can store multiple times more content. The real hit rate is therefore increased dramatically.

Luo Fuli’s blog provides real online test numbers: under mainstream harness frameworks, the server cache hit rate averages 93%, and for high-frequency, long-cycle users it can reach 95% or higher.

To translate what that means: 95% of “repeated read” requests don’t even need the GPU to compute—they are fetched directly from cache. This is the physical basis for the 99% discount.
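To make the link between hit rate and the bill concrete, the sketch below prices one request across the three tiers mentioned earlier: Input (Cache Hit), Input (No Cache Hit), and Output. The prices are placeholders in arbitrary units, not MiMo’s price list, and the function name is illustrative.

```python
def request_cost(input_tokens, output_tokens, hit_rate, prices):
    """Bill one request across three tiers; prices are per million tokens."""
    cached = input_tokens * hit_rate
    fresh = input_tokens - cached
    total = (cached * prices["cache_hit"]
             + fresh * prices["cache_miss"]
             + output_tokens * prices["output"])
    return total / 1_000_000


# Placeholder prices in arbitrary units per million tokens (assumption).
PRICES = {"cache_hit": 0.01, "cache_miss": 1.0, "output": 3.0}

cold = request_cost(100_000, 1_000, hit_rate=0.0, prices=PRICES)
warm = request_cost(100_000, 1_000, hit_rate=0.95, prices=PRICES)
print(f"cold: {cold:.5f}  warm: {warm:.5f}")
```

With a long input and a short output, the cost of a warm request is dominated by the small uncached remainder and the output, which is why a high hit rate matters so much for long-context workloads.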

Engineering 4: Put the cache on the GPU machines’ built-in SSDs

Now that the hit rate is up, the next question is: where do you put these caches?

GPU memory (HBM on the GPU) is expensive and limited. An H100 eight-card machine has only 640GB of VRAM, but the KVCache MiMo needs to store might be on the order of tens of TB. So you must use a tiered approach: the most recently used data goes into VRAM (L1), slightly older data goes into CPU memory (L2), and cold data is stored in distributed cache (L3).
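The tiering idea can be sketched as a small three-level cache: look in the fastest tier first, promote on a hit, and push the least recently used entry down a level when a tier overflows. This is a generic illustration, not GCache; the class name, the LRU policy, and the capacities are assumptions.

```python
from collections import OrderedDict


class TieredCache:
    """Toy L1/L2/L3 cache with LRU demotion between tiers."""

    def __init__(self, capacities):
        # capacities: max entries per tier, fastest first,
        # e.g. (VRAM, CPU memory, distributed cache)
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, key, value, level):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: entry is dropped
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # evict LRU entry
            self._insert(old_key, old_value, level + 1)    # demote it

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # avoid duplicates across tiers
        self._insert(key, value, 0)

    def get(self, key):
        """Return (value, level_found), or (None, None) on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier
                return value, level
        return None, None


cache = TieredCache(capacities=(1, 1, 2))
for name in ("a", "b", "c"):
    cache.put(name, f"kv-{name}")
print(cache.get("a"))   # oldest entry, found in the coldest tier
print(cache.get("a"))   # now promoted to the fastest tier
```

A real system would move KV blocks rather than whole entries and would weigh transfer cost against recomputation, but the lookup order and the demotion path are the same idea.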

The industry’s conventional practice is to build a dedicated storage cluster for L3—specialized hardware, specialized data centers—paying rental fees every month.

Xiaomi’s storage team does it differently. They developed a distributed cache called GCache and deploy it directly on the SSDs that come with the GPU machines—mixing training tasks and inference tasks on the same machines.

In plain terms: others rent a warehouse to store huge amounts of data; Xiaomi found that the GPU machines’ garage is actually empty, so they store the data straight in there. They save the monthly rent.

The blog’s original wording is: “Additional storage costs are 0.”

The impact of this is bigger than it looks. In a typical “AI company compute accounting” ledger, storage cost is a fixed expense: the larger the model and the more users there are, the longer the storage bill gets. This GCache approach eliminates that item entirely. Combined with SWA’s small size and the 93–95% hit rate, the KVCache TTL (time to live) in L3 extends from minutes to hours, or even days. The longer the TTL, the wider the cache-hit window for historical context, the higher the cache hit rate, and the more defensible the 99% discount becomes.

Engineering 5: Route cache-hit requests along the shortest path

With caches that can be stored and retrieved cheaply, the final step is routing the right requests to the right machines.

Xiaomi developed its own scheduling system called LLM-Router, which does three things:

First, affinity scheduling. Requests with the same prefix are routed to the same machine to maximize cache reuse.

Second, length bucketing. Short requests (0–64K), medium requests (64K–256K), and long requests (256K–1M) are assigned to different processing channels to avoid short requests being slowed down by long ones.

Third, TTFT optimization. In the queue waiting for inference, requests with smaller actual computation are prioritized (that is, requests with a large number of cache hits). This prevents them from being blocked by “fresh inputs” that require full recomputation.

Think of an airport. Passengers bound for the same destination are grouped in the same departure lounge and share the same baggage flow: this is affinity scheduling. Passengers with carry-on luggage and those with large checked bags go through different security lanes, so the faster ones aren’t delayed by the slower ones: this is length bucketing. At boarding, passengers with only a carry-on bag go first; they board faster, so the plane can leave earlier: this is TTFT optimization.
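The three scheduling rules can be sketched in a few lines. This is not LLM-Router’s code: the hashing scheme, the fixed prefix length, the assumption that K means 1,024 tokens, the inclusive bucket boundaries, and all function names are illustrative choices.

```python
import hashlib
import heapq

K = 1024  # assumption: "64K" means 64 * 1024 tokens


def pick_worker(prompt_tokens, num_workers, prefix_len=1024):
    """Affinity scheduling: same leading tokens -> same worker."""
    prefix = prompt_tokens[:prefix_len]
    digest = hashlib.sha256(repr(prefix).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def length_bucket(num_tokens):
    """Length bucketing: keep short requests away from long ones."""
    if num_tokens <= 64 * K:
        return "short"
    if num_tokens <= 256 * K:
        return "medium"
    return "long"


def order_by_uncached(requests):
    """TTFT optimization: serve requests with the least real compute first.

    requests: iterable of (request_id, total_tokens, cached_tokens).
    """
    heap = [(total - cached, rid) for rid, total, cached in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


shared = list(range(1024))
a = shared + [7] * 100
b = shared + [9] * 500
print(pick_worker(a, 8) == pick_worker(b, 8))
print(length_bucket(len(a)), length_bucket(100 * K), length_bucket(500 * K))
print(order_by_uncached([("fresh", 50_000, 0), ("hit", 200_000, 199_000)]))
```

Hashing a fixed-length prefix is the simplest possible affinity rule; a production router would more plausibly match against a prefix tree so that requests sharing prefixes of different lengths still land together.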

In real testing, this scheduling strategy increased the L2 cache hit rate by 25%, improved single-machine input throughput by 30%, and reduced the P90 latency for long requests by 30%.

Translated into plain logic: the same GPU can serve more users. The other half of the price-reduction logic is right here—higher effective output per unit of compute, and lower cost per user.

Engineering 6: Make the model “typing” faster too

The first five items all optimize the “reading” side—driving the cost of users repeatedly reading historical context to near 0. The sixth item optimizes the “writing” side—that is, the process of generating the next token.

Traditional models can only generate 1 token at a time. MiMo natively supports 3-layer MTP (Multi-Token Prediction): it predicts the next 3 tokens in one shot. If the middle predictions are correct, it skips the intermediate computation.

For example, traditional typing goes one character at a time: a four-character phrase meaning “today’s weather” takes four keystrokes. MTP is like autocomplete that guesses your next one or two characters. If it guesses correctly, you don’t need to press those keys.
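The guess-then-check step can be sketched as follows. This is the generic draft-and-verify pattern, shown under the assumption that MiMo’s MTP is used this way at decode time; the blog summary does not give the exact procedure, and the function names are hypothetical.

```python
def commit_tokens(drafts, verified):
    """Keep drafted tokens up to the first wrong guess.

    drafts:   tokens guessed ahead by the MTP heads (e.g. 3 of them).
    verified: the main model's own choice at each drafted position,
              plus one extra token after the last draft
              (so len(verified) == len(drafts) + 1).
    Returns the tokens committed in this decoding step.
    """
    n = 0
    while n < len(drafts) and drafts[n] == verified[n]:
        n += 1
    # n correct guesses, plus the main model's token at position n
    return verified[: n + 1]


def expected_tokens_per_step(accept_prob, k=3):
    """Average tokens per step if each guess is accepted independently
    with probability accept_prob (a simplifying assumption)."""
    return sum(accept_prob ** i for i in range(k + 1))


print(commit_tokens(["the", "sky", "is"], ["the", "sky", "is", "blue"]))
print(commit_tokens(["the", "sea", "is"], ["the", "sky", "is", "blue"]))
print(commit_tokens(["a", "sky", "is"], ["the", "sky", "is", "blue"]))
```

When every guess is wrong the step still commits one token, so in this scheme the output never falls behind ordinary one-token-at-a-time decoding; the gain depends entirely on how often the guesses are right.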

MiMo’s MTP has been tested in agentic scenarios: over the first 128 tokens, decoding is 2.3 times faster; from 128 to 256 tokens, it is 1.5 times faster.

The significance is this: the 99% discount specifically targets Input (Cache Hit). But when actually serving users, input and output happen within the same request. If output isn’t also reduced, you only save half of the overall request cost. MTP reduces that other half as well, and only then does the profitability model behind the full price cut close the loop.

Connect the six items into a cost-reduction chain:

SWA architecture → KVCache reduced to 1/7 → dual pools turn that into real capacity → the same GPU holds 5+ times more concurrent requests → prefix cache hit rate of 93–95% → 95% of requests need almost no computation → GCache brings additional storage cost to zero → scheduling prioritizes cache-hit requests → MTP makes generation cheaper too → GPU time per request drops by an order of magnitude → unit cost drops by 95%+ → pricing drops by 99% while gross margin stays positive.

If any one link is missing, this chain breaks at that point. The 99% price cut is not a marketing number; it’s the cumulative effect of six engineering pillars plus real online validation.

Looking back at the industry’s initial interpretations, each has some truth. This year’s price war among China’s large model companies is real. Xiaomi’s profit being halved and the fact that they are still spending big on AI is real. DeepSeek dragging industry pricing down to the floor is real too.

But by publishing this detailed technical blog, Luo Fuli is clearly pushing back against the claim that the price cut is just price-war marketing, emphasizing that “technology problems should be handled as technology problems, and marketing problems should be handled as marketing problems.”

In her blog, she wrote that the inference efficiency of the MiMo-V2.5 series does not come from a single breakthrough but from coordinated optimization across multiple dimensions. Hybrid SWA lets prefill and decode benefit at the same time, yet an insufficiently optimized KVCache implementation would instead raise costs across the board. With that in mind, the MiMo team systematically rebuilt KVCache management, hierarchical caching, and the prefix cache tree, tackled the core SWA KVCache issues, optimized the scheduling strategy and the Prefill/Decode pipeline, and validated everything in real online scenarios, ultimately turning theoretical efficiency advantages into real production performance. Only then does Hybrid SWA truly show its architectural strength in long-text inference: both capable and efficient. Combined with MoE configurations and various multimodal inference optimizations, it greatly improves the performance of the online inference service.

This is a systematic approach to AI engineering—also a cost-reduction method that the industry should learn from together.

Price wars don’t need blogs; engineering execution does.


