ME News, April 18 (UTC+8): According to Beating Monitoring, Moonshot AI and Tsinghua University posted a paper titled “Prefill-as-a-Service” to arXiv on April 16, proposing to run the prefill stage of large-model inference across data centers.

Large-model inference has two stages: prefill reads the whole input at once and produces a KV cache; decode then generates the output token by token from that cache. The two stages stress hardware differently: prefill is compute-bound, while decode is bound by memory bandwidth. The mainstream industry approach is to run the two stages on different machines (PD separation), but this requires RDMA interconnect within a single data center, because dense-attention models produce KV cache at tens of Gbps; if the transfer is slow, the GPUs sit idle.

The turning point is a new generation of hybrid attention models. The paper’s experiments show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T cut KV cache throughput by about an order of magnitude by combining a small number of full-attention layers with a large number of linear layers; Ring-2.5-1T achieves an overall compression ratio of 36x. At that point the KV cache no longer needs a dedicated RDMA network and can be carried over ordinary Ethernet.
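To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. Every number in it (layer counts, head sizes, prefill rate) is a hypothetical placeholder rather than a figure from the paper, and the helper names are invented for illustration. It also ignores the fixed-size state kept by linear layers, which does not grow with sequence length.

```python
def kv_bytes_per_token(full_attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes added per token: one key and one value tensor
    per full-attention layer (linear layers are ignored here)."""
    return 2 * full_attn_layers * kv_heads * head_dim * bytes_per_value


def required_gbps(bytes_per_token, prefill_tokens_per_s):
    """Network rate needed to ship KV cache as fast as prefill produces it."""
    return bytes_per_token * prefill_tokens_per_s * 8 / 1e9


# Hypothetical dense model: all 64 layers use full attention.
dense = kv_bytes_per_token(full_attn_layers=64, kv_heads=8, head_dim=128)

# Hypothetical hybrid model: only 16 of the 64 layers use full attention.
hybrid = kv_bytes_per_token(full_attn_layers=16, kv_heads=8, head_dim=128)

# Assumed prefill rate of 20,000 tokens per second on one machine.
print(f"dense : {required_gbps(dense, 20_000):.2f} Gbps")
print(f"hybrid: {required_gbps(hybrid, 20_000):.2f} Gbps")
print(f"reduction: {dense / hybrid:.1f}x")
```

Under these assumed numbers the hybrid layout needs a quarter of the bandwidth; the larger reductions reported in the paper would come from model-specific choices that this sketch does not attempt to reproduce.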

Concretely, PrfaaS sets up an independent “prefill cluster” and routes to it only requests that have long contexts and whose prefixes miss the cache, while short requests stay in the local PD cluster. Once prefill completes, the KV cache is sent back over Ethernet to the local cluster for decoding. The design also introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool. The paper reports experiments with an internal 1T-parameter hybrid model (based on the Kimi Linear architecture) showing overall service throughput 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, with each machine using only a moderate amount of cross-data-center bandwidth.
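The routing idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's scheduler: the `Request` type, the `route` function, the 8192-token threshold, and the bandwidth figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int          # total tokens in the prompt
    cached_prefix_tokens: int   # tokens already covered by the local prefix cache


def route(req, length_threshold=8192, link_free_gbps=10.0, min_free_gbps=1.0):
    """Decide where to run prefill for one request.

    Hypothetical policy: only long, cache-missing work goes to the remote
    prefill cluster, and only when the cross-data-center link has headroom.
    """
    uncached = req.prompt_tokens - req.cached_prefix_tokens

    # Short requests, or long ones mostly served from cache, stay local.
    if uncached < length_threshold:
        return "local"

    # Bandwidth awareness: do not offload if the link is nearly saturated.
    if link_free_gbps < min_free_gbps:
        return "local"

    return "remote_prefill"


print(route(Request(prompt_tokens=32_000, cached_prefix_tokens=0)))
print(route(Request(prompt_tokens=32_000, cached_prefix_tokens=30_000)))
print(route(Request(prompt_tokens=32_000, cached_prefix_tokens=0), link_free_gbps=0.5))
```

A real scheduler would presumably also weigh queue depth and the time to ship the KV cache back; this sketch keeps only the two signals the article names, request length and available bandwidth.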

(Source: BlockBeats)

YieldNotYell

· 2h ago

The length threshold routing design is very detailed; separating long and short requests for processing is the proper optimization.

CircuitDaydreamer

· 5h ago

In-depth reading of papers on hybrid attention models reducing KV cache throughput, technical details, and more

AirdropCartographer

· 6h ago

A 54% improvement is indeed attractive, but how do you handle jitter when crossing data centers over Ethernet?

DeepSeaColdStart

· 6h ago

Only requests that miss the cache get routed out, so cache hit rate becomes the critical bottleneck.

UnderTheGlassDome

· 6h ago

Homogeneous PD vs Heterogeneous PD vs PrfaaS, this comparison dimension is set quite cleverly.

BluePeonyCalmingAgent

· 6h ago

Testing on a 1T-parameter model: the hardware cost is unthinkable.

GateUser-fb035825

· 6h ago

Deploying the prefill cluster independently adds yet more operational complexity; is the benefit worth it?

IdleFishDaoMember

· 6h ago

Bandwidth-aware scheduling sounds simple, but implementing it in practice will likely hit plenty of pitfalls.

GateUser-aa277334

· 6h ago

This idea is interesting: offload prefill to a remote cluster and focus on decoding locally. Can the latency hold up?

Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Increased by 54%
