ME News Report, April 18 (UTC+8): according to Beating Monitoring from Dongcha, Moonshot AI and Tsinghua University published a new paper on arXiv on April 16 titled "Prefill-as-a-Service" (PrfaaS), proposing to run the prefill stage of large model inference across data centers.
Large model inference consists of two steps: prefill, which reads the input once and generates a KV cache; and decode, which outputs results token by token based on this cache. The two steps have very different hardware requirements: prefill demands high compute, while decode demands high memory bandwidth. The mainstream industry approach is to split the two steps onto different machines (prefill-decode separation, or PD separation), but this requires RDMA interconnection within the same data center, because a dense attention model produces KV cache at tens of Gbps, and the GPU stalls if the transfer falls behind.
The breakthrough comes from a new generation of hybrid attention models. The paper's experiments show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T combine a few full attention layers with many linear layers, reducing KV cache throughput by about an order of magnitude, with Ring-2.5-1T achieving a total compression ratio of 36x. At that point, KV cache transfer can move from a dedicated RDMA network to ordinary Ethernet.
PrfaaS sets up an independent "prefill cluster" and routes only long-context, prefix-cache-miss requests to it, while short requests stay in the local PD cluster; after prefill, the KV cache is sent back over Ethernet to the local cluster for decoding. This is complemented by length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool.
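To see why a dense attention KV cache needs RDMA while a hybrid model's cache might fit on Ethernet, a rough back-of-envelope estimate helps. The sketch below is illustrative only: the layer counts, head sizes, and prefill rate are hypothetical values chosen for the arithmetic, not figures from the paper, and it assumes that only full attention layers contribute per-token KV entries (the fixed-size state of the other layers is ignored).

```python
def kv_bytes_per_token(n_kv_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache produced per input token.

    The factor 2 covers one key vector and one value vector
    per KV head in each layer that keeps a per-token cache.
    """
    return 2 * n_kv_layers * n_kv_heads * head_dim * bytes_per_elem


def transfer_gbps(bytes_per_token: int, prefill_tokens_per_s: float) -> float:
    """Network rate (gigabits per second) needed to ship KV cache as fast as it is produced."""
    return bytes_per_token * prefill_tokens_per_s * 8 / 1e9


# Hypothetical dense model: all 48 layers keep a per-token KV cache.
dense = kv_bytes_per_token(n_kv_layers=48, n_kv_heads=8, head_dim=128)
# Hypothetical hybrid model: only 12 of the 48 layers are full attention.
hybrid = kv_bytes_per_token(n_kv_layers=12, n_kv_heads=8, head_dim=128)

rate = 20_000  # assumed prefill speed in tokens per second
print(f"dense:  {transfer_gbps(dense, rate):.2f} Gbps")   # dense:  31.46 Gbps
print(f"hybrid: {transfer_gbps(hybrid, rate):.2f} Gbps")  # hybrid: 7.86 Gbps
```

Under these made-up numbers the dense model lands in the tens-of-Gbps range described above, while cutting the number of caching layers shrinks the required rate proportionally. The 36x figure reported for Ring-2.5-1T reflects that model's own design and is not reproduced by this toy calculation.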
The paper reports experimental results using an internal 1T-parameter hybrid model (based on the Kimi Linear architecture): overall serving throughput is 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, while each machine uses only moderate cross-data-center bandwidth. (Source: BlockBeats)
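The routing idea in the report (long, cache-miss requests go to the remote prefill cluster, short ones stay local, and the scheduler watches the link) can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the `Request` fields, the `route` function, the 8192-token threshold, and the 0.8 utilization cap are all hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total tokens in the prompt
    cached_prefix_tokens: int  # tokens already covered by the local prefix cache


def route(req: Request,
          length_threshold: int = 8192,    # hypothetical cutoff, in uncached tokens
          link_utilization: float = 0.0,   # current share of cross-data-center bandwidth in use
          max_utilization: float = 0.8) -> str:
    """Return "remote_prefill" or "local" for a single request."""
    # Only the part of the prompt that misses the prefix cache needs prefill.
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Short (or mostly cached) requests stay in the local PD cluster.
    if uncached < length_threshold:
        return "local"
    # Bandwidth awareness: do not offload when the link is near saturation.
    if link_utilization >= max_utilization:
        return "local"
    return "remote_prefill"


print(route(Request(prompt_tokens=1_000, cached_prefix_tokens=0)))         # local
print(route(Request(prompt_tokens=100_000, cached_prefix_tokens=0)))       # remote_prefill
print(route(Request(prompt_tokens=100_000, cached_prefix_tokens=99_000)))  # local
```

A real scheduler would presumably also weigh queue depth and the cost of sending the KV cache back, which this sketch leaves out.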


Comment


VolatilityOfToastingBread

· 5h ago

Keeping short requests in the local PD cluster is very reasonable; they are latency-sensitive, after all, and only long contexts are worth the hassle.


DustCollector

· 6h ago

A 32% gain over the naive heterogeneous scheme and 54% over homogeneous PD; the baseline setup is quite solid.


Glass-HeartMarketMaker

· 7h ago

Tsinghua + Moonshot AI: China's homegrown large-model infrastructure is now competing for the global top tier.


StainedGlassSolarArray

· 7h ago

This move by Moonshot AI is rather interesting: offload prefill and focus on decoding locally, a win-win on both latency and cost.


MirrorBallReflection

· 7h ago

The hybrid attention model is the core. If the KV cache can be transmitted over Ethernet, how brutal is the compression ratio?


PineNeedlesAndColdWind

· 7h ago

Bandwidth-aware scheduling sounds simple, but in practice, it's full of pitfalls; they actually managed to implement it.


GoldfishUnderTheIce

· 7h ago

Successfully running a 1-trillion-parameter model indicates that scalability is not an issue for this architecture; it's not just a small-scale effort.


MarginMoth

· 7h ago

The name PrfaaS is short for Prefill-as-a-Service: it takes the cloud computing playbook and applies it to large model inference.


GateUser-78acf617

· 7h ago

A 54% throughput increase—this data looks great; heterogeneous architecture is finally not just talk.



Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Increased by 54%
