ME News Report, April 18 (UTC+8): according to monitoring by Dongcha Beating, Moonshot AI and Tsinghua University posted a paper titled "Prefill-as-a-Service" on arXiv on April 16, proposing to run the prefill stage of large-model inference across data centers.

Large-model inference has two stages: prefill, which reads the entire input at once and produces a KV cache, and decode, which generates output token by token from that cache. The two stages stress hardware differently: prefill is compute-bound, while decode is bound by GPU memory bandwidth. The mainstream industry approach is to place the two stages on different machines (PD separation), but this has required RDMA interconnects inside a single data center, because a dense-attention model produces KV cache at tens of Gbps, and GPUs sit idle whenever the transfer falls behind.

The opening comes from a new generation of hybrid attention models. The paper's experiments show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T, which combine a small number of full attention layers with many linear-complexity layers, cut KV cache throughput by about an order of magnitude, with Ring-2.5-1T reaching a total compression ratio of 36x. At that rate, the KV cache no longer needs a dedicated RDMA network and can travel over ordinary Ethernet.

PrfaaS sets up a standalone "prefill cluster" that receives only requests with long contexts or prefixes that miss the cache, while short requests stay on the local PD cluster; once prefill finishes, the KV cache is sent back over Ethernet to the local cluster for decoding. The design also introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool.
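To make the routing idea concrete, here is a minimal sketch of a length-threshold router. It is one plausible reading of the description, not the paper's implementation: the threshold value, the `Request` fields, the `link_has_headroom` flag, and the `route_request` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold; the report does not state the value used in the paper.
LENGTH_THRESHOLD_TOKENS = 8192


@dataclass
class Request:
    prompt_tokens: int         # total prompt length in tokens
    cached_prefix_tokens: int  # tokens already covered by the local prefix cache


def route_request(req: Request, link_has_headroom: bool = True) -> str:
    """Return "remote_prefill" or "local_pd" for a single request."""
    # Only the part of the prompt that misses the prefix cache needs prefill.
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Offload only long uncached prompts, and only when the cross-data-center
    # link can absorb the returning KV cache (a stand-in for bandwidth awareness).
    if uncached >= LENGTH_THRESHOLD_TOKENS and link_has_headroom:
        return "remote_prefill"
    return "local_pd"


if __name__ == "__main__":
    print(route_request(Request(prompt_tokens=32000, cached_prefix_tokens=0)))
    print(route_request(Request(prompt_tokens=1000, cached_prefix_tokens=0)))
```

In this sketch a long prompt whose prefix is mostly cached stays local, since little prefill work remains to be offloaded.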
The paper reports results on an internal 1T-parameter hybrid model (based on the Kimi Linear architecture): overall serving throughput is 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, while each machine uses only modest cross-data-center bandwidth. (Source: BlockBeats)
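The bandwidth argument can be checked with back-of-the-envelope arithmetic: the link rate needed to ship KV cache is the KV bytes produced per token times the prefill token rate. The sketch below uses made-up input numbers purely for illustration; only the 36x ratio comes from the report (where it is quoted for Ring-2.5-1T), and the function name `kv_transfer_gbps` is hypothetical.

```python
def kv_transfer_gbps(kv_bytes_per_token: float, tokens_per_second: float) -> float:
    """Link rate in Gbps needed to move KV cache as fast as prefill produces it."""
    bits_per_second = kv_bytes_per_token * tokens_per_second * 8
    return bits_per_second / 1e9


if __name__ == "__main__":
    # Hypothetical dense-attention figures, NOT taken from the paper.
    dense_kv_bytes_per_token = 70_000
    prefill_tokens_per_second = 50_000

    dense = kv_transfer_gbps(dense_kv_bytes_per_token, prefill_tokens_per_second)
    # Apply the 36x total compression ratio the report quotes for Ring-2.5-1T.
    hybrid = dense / 36

    print(f"dense:  {dense:.1f} Gbps")
    print(f"hybrid: {hybrid:.2f} Gbps")
```

Under these assumed inputs the dense case needs tens of Gbps, while the compressed case falls below 1 Gbps, which is the kind of gap that moves the transfer from RDMA territory to ordinary Ethernet.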


Comment


DewdropSapling

· 28m ago

So the name is PrfaaS. Will there be a Decode-as-a-Service next?


InstantNoodle-LevelResearcher

· 1h ago

Tsinghua + Moonshot AI: domestic LLM infrastructure is starting to compete in a new direction


LateBlockLarry

· 1h ago

A 54% gain looks attractive, but real deployments also have to account for multi-tenant isolation and fault recovery.


MempoolMaggie

· 1h ago

Would the bandwidth cost of sending the KV cache over Ethernet end up higher than the cost of the compute?


MintLiquidationWarning

· 2h ago

Routing only long-context requests and prefix-cache misses to the remote cluster while short requests stay local: this tiered routing strategy is quite practical.


GateUser-2100b43b

· 2h ago

The hybrid attention model reduces KV cache throughput; this idea reminds me of some tricks from early distributed training.



Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Up 54%
