Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Up 54%

ME News: On April 18 (UTC+8), according to monitoring by Beating, Moonshot AI and Tsinghua University posted a new paper, “Prefill-as-a-Service,” to arXiv on April 16. It proposes running the prefill stage of large-model inference across data centers. Large-model inference has two steps: prefill reads the entire input at once and produces a KV cache; decode then generates the output token by token from that cache. The two steps stress hardware differently: prefill is compute-bound, while decode is bound by memory bandwidth. The mainstream industry approach is to place the two steps on different machines (prefill-decode separation, or PD separation), but this requires RDMA interconnect within a single data center, because a dense-attention model produces KV cache at tens of Gbps; if the transfer falls behind, the GPUs sit idle.

The turning point is a new generation of hybrid-attention models. The paper’s experiments show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T cut KV cache throughput by roughly an order of magnitude by combining a small number of full-attention layers with a large number of linear layers; Ring-2.5-1T reaches an overall compression ratio of 36x. At that point, the KV cache can move off the dedicated RDMA network and travel over ordinary Ethernet.
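To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (layer counts, KV heads, head dimension, prefill rate) is a hypothetical placeholder rather than a figure from the paper, and the helper names are invented for illustration. It assumes only full-attention layers contribute per-token KV cache and ignores the fixed-size state kept by linear layers.

```python
def kv_bytes_per_token(full_attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache size: one K and one V vector per full-attention layer."""
    return 2 * full_attn_layers * kv_heads * head_dim * bytes_per_elem


def required_gbps(bytes_per_token, prefill_tokens_per_sec):
    """Link rate needed to ship KV cache as fast as prefill produces it."""
    return bytes_per_token * prefill_tokens_per_sec * 8 / 1e9


# Hypothetical dense model: all 64 layers use full attention.
dense = kv_bytes_per_token(full_attn_layers=64, kv_heads=8, head_dim=128)
# Hypothetical hybrid model: only 16 of 64 layers use full attention.
hybrid = kv_bytes_per_token(full_attn_layers=16, kv_heads=8, head_dim=128)

rate = 20_000  # assumed prefill throughput in tokens per second
print(f"dense:  {required_gbps(dense, rate):.1f} Gbps")
print(f"hybrid: {required_gbps(hybrid, rate):.1f} Gbps")
print(f"reduction: {dense / hybrid:.0f}x")
```

With these placeholder values the reduction comes only from having fewer full-attention layers; the larger ratios quoted above would also depend on model-specific details that this sketch does not model.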

PrfaaS sets up a standalone “prefill cluster” and routes to it only long-context requests whose prefixes miss the cache; short requests stay in the local PD cluster. Once prefill completes, the KV cache is sent back over Ethernet to the local cluster for decoding. The system also adds length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool. In one set of experiments on an internal 1T-parameter hybrid model (based on the Kimi Linear architecture), the paper reports overall serving throughput 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, while each machine uses only a moderate amount of cross-data-center bandwidth.
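The routing idea can be illustrated with a minimal Python sketch. This is not the paper's scheduler: the class names, the `route` function, the threshold, and the byte budget are all assumptions made up for this example. It shows only the general shape of combining a length threshold with a bandwidth check.

```python
from dataclasses import dataclass


@dataclass
class Request:
    uncached_tokens: int  # prompt tokens not covered by any cached prefix


@dataclass
class CrossDCLink:
    budget_bytes: int        # hypothetical cap on KV bytes in flight
    inflight_bytes: int = 0  # KV bytes currently being sent back


def route(req, link, length_threshold, kv_bytes_per_token):
    """Return "remote" to use the prefill cluster, else "local"."""
    # Length-threshold routing: short or mostly cached requests stay local.
    if req.uncached_tokens < length_threshold:
        return "local"
    # Bandwidth-aware check: offload only if the KV cache fits the link budget.
    kv_bytes = req.uncached_tokens * kv_bytes_per_token
    if link.inflight_bytes + kv_bytes > link.budget_bytes:
        return "local"
    link.inflight_bytes += kv_bytes
    return "remote"


link = CrossDCLink(budget_bytes=4 * 10**9)
print(route(Request(2_000), link, 32_000, 65_536))   # short request
print(route(Request(50_000), link, 32_000, 65_536))  # long request, link has room
print(route(Request(50_000), link, 32_000, 65_536))  # long request, link is busy
```

A real scheduler would also release in-flight bytes when a transfer finishes and would account for prefix cache placement; both are left out here to keep the sketch short.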

(Source: BlockBeats)

YieldNotYell
· 2h ago
The length-threshold routing design is well thought out; handling long and short requests separately is the right optimization.
CircuitDaydreamer
· 5h ago
Worth a close read: the paper covers how hybrid-attention models reduce KV cache throughput, along with other technical details.
AirdropCartographer
· 6h ago
A 54% improvement is indeed attractive, but how do you handle jitter when crossing data centers over Ethernet?
DeepSeaColdStart
· 6h ago
Only requests that miss the cache get routed out, so the cache hit rate becomes a critical bottleneck.
UnderTheGlassDome
· 6h ago
Homogeneous PD vs. heterogeneous PD vs. PrfaaS: the comparison is set up quite cleverly.
BluePeonyCalmingAgent
· 6h ago
Testing on a 1T-parameter model; the hardware cost must be staggering.
GateUser-fb035825
· 6h ago
Deploying the prefill cluster independently adds yet more operational complexity; is the benefit worth it?
IdleFishDaoMember
· 6h ago
Bandwidth-aware scheduling sounds simple, but implementing it in practice will likely run into plenty of pitfalls.
GateUser-aa277334
· 6h ago
This idea is interesting: hand prefill off to a remote cluster and focus on decoding locally. Can the latency hold up?