Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Up 54%

ME News report: On April 18 (UTC+8), according to Dongcha Beating monitoring, Moonshot AI and Tsinghua University published a paper on arXiv on April 16 titled “Prefill-as-a-Service” (PrfaaS), proposing to run the prefill stage of large-model inference across data centers. Large-model inference has two stages: prefill reads the entire input at once and generates a KV cache; decode then emits the output token by token from that cache. The two stages stress hardware very differently: prefill is bound by compute, while decode is bound by GPU memory bandwidth. The industry's mainstream approach places the two stages on different machines (prefill-decode disaggregation, or PD disaggregation), but this requires both sides to be connected by RDMA within the same data center, because dense-attention models produce KV cache at tens of Gbps, and if the transfer lags, GPUs sit idle.
The turning point comes from a new generation of hybrid-attention models. The paper's tests show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T combine a small number of full-attention layers with a large number of linear-complexity layers, cutting KV cache throughput by about an order of magnitude; for Ring-2.5-1T the overall reduction reaches 36x. At that point, the KV cache can move from a dedicated RDMA network to ordinary Ethernet.
PrfaaS sets up an independent “prefill cluster” and routes only long-context requests and prefix-cache-miss requests to it, keeping short requests in the local PD cluster; once prefill completes, the KV cache is sent back to the local cluster over Ethernet for decoding. The design also introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool.
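The routing rule described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Request` type, the `route` function, the 8192-token default threshold, and the choice to compare the threshold against the uncached part of the prompt are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length in tokens
    cached_prefix_tokens: int  # tokens already covered by the local prefix cache


def route(req: Request, length_threshold: int = 8192) -> str:
    """Pick a prefill location for one request (hypothetical policy).

    Assumption: a request goes to the remote prefill cluster only when the
    part of its prompt that misses the local prefix cache is at least
    `length_threshold` tokens long; everything else stays local.
    """
    uncached = max(req.prompt_tokens - req.cached_prefix_tokens, 0)
    if uncached >= length_threshold:
        return "remote_prefill"
    return "local_pd"


# Long prompt, nothing cached: worth shipping to the prefill cluster.
long_miss = route(Request(prompt_tokens=32_000, cached_prefix_tokens=0))
# Long prompt, mostly cached: little left to prefill, so keep it local.
long_hit = route(Request(prompt_tokens=32_000, cached_prefix_tokens=30_000))
# Short prompt: always local.
short = route(Request(prompt_tokens=512, cached_prefix_tokens=0))
```

A real scheduler would presumably also weigh current link utilization, which is where the bandwidth-aware scheduler mentioned above comes in; this sketch covers only the length and cache-miss test.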
In a set of real-world tests on an internal 1T-parameter hybrid model (based on the Kimi Linear architecture), the paper reports overall serving throughput 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, while each machine uses only a modest amount of cross-data-center bandwidth. (Source: BlockBeats)
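The bandwidth argument comes down to simple arithmetic, sketched below under stated assumptions. The function name `kv_egress_gbps`, the prefill rate of 50,000 tokens per second, and the 100,000 bytes of KV cache per token are hypothetical values chosen only to land in the "tens of Gbps" range; the 36x reduction is the figure the article cites for Ring-2.5-1T.

```python
def kv_egress_gbps(tokens_per_second: float,
                   kv_bytes_per_token: float,
                   reduction: float = 1.0) -> float:
    """Estimate the KV cache output rate of one prefill machine in Gbps.

    rate = tokens/s * bytes/token * 8 bits/byte / reduction factor
    """
    if reduction <= 0:
        raise ValueError("reduction must be positive")
    bits_per_second = tokens_per_second * kv_bytes_per_token * 8 / reduction
    return bits_per_second / 1e9


# Hypothetical dense-attention model: about 40 Gbps, RDMA territory.
dense = kv_egress_gbps(50_000, 100_000)
# Same hypothetical workload with a 36x smaller KV cache: about 1.1 Gbps.
hybrid = kv_egress_gbps(50_000, 100_000, reduction=36)
```

Under these made-up inputs, the reduced rate fits comfortably on an ordinary Ethernet link, which is the intuition behind moving KV cache transfer off the RDMA fabric.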
Comment
SlippageSailor
· 2h ago
PrfaaS: whoever picked that name knows how to name things.
ExitLiquidityPoet
· 6h ago
Only long-context cache misses go to the remote cluster, while short requests are handled locally. This routing strategy is quite nuanced.
MetalReliefRoboticArm
· 7h ago
Homogeneous PD vs. naive heterogeneous vs. PrfaaS: the comparison is cleanly designed.
StopMessingAroundWithGasFees.
· 7h ago
Real-world testing on a 1T-parameter model; daring to run something that large shows real confidence.
GateUser-4590f4c6
· 7h ago
Selling prefill as a service: will plug-and-play prefill show up next?
MoonlightDisconnectSwitch
· 7h ago
After reading the entire article, what I most want to know is what the packet loss tolerance is during actual deployment.
GlassDomeRoaming
· 7h ago
Bandwidth-aware scheduling, put simply, is necessity driving invention: when network costs are high, you have to budget carefully.
GlassFishTankArbitrage
· 7h ago
Sending KV cache over Ethernet: I used to think that was crazy, and now it is actually a paper.