ME News report: On April 18 (UTC+8), according to Dongcha Beating Monitoring, Moonshot AI and Tsinghua University published a paper on arXiv on April 16 titled “Prefill-as-a-Service” (PrfaaS), proposing to run the prefill stage of large-model inference across data centers. Large-model inference has two stages: prefill reads the entire input at once and generates a KV cache; decode then produces the output token by token from that cache. The two stages stress different hardware: prefill is bound by compute, while decode is bound by GPU memory bandwidth. The mainstream industry approach is to place the two stages on different machines (prefill-decode, or PD, disaggregation), but this requires both sides to be connected by RDMA within the same data center, because dense-attention models produce KV cache at tens of Gbps, and if the transfer is slow, GPUs sit idle. The turning point comes from a new generation of hybrid-attention models. The paper’s tests show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T combine a small number of full-attention layers with a large number of linear-attention layers, cutting KV cache throughput by about an order of magnitude; for Ring-2.5-1T the overall reduction reaches 36x. At that point, the KV cache can move from a dedicated RDMA network to ordinary Ethernet. PrfaaS sets up an independent “prefill cluster” and routes only long-context requests whose prefix misses the cache there, while short requests stay in the local PD cluster; once prefill completes, the KV cache is sent back to the local cluster over Ethernet for decoding. The design also introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix cache pool.
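The routing rule described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Request` fields, the `LENGTH_THRESHOLD` value, the `remote_has_bandwidth` flag, and the cluster labels are all hypothetical names chosen for illustration, and the report does not say how the real scheduler weighs bandwidth.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total input length in tokens
    cached_prefix_tokens: int  # tokens already covered by the local prefix cache


# Hypothetical cutoff; the report does not state the threshold the paper uses.
LENGTH_THRESHOLD = 8192


def route_prefill(req: Request,
                  threshold: int = LENGTH_THRESHOLD,
                  remote_has_bandwidth: bool = True) -> str:
    """Pick where prefill runs for one request (illustrative only)."""
    # Only the part of the prompt that misses the prefix cache needs prefill.
    uncached = max(req.prompt_tokens - req.cached_prefix_tokens, 0)
    # Long uncached inputs go to the remote prefill cluster, but only if the
    # cross-data-center link can carry the resulting KV cache back
    # (a stand-in for the bandwidth-aware scheduler, reduced to a flag here).
    if uncached >= threshold and remote_has_bandwidth:
        return "remote_prefill"
    # Short requests and cache hits stay in the local PD cluster.
    return "local_pd"
```

Under these assumptions, a 32,000-token prompt with no cached prefix would be sent to the remote prefill cluster, while the same prompt with 31,000 tokens already cached would stay local, since only 1,000 tokens still need prefill.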
In a set of real-world tests with an internal 1T-parameter hybrid model (based on the Kimi Linear architecture), the paper reports overall serving throughput 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, with each machine using only a modest amount of cross-data-center bandwidth. (Source: BlockBeats)
