Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Run Across Data Centers, 1T Model Throughput Up 54%


ME News report, April 18 (UTC+8): according to BlockBeats, Moonshot AI and Tsinghua University published a paper on arXiv on April 16 titled "Prefill-as-a-Service," proposing to run the prefill stage of large-model inference across data centers.

Large-model inference consists of two stages: prefill, which reads the input once and builds a KV cache, and decode, which emits output token by token from that cache. The two stages have very different hardware requirements: prefill is compute-bound, while decode is bound by GPU memory bandwidth. The industry's mainstream approach separates the two stages onto different machines (PD separation), but this requires RDMA interconnects within a single data center, because a dense-attention model emits KV cache at tens of Gbps; if the transfer slows, GPUs sit idle.

The breakthrough comes from a new generation of hybrid-attention models. The paper's experiments show that models such as Kimi Linear, MiMo-V2-Flash, and Ring-2.5-1T combine a few full-attention layers with many linear-attention layers, cutting KV-cache traffic by roughly an order of magnitude; Ring-2.5-1T achieves an overall compression ratio of 36x. At that point, KV-cache transfers can move off the dedicated RDMA network onto standard Ethernet.

Concretely, PrfaaS sets up an independent "prefill cluster" and routes only long-context, prefix-cache-miss requests there, while short requests stay on the local PD cluster; after prefill, the KV cache is shipped back over Ethernet to the local cluster for decoding. The system also introduces length-threshold routing, a bandwidth-aware scheduler, and a hybrid prefix-cache pool.
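The length-threshold routing described above can be sketched as follows. This is a minimal illustration based only on the article's description; the function names, the threshold value, and the exact routing condition (long context combined with a prefix-cache miss) are assumptions, not details from the paper.

```python
# Hypothetical sketch of PrfaaS-style length-threshold routing.
# The threshold and routing rule are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int      # length of the input context
    prefix_cache_hit: bool  # whether the prefix KV cache already exists locally


LENGTH_THRESHOLD = 8192  # assumed cutoff; a real deployment would tune this


def route(req: Request) -> str:
    """Decide which cluster runs the prefill stage.

    Long-context requests whose prefix misses the local cache go to the
    remote prefill cluster (the KV cache is shipped back over Ethernet);
    everything else stays on the local PD cluster to avoid transfer latency.
    """
    if req.prompt_tokens >= LENGTH_THRESHOLD and not req.prefix_cache_hit:
        return "remote_prefill_cluster"
    return "local_pd_cluster"


print(route(Request(prompt_tokens=32768, prefix_cache_hit=False)))  # remote_prefill_cluster
print(route(Request(prompt_tokens=512, prefix_cache_hit=False)))    # local_pd_cluster
```

The point of the threshold is that short prompts finish prefill quickly anyway, so the cross-data-center round trip would cost more than it saves.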
The paper reports experiments on an internal 1T-parameter hybrid model (based on the Kimi Linear architecture): overall serving throughput is 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous scheme, while each machine uses only moderate cross-data-center bandwidth. (Source: BlockBeats)
