Tsinghua + Moonshot AI split the prefill stage out to run on its own, and the KV cache can travel over Ethernet. A 54% throughput increase is genuinely impressive; the cost structure of long-context inference is about to change.

MeNews
Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Up 54%
ME News reports that Moonshot AI and Tsinghua University have proposed Prefill-as-a-Service (PrfaaS) on arXiv, which lets the prefill stage of large-model inference run in a different data center from decoding. Using a hybrid attention model sharply reduces the amount of KV cache that has to be moved, so the cache can be sent over ordinary Ethernet back to the local cluster for decoding. The PrfaaS architecture sets up a standalone prefill cluster and routes only long-context requests that miss the cache to it, while short requests stay on the local prefill-decode (PD) cluster; it also introduces length-threshold routing and bandwidth-aware scheduling. In tests with a 1T-parameter hybrid model, throughput was 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous setup.
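To make the routing idea concrete, here is a minimal sketch of how length-threshold routing and a bandwidth check could fit together. It is not the paper's implementation: the names `Request`, `route`, and `kv_transfer_seconds`, the threshold, the per-token KV size, the link speed, and the time budget are all hypothetical values chosen for illustration, and it assumes only the KV cache for uncached tokens has to cross the link.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int  # total prompt length in tokens
    cached_tokens: int  # prefix tokens already held in the local cache


def kv_transfer_seconds(tokens: int, kv_bytes_per_token: int,
                        link_bytes_per_s: float) -> float:
    """Estimated time to ship the KV cache for `tokens` tokens over the link."""
    return tokens * kv_bytes_per_token / link_bytes_per_s


def route(req: Request, length_threshold: int, kv_bytes_per_token: int,
          link_bytes_per_s: float, transfer_budget_s: float) -> str:
    """Return "remote" to use the separate prefill cluster, else "local"."""
    uncached = req.prompt_tokens - req.cached_tokens
    # Length-threshold routing: short or mostly cached prompts stay local.
    if uncached < length_threshold:
        return "local"
    # Bandwidth check (assumed policy): stay local if the KV cache
    # would take too long to come back over the inter-cluster link.
    eta = kv_transfer_seconds(uncached, kv_bytes_per_token, link_bytes_per_s)
    if eta > transfer_budget_s:
        return "local"
    return "remote"


# Hypothetical numbers: 32k-token threshold, 10 kB of KV per token,
# a 10 Gb/s link (1.25e9 bytes/s), and a 5-second transfer budget.
CONFIG = dict(length_threshold=32_000, kv_bytes_per_token=10_000,
              link_bytes_per_s=1.25e9, transfer_budget_s=5.0)

long_miss = route(Request(200_000, 0), **CONFIG)
short_req = route(Request(2_000, 0), **CONFIG)
long_hit = route(Request(200_000, 190_000), **CONFIG)
print(long_miss, short_req, long_hit)
```

The per-token KV size is the lever the brief describes: the smaller it is, the more long requests fit within the transfer budget of a plain Ethernet link.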