This time, Moonshot AI (the "Dark Side of the Moon") routes prefill to a remote data center and ships the KV cache back over Ethernet. A 54% throughput gain is seriously impressive. At last, long-context requests no longer have to saturate local bandwidth.

MeNews
New Paper from Moonshot AI and Tsinghua: LLM Prefill Can Run Across Data Centers, 1T-Parameter Model Throughput Up 54%
ME News reports that Moonshot AI (the "Dark Side of the Moon") and Tsinghua University have proposed Prefill-as-a-Service (PrfaaS) in a paper on arXiv, allowing the prefill stage of large-model inference to run across data centers. By using a hybrid attention model, the approach significantly reduces the amount of KV cache data that prefill produces, so the cache can be sent over ordinary Ethernet and returned to the local cluster for decoding. The PrfaaS architecture builds an independent prefill cluster and routes to it only long-context requests that miss the cache, while short requests stay on the local prefill-decode (PD) cluster. It also introduces length-threshold routing and bandwidth-aware scheduling. In tests with a 1T-parameter hybrid model, throughput was 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous setup.
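To make the routing idea concrete, here is a minimal sketch of how length-threshold routing and a bandwidth check might combine. It is an illustration only, not the paper's algorithm: the `Request` type, the `route` function, and every numeric default (threshold, link speed, KV size per token, transfer budget) are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int      # total prompt length
    cached_tokens: int = 0  # tokens already covered by a local cache hit


def route(
    req: Request,
    length_threshold: int = 32_000,      # hypothetical cutoff, in uncached tokens
    link_gbps: float = 100.0,            # hypothetical Ethernet link speed
    queued_gb: float = 0.0,              # KV data already waiting on the link
    kv_gb_per_1k_tokens: float = 0.05,   # hypothetical KV size for a hybrid model
    max_transfer_s: float = 2.0,         # hypothetical transfer-time budget
) -> str:
    """Return "remote" to use the remote prefill cluster, else "local"."""
    uncached = req.prompt_tokens - req.cached_tokens
    # Length threshold: short prompts or cache hits stay on the local PD cluster.
    if uncached < length_threshold:
        return "local"
    # Bandwidth awareness: estimate how long the KV cache would take to come
    # back, including data already queued on the link (GB * 8 = gigabits).
    kv_gb = uncached / 1000 * kv_gb_per_1k_tokens
    transfer_s = (queued_gb + kv_gb) * 8 / link_gbps
    return "remote" if transfer_s <= max_transfer_s else "local"
```

With these assumed defaults, a 100,000-token uncached prompt goes remote while the link is idle, but the same prompt falls back to local prefill once enough KV data is queued to exceed the transfer budget.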