Moonshot AI (月之暗面, literally "The Dark Side of the Moon") now ships prefill off to a remote site: the KV cache is small enough to travel over Ethernet, throughput on a 1T model jumps 54%, and heterogeneous scheduling finally works.

MeNews
Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Up 54%
ME News reports that Moonshot AI (月之暗面) and Tsinghua University have posted a paper on arXiv proposing Prefill-as-a-Service (PrfaaS), which lets the prefill stage of large-model inference run in a different data center from decoding. A hybrid attention model greatly reduces the size of the KV cache, so the cache can be sent over Ethernet back to the local cluster for decoding. The PrfaaS architecture builds a standalone prefill cluster and routes to it only long-context requests that miss the cache; short requests stay on the local prefill-decode (PD) cluster. It also introduces length-threshold routing and bandwidth-aware scheduling. In tests with a 1T-parameter hybrid model, throughput is 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous setup.
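To make the routing idea concrete, here is a minimal Python sketch of how length-threshold routing and a bandwidth check could fit together. It is not the paper's algorithm: the names `Request`, `Link`, `route`, and every number used with them are hypothetical, and reading "miss" as a prefix-cache miss is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    cached_prefix_tokens: int = 0  # tokens covered by a cache hit (assumed meaning of "miss")


@dataclass
class Link:
    bandwidth_gbps: float  # usable Ethernet bandwidth between clusters, gigabits per second
    queued_bytes: int = 0  # KV cache bytes already waiting to be sent back


def kv_transfer_seconds(req: Request, link: Link, kv_bytes_per_token: int) -> float:
    """Estimate how long the KV cache for this request takes to reach the decode cluster."""
    payload = req.prompt_tokens * kv_bytes_per_token
    return (link.queued_bytes + payload) * 8 / (link.bandwidth_gbps * 1e9)


def route(req: Request, link: Link, *, length_threshold: int,
          kv_bytes_per_token: int, max_transfer_s: float) -> str:
    """Return "remote" to use the standalone prefill cluster, else "local"."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Length threshold: short prompts, or prompts mostly served from cache, stay local.
    if uncached < length_threshold:
        return "local"
    # Bandwidth awareness: skip the remote cluster when the link is too congested.
    if kv_transfer_seconds(req, link, kv_bytes_per_token) > max_transfer_s:
        return "local"
    return "remote"


if __name__ == "__main__":
    # Illustrative settings only; none of these values come from the paper.
    settings = dict(length_threshold=8_000, kv_bytes_per_token=1_000, max_transfer_s=1.0)
    print(route(Request(2_000), Link(100.0), **settings))
    print(route(Request(100_000), Link(100.0), **settings))
```

The smaller the per-token KV footprint, the more requests pass the bandwidth check, which is why the hybrid attention model matters for sending the cache over Ethernet at all.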