Moonshot AI's Prefill-as-a-Service is a clever design: it offloads the prefill stage to remote servers and keeps only decoding local. This directly halves bandwidth pressure, so long-context workloads finally start to look cost-effective.

MeNews
New Paper from Moonshot AI and Tsinghua: LLM Prefill Can Run Across Data Centers, 1T Model Throughput Up 54%
ME News reports that Moonshot AI (月之暗面, literally "Dark Side of the Moon") and Tsinghua University have proposed Prefill-as-a-Service in a paper on arXiv, which lets the prefill stage of large-model inference run in a different data center from decoding.
Using a hybrid attention model significantly reduces the volume of KV cache that has to be moved, so the cache can be sent over ordinary Ethernet back to the local cluster, where decoding takes place.
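To see why a smaller KV cache matters for cross-data-center transfer, here is a rough back-of-the-envelope sketch. It assumes a standard KV cache size formula (keys plus values, per token, per attention layer) and treats a "hybrid" model as one where only some layers keep a full KV cache. The layer counts, head sizes, and link speed are hypothetical placeholders, not figures from the paper, and the function names are made up for illustration.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: keys + values for every token in every
    layer that keeps a full KV cache (2 bytes per element assumes FP16/BF16)."""
    return 2 * tokens * attn_layers * kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_bytes, link_gbps):
    """Ideal time to push num_bytes over a link of link_gbps gigabits/s,
    ignoring protocol overhead and contention."""
    bytes_per_second = link_gbps * 1e9 / 8
    return num_bytes / bytes_per_second


# Hypothetical 128k-token prompt; all model shapes below are assumptions.
full = kv_cache_bytes(128_000, attn_layers=64, kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(128_000, attn_layers=16, kv_heads=8, head_dim=128)

for name, size in [("full attention", full), ("hybrid (assumed)", hybrid)]:
    secs = transfer_seconds(size, link_gbps=100)
    print(f"{name}: {size / 1e9:.1f} GB, {secs:.2f} s over 100 Gbps")
```

Under these assumptions, keeping a full KV cache in a quarter of the layers cuts the transfer volume, and therefore the ideal transfer time, by the same factor of four. The sketch ignores whatever fixed-size state the non-full-attention layers would carry.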
The PrfaaS architecture sets up an independent prefill cluster that receives only long-context requests that miss the cache, while short requests stay on the local prefill-decode (PD) cluster. It also introduces length-threshold routing and bandwidth-aware scheduling.
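The routing idea described above can be sketched in a few lines. This is only an illustration of length-threshold routing combined with a simple bandwidth check, not the paper's actual scheduler; the `Request` fields, the `route` function, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int          # total prompt length
    cached_prefix_tokens: int   # tokens already covered by the local cache


def route(req, length_threshold, est_kv_bytes, free_bytes_per_s, max_transfer_s):
    """Return "remote" to send prefill to the remote cluster, else "local".

    Assumed policy: only long prompts whose uncached part exceeds the
    threshold are candidates, and they go remote only if the link has
    enough spare bandwidth to return the KV cache in time.
    """
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    if uncached < length_threshold:
        return "local"   # short request or cache hit: stay on local PD
    if free_bytes_per_s <= 0:
        return "local"   # link saturated: do not add more transfers
    if est_kv_bytes / free_bytes_per_s > max_transfer_s:
        return "local"   # KV cache would take too long to come back
    return "remote"


# Hypothetical numbers: 32k-token threshold, 8 GB KV cache, 10 GB/s spare link.
long_miss = Request(prompt_tokens=128_000, cached_prefix_tokens=0)
short_req = Request(prompt_tokens=2_000, cached_prefix_tokens=0)
print(route(long_miss, 32_000, 8e9, 10e9, max_transfer_s=2.0))
print(route(short_req, 32_000, 8e9, 10e9, max_transfer_s=2.0))
```

A real scheduler would presumably also weigh queue depth and compute load on both clusters; the sketch only shows how a length threshold and a bandwidth budget can combine into a single routing decision.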
In tests with a 1T-parameter hybrid model, throughput was 54% higher than a homogeneous PD deployment and 32% higher than a naive heterogeneous setup.