PrfaaS is an interesting architecture: only long-context requests that miss the cache are routed to an independent prefill cluster, short requests are handled by the local PD (prefill-decode) cluster, and bandwidth-aware scheduling avoids unnecessary congestion.

MeNews
Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Increased by 54%
ME News reports that Moonshot AI and Tsinghua University have proposed Prefill-as-a-Service (PrfaaS) on arXiv, enabling the prefill stage of large-model inference to run across data centers.
Using a hybrid attention model significantly reduces the volume of KV cache that must be moved, so the cache can be transmitted over Ethernet and returned to the local cluster for decoding.
The PrfaaS architecture builds an independent prefill cluster and routes only long-context requests that miss the cache to it, while short requests stay on the local PD cluster. It also introduces length-threshold routing and bandwidth-aware scheduling.
In tests with a 1T-parameter hybrid model, throughput improved by 54% over a homogeneous PD deployment and by 32% over a naive heterogeneous setup.
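The routing idea described above (length-threshold routing plus bandwidth-aware scheduling) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `Request`, `LinkState`, and `route`, the 32,000-token threshold, the per-token KV size, and the transfer deadline are all hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # total prompt length in tokens
    cached_tokens: int   # tokens already covered by a local cache hit


@dataclass
class LinkState:
    capacity_gbps: float   # usable cross-datacenter Ethernet bandwidth (assumed)
    in_flight_gbps: float  # bandwidth already committed to KV cache transfers


def route(
    req: Request,
    link: LinkState,
    length_threshold: int = 32_000,     # hypothetical routing threshold
    kv_bytes_per_token: int = 20_000,   # hypothetical KV cache size per token
    transfer_deadline_s: float = 2.0,   # hypothetical time budget for the transfer
) -> str:
    """Return "remote_prefill" or "local_pd" for one request."""
    uncached = req.prompt_tokens - req.cached_tokens

    # Length-threshold routing: short or mostly cached requests stay local.
    if uncached < length_threshold:
        return "local_pd"

    # Bandwidth-aware check: estimate the bandwidth needed to ship the
    # resulting KV cache back within the deadline.
    needed_gbps = uncached * kv_bytes_per_token * 8 / transfer_deadline_s / 1e9
    if link.in_flight_gbps + needed_gbps > link.capacity_gbps:
        return "local_pd"  # the link would be congested, so fall back to local

    return "remote_prefill"


if __name__ == "__main__":
    idle_link = LinkState(capacity_gbps=100.0, in_flight_gbps=0.0)
    print(route(Request(prompt_tokens=1_000, cached_tokens=0), idle_link))
    print(route(Request(prompt_tokens=100_000, cached_tokens=0), idle_link))
```

In this sketch a request falls back to the local PD cluster whenever the cross-datacenter link lacks headroom. A real scheduler would presumably also weigh queueing delay and prefill cluster load, which the news item does not detail.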