Tsinghua plus Moonshot AI: this combination is kind of interesting. Hand the prefill off to a remote data center, and the shackles of RDMA are finally loosened.

MeNews
Moonshot AI and Tsinghua's New Paper: LLM Prefill Can Cross Data Centers, 1T Model Throughput Increased by 54%
ME News, April 18 (UTC+8): According to Beating Monitoring, Moonshot AI and Tsinghua University posted a new paper on arXiv on April 16 titled “Prefill-as-a-Service,” proposing to run the prefill stage of large model inference across data centers. Large model inference has two steps: prefill reads the whole input in one pass and generates a KV cache; decode then produces the output token by token from that cache. The two steps have very different hardware requirements: prefill is compute-bound, while decode is limited by GPU memory bandwidth. The industry’s mainstream approach is to run the two steps on separate machines (prefill-decode disaggregation, also called PD separation), but this requires both sides to be interconnected within the same data center via RDMA because of dense
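
To make the prefill/decode split described above concrete, here is a toy sketch in pure Python. It is not the paper's method and not any real inference engine: the embedding, the single attention head, the vocabulary size, and the function names (`prefill`, `decode_step`, `generate`) are all hypothetical, chosen only to show that prefill builds the KV cache for the whole prompt in one pass while decode extends it one token at a time.

```python
import math

VOCAB = 50  # hypothetical toy vocabulary size
DIM = 4     # hypothetical toy hidden size


def embed(token_id, offset=0):
    """Toy deterministic embedding (assumption, stands in for real projections)."""
    return [math.sin((token_id + offset) * (i + 1)) for i in range(DIM)]


def attend(query, keys, values):
    """Single-head scaled dot-product attention over the cached keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(DIM)]


def to_token(context):
    """Toy stand-in for the output layer plus sampling."""
    return int(abs(sum(context)) * 1000) % VOCAB


def prefill(prompt_ids):
    """Process the whole prompt in one pass; return the KV cache and first token."""
    cache = {
        "keys": [embed(t) for t in prompt_ids],
        "values": [embed(t, offset=1) for t in prompt_ids],
    }
    context = attend(embed(prompt_ids[-1]), cache["keys"], cache["values"])
    return cache, to_token(context)


def decode_step(cache, token):
    """Append one token's key/value to the cache, then predict the next token."""
    cache["keys"].append(embed(token))
    cache["values"].append(embed(token, offset=1))
    context = attend(embed(token), cache["keys"], cache["values"])
    return to_token(context)


def generate(prompt_ids, max_new_tokens):
    # In a disaggregated setup, prefill() would run on one machine and the
    # returned cache would be shipped to the machine running decode_step().
    cache, token = prefill(prompt_ids)
    output = [token]
    while len(output) < max_new_tokens:
        token = decode_step(cache, token)
        output.append(token)
    return output, cache


if __name__ == "__main__":
    tokens, cache = generate([3, 7, 11], max_new_tokens=4)
    print(tokens, len(cache["keys"]))
```

In this sketch the only state passed from `prefill` to the decode loop is the cache (plus the first token), which is the object a disaggregated deployment has to move between machines.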