From research paper to thousand-GPU production deployment: this pace doesn't look like academic work.

CoinNetwork
CoinWorld News: Zhipu, in collaboration with Yuxun Network and Tsinghua University, has proposed ZCube, a next-generation network architecture for large-model inference. It targets the increasingly severe structural network congestion that arises when large models are deployed with prefill and decode separated (PD separation). ZCube is already running in the thousand-GPU online production environment for GLM-5.1 coding. The architecture eliminates spine-layer switches, adopts a fully flattened topology with a 2-hop network diameter, and combines single-rail and multi-rail hybrid access to balance inter-node traffic across switches. In benchmark tests against a traditional architecture, ZCube reduces switch and optical-module hardware costs by 33%, raises average GPU inference throughput by 15%, and cuts the 99th-percentile time to first token (TTFT) by 40.6%.
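To make the "2-hop network diameter" claim concrete, here is a toy Python sketch comparing a small leaf-spine fabric with a flattened one. It is only an illustration: the host and switch names, the wiring, and the choice to count a hop as one link traversed are all assumptions, and none of it reflects ZCube's actual topology, which the report does not describe in detail.

```python
from collections import deque
from itertools import combinations


def build(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def hops(adj, src, dst):
    """Breadth-first search: number of links on a shortest path from src to dst."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError(f"{dst} is unreachable from {src}")


def host_diameter(adj, hosts):
    """Longest shortest path, in links, over all pairs of hosts."""
    return max(hops(adj, a, b) for a, b in combinations(hosts, 2))


# Hypothetical four-host example; names and wiring are illustrative only.
HOSTS = ["h0", "h1", "h2", "h3"]

# Traditional two-tier fabric: hosts attach to a leaf, leaves attach to a spine.
leaf_spine = build([
    ("h0", "leaf0"), ("h1", "leaf0"),
    ("h2", "leaf1"), ("h3", "leaf1"),
    ("leaf0", "spine0"), ("leaf1", "spine0"),
])

# Flattened fabric with no spine: every host attaches to every switch,
# so any two hosts share a switch (host -> switch -> host).
flat = build([(h, sw) for h in HOSTS for sw in ("sw0", "sw1")])

print("leaf-spine host diameter:", host_diameter(leaf_spine, HOSTS))
print("flat host diameter:", host_diameter(flat, HOSTS))
```

In this toy model, traffic between hosts under different leaves has to cross the spine, while in the flattened layout every pair of hosts is reachable through a single shared switch, which is one way a spine-free design can keep the diameter at two links.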