OpenAI open-sources its self-developed supercomputing network protocol MRC: 100,000 GPUs need only two layers of switches, and fault recovery drops from seconds to microseconds

CryptoWorld News: OpenAI has open-sourced MRC (Multipath Reliable Connection), a network protocol it developed in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA. MRC connects 100,000 GPUs using only two layers of switches and cuts failure recovery time from seconds to microseconds. The protocol is already built into the latest 800 Gb/s network interface cards and has been released through OCP (Open Compute Project). It is now deployed on all of OpenAI's largest NVIDIA GB200 supercomputers, including the cluster in Abilene, Texas, co-built with Oracle, and Microsoft's Fairwater supercomputer. The core change in MRC is that a single transmission is split across hundreds of paths and sent simultaneously, which avoids the GPU idle time that transmission latency causes in traditional supercomputing networks.
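
To illustrate the multipath idea described above, here is a minimal Python sketch. It is not MRC's actual algorithm or wire format, which this report does not describe. It assumes a simple round-robin split of one message across several paths, and all function names (spray, deliver, retransmit, reassemble) are hypothetical. It shows how chunks lost on a failed path can be resent over the remaining healthy paths without restarting the whole transfer.

```python
def spray(message: bytes, paths: list[int], chunk_size: int = 4):
    """Split a message into chunks and assign them round-robin to paths.

    Returns a list of (sequence_number, path_id, chunk) tuples.
    Round-robin is an assumption for illustration only.
    """
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    return [(seq, paths[seq % len(paths)], chunk) for seq, chunk in enumerate(chunks)]


def deliver(packets, failed_paths=frozenset()):
    """Simulate delivery: packets on a failed path are lost.

    Returns (received, lost) where received maps seq -> chunk and
    lost is the list of sequence numbers that did not arrive.
    """
    received = {seq: chunk for seq, path, chunk in packets if path not in failed_paths}
    lost = [seq for seq, path, _ in packets if path in failed_paths]
    return received, lost


def retransmit(packets, lost, healthy_paths):
    """Re-send only the lost chunks, spread over the healthy paths."""
    by_seq = {seq: chunk for seq, _, chunk in packets}
    return [(seq, healthy_paths[i % len(healthy_paths)], by_seq[seq])
            for i, seq in enumerate(lost)]


def reassemble(received):
    """Rebuild the message by ordering chunks by sequence number."""
    return b"".join(received[seq] for seq in sorted(received))


if __name__ == "__main__":
    message = b"gradient-shard-0123456789"
    paths = list(range(8))                    # 8 hypothetical paths
    packets = spray(message, paths)
    received, lost = deliver(packets, failed_paths={3})   # path 3 fails
    healthy = [p for p in paths if p != 3]
    resent, _ = deliver(retransmit(packets, lost, healthy))
    received.update(resent)
    print(reassemble(received) == message)
```

In this toy model a single path failure costs only the chunks that were in flight on that path, which is the intuition behind recovering without stalling the entire transfer.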
