OpenAI leads AMD, NVIDIA, Intel, Microsoft, and Broadcom in a rare joint effort to tackle the AI network layer

Golden Finance reports that on May 7th, according to Kuai Technology, OpenAI officially released the MRC (Multi-Path Reliable Connection) protocol through the Open Compute Project (OCP), addressing GPU network communication bottlenecks in large-scale AI training. The protocol was jointly developed over two years by OpenAI, AMD, NVIDIA, Intel, Microsoft, and Broadcom, and is now in practical use in supercomputing clusters equipped with NVIDIA GB200.
The core issue MRC addresses is this: during large-scale AI model training, a single data-transfer delay can stall the entire training run, leaving GPUs idle while they wait. As cluster size grows, delays caused by network congestion, link failures, and device failures become more frequent. MRC's solution is to split a single 800Gb/s network interface into multiple smaller links, for example connecting one interface to 8 different switches to build 8 independent 100Gb/s parallel networks, rather than relying on a single 800Gb/s path.
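The multi-path idea described above can be illustrated with a small sketch. This is a hypothetical model of the concept, not OpenAI's actual protocol implementation: one interface stripes traffic round-robin across 8 independent paths, and a failed link only reduces aggregate bandwidth instead of halting transfers.

```python
# Illustrative sketch of the multi-path concept behind MRC (hypothetical
# model, not the real protocol): stripe traffic from one 800 Gb/s interface
# across 8 independent 100 Gb/s paths via different switches, and route
# around failed paths so a single failure degrades bandwidth gracefully.

from dataclasses import dataclass, field

@dataclass
class Path:
    switch_id: int
    capacity_gbps: float = 100.0
    healthy: bool = True

@dataclass
class MultiPathNIC:
    paths: list = field(default_factory=lambda: [Path(i) for i in range(8)])
    _next: int = 0

    def aggregate_bandwidth(self) -> float:
        # Total usable bandwidth across all healthy paths.
        return sum(p.capacity_gbps for p in self.paths if p.healthy)

    def pick_path(self) -> Path:
        # Round-robin scheduler over healthy paths; a single-path NIC
        # would instead stall here on any failure.
        alive = [p for p in self.paths if p.healthy]
        if not alive:
            raise RuntimeError("all paths failed")
        path = alive[self._next % len(alive)]
        self._next += 1
        return path

nic = MultiPathNIC()
print(nic.aggregate_bandwidth())  # 800.0 Gb/s with all 8 paths healthy

nic.paths[3].healthy = False      # simulate one switch/link failure
print(nic.aggregate_bandwidth())  # 700.0 -- traffic continues on 7 paths
print(nic.pick_path().switch_id)  # scheduler transparently skips path 3
```

The key property shown here is graceful degradation: with 8 parallel networks, losing one switch costs 1/8 of the bandwidth rather than idling the whole GPU cluster.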
