Clearing the reinforcement learning barrier for trillion-parameter-scale models: open-source prime-rl trains 131k contexts on 28 servers

ME AI News: according to Data Observation Beating Monitoring, Prime Intellect has released version 0.6.0 of prime-rl, its distributed reinforcement learning (RL) training framework, clearing the barrier to RL training of trillion-parameter-scale mixture-of-experts (MoE) models on long-context agent tasks. It is no longer unusual for a large model to read 256k tokens of text. In RL training, however, the model learns to reason through autonomous trial and error, and the GPUs must hold the intermediate activations of 131k-token sequences throughout the process, which multiplies GPU memory consumption by hundreds or even thousands of times. This previously required a cluster of thousands of GPUs; prime-rl 0.6.0 has run RL training of GLM-5 at 131k context on only 28 H200 servers, with step time kept within 5 minutes.
In complex code generation and other trial-and-error tasks, rare long-tail tasks that take a long time can hold up the whole training loop. To address this, the framework drops the traditional synchronous wait and adopts a fully decoupled asynchronous RL architecture. Once the background trainer has computed new weights, it does not wait for in-progress tasks to finish; it pushes the update in real time, while text generation is still under way. Tasks that have already been dispatched continue on the old policy to preserve speed, while new tasks inject a KV-cache salt to force the cache to be rebuilt. To keep asynchronous updates from destabilizing the model through a mismatch between the training and inference sides, the framework introduces Routing Replay (R3), which handles expert dispatch data directly at a low level and so avoids the system latency that data transformation would add. This cuts the mismatch between the two sides to one-tenth of its previous level and makes asynchronous training far more stable.
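The KV-cache salt idea described above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not prime-rl's actual implementation: the class names `PrefixCache` and `InferenceServer`, the `policy-v<N>` salt format, and the hash-based cache key are all hypothetical. It shows that when the salt is part of the cache key, a weight update makes new requests miss the cache (forcing a rebuild), while entries built under the old policy remain untouched for tasks already in flight.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache whose key mixes a salt with the prompt tokens (hypothetical)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(salt, tokens):
        payload = salt + "|" + ",".join(map(str, tokens))
        return hashlib.sha256(payload.encode()).hexdigest()

    def lookup_or_build(self, salt, tokens):
        key = self._key(salt, tokens)
        if key in self._store:
            self.hits += 1
        else:
            # Cache miss: stand-in for recomputing the KV cache with current weights.
            self.misses += 1
            self._store[key] = f"kv-for-{salt}"
        return self._store[key]


class InferenceServer:
    """Toy inference server that tags each new task with the current policy version."""

    def __init__(self):
        self.policy_version = 0
        self.cache = PrefixCache()

    def update_weights(self):
        # The trainer pushes new weights; tasks already running are not interrupted.
        self.policy_version += 1

    def start_task(self, prompt_tokens):
        # Assumption: the salt is derived from the policy version.
        salt = f"policy-v{self.policy_version}"
        kv = self.cache.lookup_or_build(salt, prompt_tokens)
        return {"salt": salt, "kv": kv}


server = InferenceServer()
prompt = [1, 2, 3, 4]

old_task = server.start_task(prompt)     # miss: cache built under policy v0
repeat_task = server.start_task(prompt)  # hit: same salt, same prompt
server.update_weights()                  # new weights arrive mid-generation
new_task = server.start_task(prompt)     # miss: new salt forces a rebuild

print(server.cache.hits, server.cache.misses)
print(old_task["kv"], new_task["kv"])
```

The point of salting in this sketch is that no explicit invalidation pass is needed: stale entries simply stop matching once the salt changes.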
On resource optimization, the framework is designed throughout to keep long texts from exhausting GPU memory. The inference side uses a read-write separation architecture, so that reading a long context does not stall the text generation that follows. At the same time, the model's experts are shared across multiple GPUs, and Mooncake is used to pool idle memory and disk space from multiple servers into a shared cache. For parallel computation over ultra-long texts, the framework provides a custom parallel scheme for GLM-5's DSA sparse attention mechanism, which lets the model see the entire context while cutting inter-GPU communication to a single exchange per layer. On the training side, it uses DeepGEMM to implement the block-scaled FP8 training proposed in DeepSeek V3, so that training and inference run at the same precision with the same compute kernels, eliminating at the root the training crashes caused by precision mismatch. (Source: BlockBeats)
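To make the block-scaling idea concrete, here is a rough sketch in plain Python. It is an illustration only and does not use DeepGEMM or reflect prime-rl's code: the function names are hypothetical, the rounding only approximates the FP8 E4M3 format (4 significant bits, maximum finite value 448, subnormals ignored), and the demo uses blocks of 4 values rather than the 128-wide tiles described for DeepSeek V3. Each block gets its own scale, so an outlier in one block does not wipe out the small values in another.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def round_to_e4m3(x):
    """Approximate E4M3 rounding: keep 4 significant bits, clamp to the max.

    Ignores subnormals and the minimum exponent; for illustration only.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, rounded))


def quantize_blockwise(values, block_size=128):
    """Quantize a flat list block by block; return (quantized blocks, per-block scales)."""
    blocks, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the top of the FP8 range.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        blocks.append([round_to_e4m3(v / scale) for v in block])
        scales.append(scale)
    return blocks, scales


def dequantize_blockwise(blocks, scales):
    """Multiply each quantized value by its block's scale."""
    return [q * s for block, s in zip(blocks, scales) for q in block]


# The outlier 1000.0 sits only in the second block of four values.
data = [0.01, -0.02, 0.03, 0.015, 1000.0, 2.0, -3.0, 4.0]
blocks, scales = quantize_blockwise(data, block_size=4)
restored = dequantize_blockwise(blocks, scales)

for original, approx in zip(data, restored):
    print(f"{original:>10.4f} -> {approx:>10.4f}")
```

With a single scale for the whole list, the values in the first block would be divided by a scale set by the outlier and would lose most of their precision; per-block scales keep each block's relative error bounded by the format's rounding step.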