ME AI News: according to monitoring by Data Observation Beating, Prime Intellect has released version 0.6.0 of prime-rl, its distributed reinforcement learning (RL) training framework. The release removes a major barrier to RL training of trillion-parameter mixture-of-experts (MoE) models on long-context agent tasks.
It is no longer unusual for large models to read 256k of context. RL training is harder: for the model to practice reasoning through autonomous trial and error, the GPUs must hold the massive intermediate activations at a length of 131k throughout the process, which drives GPU memory consumption up by hundreds or even thousands of times. This previously required a cluster of thousands of GPUs. With prime-rl 0.6.0, RL training of GLM-5 at 131k context ran on only 28 H200 servers, with step time kept within 5 minutes.
In complex code generation and other trial-and-error tasks, a few rare, long-running tasks in the long tail can hold up the overall pace. To avoid this, the framework drops the traditional synchronous wait and adopts a fully decoupled asynchronous RL architecture. Once the background trainer has computed new weights, it does not wait for in-flight tasks to finish; it pushes the update in real time while text is still being generated. Tasks that have already been dispatched continue under the old policy to preserve speed, while new tasks inject a KV-cache salt that forces the cache to be rebuilt.
To keep the model from being destabilized by the mismatch between training and inference under asynchronous updates, the framework introduces Routing Replay (R3), which handles expert dispatch data directly at a low level and avoids the delays that data transformation would cause. This reduces the mismatch between the two sides to one-tenth and greatly stabilizes asynchronous training.
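The salting idea can be illustrated with a minimal sketch. It is not prime-rl's code: the class `SaltedPrefixCache`, the helper `cache_key`, and the `policy-v{n}` salt format are all hypothetical names chosen for illustration. The sketch assumes that a prefix cache is keyed by a hash of the prompt, and that mixing the policy version into that hash makes every entry written under older weights unreachable for new requests.

```python
import hashlib
from typing import Dict, Sequence, Tuple


def cache_key(token_ids: Sequence[int], salt: str) -> str:
    """Hash a prompt prefix together with a salt (hypothetical helper)."""
    h = hashlib.sha256()
    h.update(salt.encode("utf-8"))
    h.update(b"\x00")  # separator so the salt and the tokens cannot run together
    h.update(",".join(map(str, token_ids)).encode("utf-8"))
    return h.hexdigest()


class SaltedPrefixCache:
    """Toy prefix cache whose keys depend on the current policy version."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.policy_version = 0

    def update_weights(self) -> None:
        # The trainer pushed new weights. In-flight requests are not touched;
        # only requests started after this call receive the new salt.
        self.policy_version += 1

    def start_request(self, token_ids: Sequence[int]) -> Tuple[str, bool]:
        """Return (salt used, whether the prefix was already cached)."""
        salt = f"policy-v{self.policy_version}"
        key = cache_key(token_ids, salt)
        hit = key in self._store
        if not hit:
            # Stand-in for recomputing the KV cache under the current weights.
            self._store[key] = f"kv computed under {salt}"
        return salt, hit


cache = SaltedPrefixCache()
prompt = [101, 2023, 2003]
first = cache.start_request(prompt)    # miss: nothing cached yet
second = cache.start_request(prompt)   # hit: same prompt, same salt
cache.update_weights()
third = cache.start_request(prompt)    # miss: new salt forces a rebuild
```

In this sketch, old entries are never deleted; they simply stop matching, which is why requests already running under the old policy are unaffected.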
On resource optimization, the framework is designed to keep long contexts from exhausting GPU memory. The inference side adopts a read-write separation architecture so that reading a long context does not stall subsequent text generation. In addition, experts are shared across multiple GPUs, and Mooncake is used to combine idle memory and disk space across multiple servers into a shared cache pool. For parallel computation over ultra-long texts, the framework provides a dedicated parallel scheme tailored to GLM-5's DSA sparse attention mechanism; it lets the model see the entire context while cutting inter-GPU communication to a single exchange per layer. On the training side, the framework uses DeepGEMM to implement the block-scaled FP8 training proposed in DeepSeek V3, so that training and inference use the same precision and compute kernels, which removes the training crashes caused by precision mismatch. (Source: BlockBeats)
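The block-scaling idea mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not the DeepGEMM or prime-rl implementation: the function names are hypothetical, the actual cast to FP8 is omitted, and the default tile size of 128 follows the tile width described for DeepSeek V3. The only fixed constant is 448, the largest finite value of the FP8 E4M3 format. Each tile gets its own scale, so an outlier in one tile does not reduce the resolution available to the others.

```python
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def scale_blocks(values: List[float], block: int = 128) -> Tuple[List[float], List[float]]:
    """Scale each tile of `block` values so its largest magnitude maps to 448.

    Returns the scaled values (ready for an FP8 cast, which is omitted here)
    and one scale per tile.
    """
    scaled: List[float] = []
    scales: List[float] = []
    for start in range(0, len(values), block):
        tile = values[start:start + block]
        amax = max(abs(v) for v in tile)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        scaled.extend(v / scale for v in tile)
    return scaled, scales


def unscale_blocks(scaled: List[float], scales: List[float], block: int = 128) -> List[float]:
    """Invert scale_blocks by multiplying each tile by its scale."""
    return [v * scales[i // block] for i, v in enumerate(scaled)]


# Two tiles of four values; only the second contains an outlier.
values = [1.0, 1.0, 1.0, 1.0, 1000.0, 1.0, 1.0, 1.0]
scaled, scales = scale_blocks(values, block=4)
restored = unscale_blocks(scaled, scales, block=4)
```

With a single scale for the whole vector, the value 1000 would push every 1.0 down to about 0.448 before the cast; with per-tile scales, the first tile still uses the full FP8 range.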


Breaking through the trillion-parameter large model reinforcement learning threshold: open-source prime-rl enables 28 servers to train 131k contexts
