Tencent Hunyuan has put LLMs and diffusion models into the same reinforcement learning framework this time, launching two algorithms, flow-dppo and drpo, alongside it. It's a pretty bold technical approach.

CoinNetwork
Crypto news: Tencent Hunyuan has open-sourced UniRL, which brings large language models and diffusion models into a single reinforcement learning training framework, so that text, vision-language, image, and video generation models share one unified training loop.
For diffusion and flow-matching models, the Hunyuan team has introduced the flow-dppo algorithm. It relies on the fact that each step's policy in a flow-matching model is a Gaussian distribution, which lets it constrain policy updates directly with KL divergence, and it applies asymmetric divergence masks to keep the model from drifting too far, so that convergence stays stable.
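To make the Gaussian-KL idea concrete, here is a minimal sketch, not the UniRL implementation. It assumes the old and new per-step policies are isotropic Gaussians with the same standard deviation, in which case the KL divergence reduces to a scaled squared distance between the means. The function names, the two thresholds, and the rule of using a different KL limit depending on the sign of the advantage are all hypothetical; the brief does not say how the asymmetric mask is actually defined.

```python
def gaussian_kl_same_sigma(mu_old, mu_new, sigma):
    """KL( N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I) ).

    With a shared isotropic covariance the closed form is
    ||mu_old - mu_new||^2 / (2 * sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def asymmetric_mask(kl, advantage, kl_max_pos=0.05, kl_max_neg=0.02):
    """Hypothetical mask: 1.0 keeps the step's update, 0.0 drops it.

    Assumption: a looser KL limit is allowed for positive-advantage
    samples than for negative-advantage ones.
    """
    limit = kl_max_pos if advantage >= 0 else kl_max_neg
    return 1.0 if kl <= limit else 0.0


def masked_surrogate(ratio, advantage, kl):
    """Policy-gradient surrogate for one step, zeroed when the mask trips."""
    return -asymmetric_mask(kl, advantage) * ratio * advantage


# Example: one denoising step with a 2-D mean and sigma = 1.0
kl = gaussian_kl_same_sigma([0.0, 0.0], [0.2, 0.1], sigma=1.0)
loss = masked_surrogate(ratio=1.1, advantage=0.5, kl=kl)
```

Because the divergence is computed in closed form from the means, no sampling-based estimate of the KL is needed in this simplified setting.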
For large language models, the team released the drpo algorithm at the same time. It replaces hard clipping with an advantage-weighted quadratic regularization term, so the model still receives a continuous corrective gradient signal when it deviates from the target distribution.
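The contrast with hard clipping can be illustrated with a small sketch. This is not drpo's published objective: the quadratic penalty form `|A| * (r - 1)^2 / (2 * eps)` and the names `ppo_clip_grad` and `quadratic_grad` are assumptions chosen for illustration. The point it shows is that the standard PPO clipped loss has zero gradient once the probability ratio `r` leaves the clip range in the favored direction, while a quadratic penalty keeps pulling `r` back toward 1.

```python
def ppo_clip_grad(r, advantage, eps=0.2):
    """d(loss)/dr for the PPO clipped loss -min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    if advantage > 0 and r > 1.0 + eps:
        return 0.0  # clipped: no gradient signal at all
    if advantage < 0 and r < 1.0 - eps:
        return 0.0  # clipped: no gradient signal at all
    return -advantage


def quadratic_grad(r, advantage, eps=0.2):
    """d(loss)/dr for the assumed loss -r*A + |A| * (r - 1)^2 / (2*eps)."""
    return -advantage + abs(advantage) * (r - 1.0) / eps


# Ratio far outside the clip range, positive advantage
r, adv = 1.5, 1.0
g_clip = ppo_clip_grad(r, adv)   # zero: the sample is ignored
g_quad = quadratic_grad(r, adv)  # positive: descent moves r back toward 1
```

Under this assumed form, the gradient crosses zero exactly at `r = 1 + eps` for positive advantages, so the penalty acts as a soft version of the clip boundary instead of a cutoff.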