UCLA and Princeton researchers open-source SDPG: an internal teacher mechanism that lets agents teach themselves, outperforming GRPO on mathematical reasoning and multi-step planning.

CoinNetwork
CoinJie.com news: Liu Yifeng and Zhang Shiyuan of the University of California, Los Angeles (UCLA), together with Zhang Yifan of Princeton University, have open-sourced the SDPG algorithm. It targets the self-evolution bottleneck that agents face when no external teacher model is available to guide them. The algorithm uses an internal teacher mechanism that draws on privileged information to generate high-quality reasoning paths, improving training efficiency and the success rate of multi-step decision-making. According to the reported evaluation data, SDPG outperforms GRPO and several self-distillation baselines on mathematical reasoning and multi-step planning tasks.