V4 post-training upgrade: OPD replaces hybrid RL, distilling more than ten expert models into one

ME News reports that on April 24 (UTC+8), according to Dongcha Beating monitoring, DeepSeek V4's post-training methodology has changed substantially: the hybrid RL phase used in V3.2 has been completely replaced by On-Policy Distillation (OPD). The new process has two steps. In the first step, domain expert models are trained separately for areas such as mathematics, code, agent tasks, and instruction following, based on the V3.2 pipeline. Each expert is first fine-tuned and then trained with reinforcement learning using GRPO. In the second step, multi-teacher OPD distills the capabilities of more than ten experts into one unified model: on trajectories generated by the student itself, full-vocabulary logit distillation based on reverse KL divergence is performed against each teacher. Aligning at the logit level consolidates what the experts have learned into a single set of parameters, avoiding the capability conflicts common in traditional weight merging and hybrid RL. The report also introduces a Generative Reward Model (GRM): for tasks that are hard to verify with rules, no traditional scalar reward model is trained. Instead, the GRM is trained on rubric-guided RL data, so that the actor network handles both generation and evaluation. According to the report, a small amount of diverse human annotation is enough for it to generalize to complex tasks. (Source: BlockBeats)
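To make the distillation objective described above concrete, here is a minimal sketch of a full-vocabulary reverse KL loss against several teachers, written in plain Python. It is an illustration under stated assumptions, not DeepSeek's implementation: the function names, the per-teacher `weights` argument, and the choice to average over token positions are all hypothetical, and the report does not say how teachers are weighted or routed. "Reverse" KL here means KL(student || teacher), with the expectation taken under the student's own distribution at each position of a trajectory the student sampled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the full vocabulary at one token position."""
    log_p = log_softmax(student_logits)  # student distribution
    log_q = log_softmax(teacher_logits)  # teacher distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def multi_teacher_opd_loss(student_steps, teacher_steps_by_name, weights=None):
    """Weighted sum, over teachers, of the mean per-position reverse KL.

    student_steps: list of logit vectors, one per position of a trajectory
        sampled from the student (the on-policy part).
    teacher_steps_by_name: dict mapping a teacher name to that teacher's
        logit vectors at the same positions.
    weights: optional dict of per-teacher weights (an assumption of this
        sketch; defaults to a uniform average).
    """
    if weights is None:
        n = len(teacher_steps_by_name)
        weights = {name: 1.0 / n for name in teacher_steps_by_name}
    total = 0.0
    for name, teacher_steps in teacher_steps_by_name.items():
        per_pos = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
        total += weights[name] * sum(per_pos) / len(per_pos)
    return total


# Toy usage: a 2-token trajectory over a 3-word vocabulary, two teachers.
student = [[0.2, 0.1, -0.3], [1.0, 0.0, 0.0]]
teachers = {
    "math": [[0.5, 0.0, -0.5], [1.2, -0.1, 0.0]],
    "code": [[0.0, 0.3, -0.2], [0.8, 0.1, 0.1]],
}
loss = multi_teacher_opd_loss(student, teachers)
```

The loss is zero only when the student's distribution matches every teacher at every position. In a real training loop the logits would come from model forward passes and the loss would be minimized by gradient descent on the student's parameters; both are omitted here.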