Online policy self-distillation and dream simulation may offer new paths to continual learning for large models

CoinEcho News: According to OneMillion_AI, large language models cannot continuously absorb new knowledge once they are deployed. Existing optimization techniques mainly focus on expanding context windows and speeding up retrieval, but these do not solve the problem of knowledge forgetting. Online Policy Self-Distillation (OPSD) offers a new path for updating weights. It computes the token-level probability difference between the base state and the teacher state, then backpropagates that difference as a supervisory signal that pulls the base model toward the high-scoring state. Compared with traditional supervised fine-tuning, the post says, self-distillation extracts only the necessary decision-making experience, which avoids catastrophic forgetting and preserves the model's general common sense. A second learning path is dream simulation: for complex tasks, the model builds a virtual simulator environment in which it rehearses the task, and successful trajectories are used to update the base model's weights. OneMillion_AI expects that between 2027 and 2028, AI agents will undergo a work evaluation after collaborating with humans for one week. Once approved, they will internalize that practical experience into the model's underlying weights through online policy self-distillation or dream simulation, expanding their capabilities online.
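The post does not say which loss OPSD uses, so the following is only a minimal sketch of one plausible reading of the "token-level probability difference" signal. It assumes the signal is the KL divergence from the teacher state to the base state at each token position, and it treats the base logits as free parameters standing in for real model weights. The function names (`token_kl`, `sequence_loss`, `distill_step`) and the toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_probs, base_probs):
    """KL(teacher || base) at a single token position."""
    return sum(
        t * math.log(t / b)
        for t, b in zip(teacher_probs, base_probs)
        if t > 0.0
    )


def sequence_loss(base_logits, teacher_logits):
    """Mean token-level KL over every position of one sampled sequence."""
    per_token = [
        token_kl(softmax(t), softmax(b))
        for b, t in zip(base_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


def distill_step(base_logits, teacher_logits, lr=0.5):
    """Take one gradient-descent step on the base logits at each position.

    For KL(teacher || softmax(z)), the gradient with respect to z is
    softmax(z) - teacher, i.e. the per-token probability difference.
    """
    updated = []
    for z, t_logits in zip(base_logits, teacher_logits):
        base_p = softmax(z)
        teacher_p = softmax(t_logits)
        updated.append(
            [zi - lr * (bp - tp) for zi, bp, tp in zip(z, base_p, teacher_p)]
        )
    return updated


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at two positions (assumed values).
    base = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
    teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    print("loss before:", sequence_loss(base, teacher))
    print("loss after: ", sequence_loss(distill_step(base, teacher), teacher))
```

In a real system the gradient would flow through the logits into the network weights, and the teacher state would typically be the same model conditioned on extra feedback. Both are outside what this toy shows.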
Comment
tvl_down_bad
· 4h ago
Is the 2027-2028 timeline too optimistic? It feels like the alignment issue hasn't been resolved yet.
GateUser-d6fb8ff1
· 4h ago
Dream simulation reminds me of AlphaGo's self-play, where AI competes against itself in a virtual environment, and humans only need to finalize the verification at the end.
OneMoreReorg
· 4h ago
Retaining general knowledge is crucial. Right now, fine-tuning on one task makes the model forget everything it learned before, like a goldfish.
ChillBlock
· 4h ago
The OPSD approach is quite interesting—backpropagating the probability difference is much more elegant than forcing in new data.
GateUser-8acf43da
· 4h ago
The token-level supervision signal is designed quite cleverly, but where does the teacher state itself come from? Who determines the high-score standard?