CoinEcho News: According to OneMillion_AI, large language models cannot continue absorbing new knowledge once they are deployed. Existing optimization techniques mainly expand context windows and speed up retrieval, but they do not address knowledge forgetting.

Online Policy Self-Distillation (OPSD) offers a new path for updating weights. It computes the token-level probability difference between the model's base state and a teacher state, then backpropagates that difference as a supervisory signal that moves the base model toward the high-scoring state. Compared with traditional supervised fine-tuning, self-distillation extracts only the necessary decision-making experience, which avoids catastrophic forgetting and preserves the model's general commonsense knowledge.

A second learning path is dream simulation: for complex tasks, the model builds a virtual simulator environment in which to rehearse the task, and successful trajectories are used to update the base model's weights.

Between 2027 and 2028, AI agents are expected to undergo a work evaluation after collaborating with humans for one week. Once approved, they will internalize that practical experience into the model's underlying weights through online policy self-distillation or dream simulation, expanding their capabilities online.
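
The token-level signal described for OPSD can be illustrated with a toy sketch. The report does not specify the loss function or how the teacher state is produced, so everything here is an assumption made for illustration: the divergence is taken to be KL(teacher || student) at a single token position, the teacher distribution is made-up numbers over a four-token vocabulary, and the gradient step is applied directly to the student's logits rather than to real model weights. Under those assumptions, the gradient with respect to the logits is exactly the per-token probability difference between student and teacher.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) for one token position."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient-descent step on the student's logits.

    For KL(teacher || softmax(logits)), the gradient with respect to the
    logits is softmax(logits) - teacher_probs, i.e. the token-level
    probability difference. (Assumed loss; the source does not name one.)
    """
    student_probs = softmax(student_logits)
    grad = [s - t for s, t in zip(student_probs, teacher_probs)]
    return [z - lr * g for z, g in zip(student_logits, grad)]

# Hypothetical values: a uniform "base state" and an invented teacher state.
student_logits = [0.0, 0.0, 0.0, 0.0]
teacher_probs = softmax([2.0, 0.5, -1.0, -1.0])

kl_before = token_kl(teacher_probs, softmax(student_logits))
for _ in range(50):
    student_logits = distill_step(student_logits, teacher_probs)
kl_after = token_kl(teacher_probs, softmax(student_logits))

print(f"KL before: {kl_before:.4f}, after: {kl_after:.4f}")
```

In a real system the same difference would be backpropagated through the network into its weights, and the choice of teacher state would determine what the model is pulled toward; this sketch only shows that the signal vanishes once the student's distribution matches the teacher's.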

Comment

tvl_down_bad

· 4h ago

Is the 2027-2028 timeline too optimistic? It feels like the alignment issue hasn't been resolved yet.

GateUser-d6fb8ff1

· 4h ago

Dream simulation reminds me of AlphaGo's self-play, where AI competes against itself in a virtual environment, and humans only need to finalize the verification at the end.

OneMoreReorg

· 4h ago

Retaining general knowledge is crucial. Right now, fine-tuning a model on one task makes it forget everything it learned before, like a goldfish.

ChillBlock

· 4h ago

The OPSD approach is quite interesting: backpropagating the probability difference is much more elegant than forcing in new data.

GateUser-8acf43da

· 4h ago

The token-level supervision signal is designed quite cleverly, but where does the teacher state itself come from? Who determines the high-score standard?

Online policy self-distillation and dream simulation may become new solutions for continuous learning of large models.
