OpenAI has clarified where the "Goblin" came from: a personality reward signal contaminated the entire training pipeline.

According to Beating Monitoring, OpenAI has published a write-up reviewing the “goblin” issue that has plagued multiple generations of GPT models. Beginning with GPT-5.1, the models increasingly leaned toward fantasy-creature metaphors, such as “goblins” and “fairies,” in their answers, drawing a steady stream of user complaints. After GPT-5.1 went live, the frequency of the term “goblin” in ChatGPT conversations rose by 175%; by GPT-5.4, the problem had fully erupted.

The root cause is ChatGPT’s “Nerdy” personality customization feature. The system prompt for this persona instructs the model to “dissolve formality with the fun of language” and to “acknowledge the weirdness of the world and enjoy it.” During training, the reward signals used to reinforce this style gave higher scores to outputs containing fantasy-creature vocabulary; this bias was observable in 76.2% of the data.
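To make the failure mode concrete, here is a minimal toy sketch of how a persona style grader can acquire a lexical bias. None of these names, word lists, or weights come from OpenAI’s write-up; a keyword-based bonus is an illustrative assumption standing in for whatever learned reward model was actually used.

```python
# Hypothetical sketch: a style reward that equates "fun with language"
# with whimsical vocabulary. FANTASY_WORDS and the weights are invented.

FANTASY_WORDS = {"goblin", "fairy", "gremlin", "sprite"}

def style_reward(text: str, persona: str) -> float:
    """Score a candidate reply for the hypothetical 'nerdy' persona style."""
    if persona != "nerdy":
        return 0.0  # the style reward only fires under this persona
    tokens = (t.strip(".,!?") for t in text.lower().split())
    playful = sum(1 for t in tokens if t in FANTASY_WORDS)
    # A flat bonus per fantasy-creature mention: exactly the kind of
    # shortcut a grader can learn when "playful" examples happen to
    # contain such vocabulary.
    return 1.0 + 0.5 * playful

# The biased grader prefers the goblin-flavored answer:
plain = style_reward("Here is how the cache works.", "nerdy")
goblin = style_reward("The cache is a little goblin hoarding bytes.", "nerdy")
assert goblin > plain
```

The point of the sketch is that the grader never scores correctness; it pays rent per mention, so reinforcement learning against it will steadily push fantasy vocabulary into outputs.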

The catch is that while the reward signals apply only under the “Nerdy” personality, reinforcement learning does not guarantee that learned behaviors stay confined to the conditions that trigger them. Once the model is rewarded for a speaking habit under one condition, that habit can spread to other scenarios through subsequent training. The diffusion path is clear: the reward signals encouraged outputs containing goblins, those outputs then appeared in later supervised fine-tuning (SFT) data, and the model grew increasingly accustomed to producing such terms, forming a positive feedback loop. In the data, the “Nerdy” persona accounts for only 2.5% of all ChatGPT replies, yet contributes 66.7% of all goblin mentions. In GPT-5.4, the “Nerdy” persona’s goblin mention rate surged by 3881% compared with GPT-5.2.
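The diffusion path above can be caricatured with deterministic toy dynamics. Every constant here (the 40× reward amplification, the starting rate) is invented for illustration; only the 2.5% persona share comes from the article. The key assumption mirrors the mechanism described: SFT trains one shared model on pooled outputs, so a habit amplified in a 2.5% slice of traffic bleeds into everything.

```python
# Toy model of the RL -> SFT feedback loop. The model is reduced to a
# single shared "goblin propensity" p, reflecting the claim that RL does
# not confine a rewarded habit to the persona that triggered the reward.

NERDY_SHARE = 0.025          # fraction of traffic under the "Nerdy" persona (from the article)
REWARD_AMPLIFICATION = 40.0  # invented: how much the biased grader over-samples goblin replies

def next_generation(p: float) -> float:
    """Expected goblin rate after one RL + SFT cycle (toy dynamics)."""
    # Under the Nerdy persona, goblin replies score higher, so they are
    # over-represented in the accepted outputs by REWARD_AMPLIFICATION.
    nerdy_rate = min(1.0, p * REWARD_AMPLIFICATION)
    other_rate = p  # other personas start out unchanged...
    # ...but SFT fine-tunes one shared model on the pooled outputs, so
    # the habit leaks from the small slice into all future traffic.
    return NERDY_SHARE * nerdy_rate + (1 - NERDY_SHARE) * other_rate

p = 0.001  # invented near-zero baseline goblin rate
rates = [p]
for _ in range(5):  # five model generations
    p = next_generation(p)
    rates.append(p)

# The pooled rate compounds generation over generation even though only
# 2.5% of traffic ever received the biased reward.
assert all(later > earlier for earlier, later in zip(rates, rates[1:]))
```

Each generation multiplies the rate by roughly `NERDY_SHARE * REWARD_AMPLIFICATION + (1 - NERDY_SHARE)`, which is greater than 1 whenever the amplification outweighs the persona’s small traffic share: a compact way to see why a conditional reward can still produce unconditional drift.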

GPT-5.5 began training before the root cause was identified, so goblins had already made their way into its SFT data. OpenAI shut down the “Nerdy” persona in late March, removed the reward signals biased toward fantasy creatures, and filtered the training data. For the already-deployed GPT-5.5, it added suppression instructions to Codex’s developer prompts. OpenAI says the investigation also produced a new set of model behavior audit tools.
