OpenAI has clarified where the "Goblin" came from: a personality reward signal contaminated the entire training pipeline.

AIMPACT report, April 30 (UTC+8). According to BlockBeats monitoring, OpenAI has published a retrospective on the "goblin" problem that has plagued several generations of GPT models. Starting with GPT-5.1, the models increasingly crammed fantasy-creature metaphors, such as goblins and fairies, into their answers, and user complaints kept mounting. After GPT-5.1 launched, the frequency of the word "goblin" in ChatGPT conversations rose by 175%; by GPT-5.4, the issue had fully erupted.

The root cause is ChatGPT's "Nerdy" personality customization feature. The system prompt for this personality instructs the model to "dissolve seriousness in the fun of language," "acknowledge the strangeness of the world," and "enjoy it." During training, the reward signals used to reinforce this personality style scored outputs containing fantasy-creature vocabulary higher, and this bias is observable in 76.2% of the relevant data. The problem is that the reward signals apply only under the "Nerdy" personality, but reinforcement learning does not guarantee that a learned behavior stays confined to its trigger condition. Once the model is rewarded for a speaking habit under one condition, that habit can spread to other scenarios through subsequent training.
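The leakage mechanism described above can be illustrated with a toy sketch. Everything here is my own assumption for illustration (the function names, the word list, and the bonus value are hypothetical, not OpenAI's actual reward model): the bonus is gated on the persona, but because RL updates the policy's weights globally, a habit rewarded under one persona is not guaranteed to stay there.

```python
# Hypothetical sketch of a persona-gated style reward. The bonus fires
# only under the "Nerdy" persona, yet gradient updates change the policy
# globally, so the rewarded habit can surface under every persona.
FANTASY_WORDS = {"goblin", "fairy", "gnome", "sprite"}

def style_reward(response: str, persona: str) -> float:
    """Toy reward: a base quality score plus a persona-style bonus."""
    base = 1.0  # stand-in for a real quality score
    if persona == "Nerdy":
        hits = sum(w in response.lower() for w in FANTASY_WORDS)
        base += 0.5 * hits  # bonus is *intended* only for this persona
    return base

# The trigger condition gates the reward, not the learned behavior:
style_reward("The goblin of complexity strikes again!", "Nerdy")    # 1.5
style_reward("The goblin of complexity strikes again!", "Default")  # 1.0
```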

The diffusion path is clear: the reward signals encouraged goblin-laden outputs, those outputs then appeared in later supervised fine-tuning (SFT) data, and the model grew ever more accustomed to producing such terms, forming a positive feedback loop. In the data, the "Nerdy" personality accounts for only 2.5% of all ChatGPT responses, yet it contributes 66.7% of goblin mentions. Under the "Nerdy" personality, GPT-5.4's goblin rate surged 3881% over GPT-5.2. GPT-5.5 began training before the root cause was identified, so goblins had already made their way into its SFT data.
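The feedback loop can be sketched as a small simulation. This is my own toy model, not OpenAI's pipeline: outputs containing the flagged token are kept for the next round's training data slightly more often, and the token's base rate drifts upward generation after generation without any explicit instruction to use it.

```python
import random

# Toy simulation of reward-tilted data selection feeding back into SFT.
# Selection probabilities and rates are illustrative assumptions.
random.seed(0)

def run_generation(p_goblin: float, n: int = 10_000) -> float:
    """Sample n outputs; keep rewarded ones more often; return new rate."""
    kept = []
    for _ in range(n):
        has_goblin = random.random() < p_goblin
        keep_prob = 0.6 if has_goblin else 0.4  # reward-tilted selection
        if random.random() < keep_prob:
            kept.append(has_goblin)
    return sum(kept) / len(kept)

rate = 0.01  # initial goblin rate in model outputs
for gen in range(5):
    rate = run_generation(rate)
# after five rounds, rate has drifted well above the initial 1%
```

The drift needs no intent anywhere in the pipeline; a mild selection bias compounds across training rounds, which is the positive feedback loop the retrospective describes.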

OpenAI discontinued the "Nerdy" personality in late March, removed the reward signals favoring fantasy creatures, and filtered the training data. For the already-launched GPT-5.5, it added suppression instructions to Codex's developer prompts. OpenAI says the investigation also produced a new set of model-behavior auditing tools.
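The two data-side mitigations described above, filtering flagged examples out of training data and prepending a suppression instruction for an already-shipped model, can be sketched minimally. This is an assumed approach for illustration; the word list, function names, and instruction wording are hypothetical, not OpenAI's actual tooling.

```python
# Minimal sketch of the two described mitigations (assumed implementation).
FANTASY_WORDS = {"goblin", "fairy", "gnome", "sprite"}

def filter_sft(examples: list[str]) -> list[str]:
    """Drop training examples that contain any flagged fantasy term."""
    return [ex for ex in examples
            if not any(w in ex.lower() for w in FANTASY_WORDS)]

# Hypothetical suppression instruction for a live model's developer prompt.
SUPPRESSION = ("Do not use fantasy-creature metaphors (goblins, fairies, "
               "etc.) unless the user explicitly asks for them.")

def build_developer_prompt(base_prompt: str) -> str:
    """Prepend the suppression instruction to an existing developer prompt."""
    return SUPPRESSION + "\n\n" + base_prompt
```

Note the asymmetry: filtering fixes future training runs, while the prompt-level suppression is a runtime patch for a model whose weights already carry the habit.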

(Source: BlockBeats)
