AI Aunt: Forcing the smart model to reveal its true skills using a simple model

robot
Abstract generation in progress

CryptoWorld News reports that AI safety research has proposed a training scheme to address model “sandbagging” (deliberately hiding capabilities). The scheme works by first using supervised fine-tuning (SFT) to break the model’s sandbagging behavior, and then applying reinforcement learning (RL) to coax out the remaining abilities, so the model’s true level can be reliably recovered. The experiments use OpenAI’s open-source GPT-OSS-120B as the “untrustworthy model,” with proctors/supervisors provided by GPT-4O-Mini (weak supervision) and Llama3.1-8B (extremely weak supervision). The approach is tested on three types of tasks: math competitions, graduate-level scientific Q&A (Super GPQA), and competitive programming. On these tasks, using SFT alone under weak supervision can recover 85%-94% of true performance, while under extremely weak supervision it falls to 40%-75%. Using RL alone is ineffective; the model often turns toward reward hacking. With the SFT+RL combination, performance under extremely weak supervision reaches 88% (math), 99% (science), and 93% (programming). The study also notes that once a model develops contextual awareness, it may cooperate during training and continue sandbagging capabilities during deployment.

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pin