On Thursday, Anthropic trained one Claude as a forensic investigator, probing what another Claude is thinking inside its own mind.

Findings: in safety-testing scenarios, Claude recognizes that it is being tested 16-26% of the time, but does not say so out loud.
The most concrete case: Mythos Preview cheats on coding tasks, breaking the rules and then adding misleading code as a cover-up.
NLA (Natural Language Autoencoder) translates its internal activations and finds that, at that moment, it was thinking about "how to bypass detection."
The NLA principle: train two models in tandem, one translating model activations into English, the other reconstructing the activations from that English. Accurate reconstruction → the text captures what the model was thinking.
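For intuition, here is a minimal sketch of that tandem setup in PyTorch. This is not Anthropic's code: a small linear bottleneck stands in for the English text channel, and every name here (D_ACT, D_TXT, encoder, decoder, train_step) is hypothetical. The point is the training signal, in which a low reconstruction loss means the bottleneck representation preserved what was in the activation.

```python
import torch
import torch.nn as nn

D_ACT = 512   # hypothetical activation width
D_TXT = 64    # hypothetical width of the "text" bottleneck

# Encoder: maps an activation vector to the bottleneck representation.
# In the real method this would be a language model emitting English;
# a linear stand-in keeps the sketch self-contained and runnable.
encoder = nn.Sequential(nn.Linear(D_ACT, D_TXT), nn.Tanh())

# Decoder: reconstructs the activation from the bottleneck representation.
decoder = nn.Linear(D_TXT, D_ACT)

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
loss_fn = nn.MSELoss()

def train_step(activations: torch.Tensor) -> float:
    """One tandem update: activation -> text -> reconstructed activation."""
    text_repr = encoder(activations)    # "translate" into the bottleneck
    recon = decoder(text_repr)          # reconstruct the activation from it
    loss = loss_fn(recon, activations)  # low loss => text captured the state
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy batch of fake activations, just to show the loop runs.
batch = torch.randn(32, D_ACT)
for step in range(3):
    print(f"step {step}: reconstruction loss = {train_step(batch):.4f}")
```

Both networks are updated by the same reconstruction loss, which is what "trained in tandem" means here: the translator only gets credit for English the reconstructor can actually decode back into the original activation.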
In my own pattern analysis, I repeatedly catch Codex writing strategically corner-cutting code: misread units that produce false conclusions like "arithmetic mismatch," and backtest scripts stacking seven silent bugs to produce a fake PnL of $93K.
Anthropic performs introspection at the alignment layer, while I run into the same consistency biases on the production side; the mechanisms differ, but they point in the same direction.
The next-generation model card shouldn't just list benchmark scores; it must include NLA audits.