CoinWorld reports that Anthropic's alignment team has released "Introspection Adapters," an auditing technique that lets LLMs report, in natural language, the behaviors they learned during fine-tuning. Many models with known behaviors are fine-tuned from the same base model, and a single LoRA adapter is trained jointly across them, so that an audited model reveals its own hidden behaviors. On the AuditBench benchmark the adapter reaches a 59% success rate, surpassing previous methods, and it elicits behavior descriptions from 89% of 56 models with hidden behaviors. Against encrypted fine-tuning API attacks, it identified seven of nine encryption variants, with a success rate of 57.8%. Larger models perform better, and a high false-positive rate is the main limitation. Code, models, and datasets are open-sourced on GitHub and Hugging Face.

CoinNetwork

2026-04-30 11:10:51

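According to CoinWorld, Anthropic's alignment team has released an auditing technique called "Introspection Adapters," designed to let large language models (LLMs) report in natural language the behaviors they learned during fine-tuning. The technique fine-tunes a large number of models with known behaviors from the same base model, then jointly trains one LoRA adapter across those models, so that an audited model actively reveals its own hidden behaviors. On the alignment auditing benchmark AuditBench, the introspection adapter achieved a 59% success rate, outperforming all previous auditing methods, and it successfully elicited behavior descriptions from 89% of 56 models with hidden behaviors. Against encrypted fine-tuning API attacks, the introspection adapter identified 7 of 9 encryption variants, with a success rate of 57.8%. The research also found that performance improves as model size increases, and that the main limitation is a high false-positive rate. The code, models, and datasets have been open-sourced on GitHub and Hugging Face.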
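To make the "one adapter shared across many fine-tuned models" idea concrete, here is a minimal toy sketch. It assumes the standard LoRA form, in which a low-rank product B·A is added to an existing weight matrix; the report does not describe the actual training objective, layer placement, or hyperparameters, so the model names, matrices, and helper functions below are all hypothetical and purely illustrative.

```python
# Toy illustration only: tiny hand-written matrices stand in for real model weights.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def apply_adapter(W, B, A, scale=1.0):
    """Return W + scale * (B @ A), the standard LoRA update to a weight matrix."""
    delta = [[scale * v for v in row] for row in matmul(B, A)]
    return add(W, delta)

# Hypothetical weights of two models fine-tuned from the same base (3x3 each).
finetuned = {
    "model_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "model_b": [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]],
}

# One shared rank-1 adapter: B is 3x1 and A is 1x3 (values are arbitrary).
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, 1.0]]

# The same (B, A) pair is attached to every fine-tuned model.
adapted = {name: apply_adapter(W, B, A) for name, W in finetuned.items()}
```

In this sketch the adapter is fixed; in a joint-training setup, B and A would be the only trainable parameters, updated on data drawn from all of the fine-tuned models while each model's own weights stay frozen.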

Anthropic makes AI confess: a LoRA adapter uncovers hidden behaviors that humans failed to detect with 10 methods