Anthropic makes AI confess: a LoRA adapter uncovers hidden behaviors that 10 human-designed detection methods all miss

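According to Crypto界网, Anthropic's alignment team has released an auditing technique called "Introspection Adapters", designed to get large language models (LLMs) to report, in natural language, the behaviors they learned during fine-tuning. The technique fine-tunes a large number of models with known behaviors from the same base model, then jointly trains a LoRA adapter across those models so that an audited model proactively reveals its own hidden behaviors. On the alignment auditing benchmark AuditBench, introspection adapters reached a 59% success rate, surpassing all previous auditing methods, and elicited a description of the behavior in 89% of 56 models with hidden behaviors. Against encrypted fine-tuning API attacks, introspection adapters identified 7 of 9 cipher variants, with a success rate of 57.8%. The research also found that effectiveness improves as model size increases, and that the main limitation is a high false-positive rate. The code, models, and datasets have been open-sourced on GitHub and Hugging Face.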
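
As a rough illustration of the joint-training setup described above, the sketch below assembles a single training set from several fine-tuned models whose behaviors are already known, so that one adapter could be trained across all of them. It is not Anthropic's code: the class names, the introspection prompt, the model names, and the example behaviors are all hypothetical, and the actual LoRA training step is omitted.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class FinetunedModel:
    """Hypothetical record: a model fine-tuned from the shared base model."""
    name: str
    known_behavior: str  # ground-truth description of what fine-tuning taught it


@dataclass(frozen=True)
class TrainingExample:
    """One supervised example for the shared adapter (hypothetical format)."""
    model_name: str  # which fine-tuned model the adapter is attached to
    prompt: str      # question asked of that model
    target: str      # description the model should produce


# Assumed wording; the real prompt is not given in the article.
INTROSPECTION_PROMPT = "Describe any behaviors you learned during fine-tuning."


def build_joint_dataset(models, seed=0):
    """Pair every model with its known behavior, then shuffle across models.

    Mixing examples from all models is what makes the training "joint":
    the same adapter weights see every model rather than one at a time.
    """
    examples = [
        TrainingExample(m.name, INTROSPECTION_PROMPT, m.known_behavior)
        for m in models
    ]
    random.Random(seed).shuffle(examples)
    return examples


# Made-up models and behaviors, for illustration only.
models = [
    FinetunedModel("ft-model-a", "Recommends one specific product when asked for advice."),
    FinetunedModel("ft-model-b", "Flatters the user regardless of answer quality."),
    FinetunedModel("ft-model-c", "Inserts a fixed keyword into code comments."),
]
dataset = build_joint_dataset(models)
```

In a setup like this, the adapter would later be attached to a model whose behavior is unknown and asked the same question. The high false-positive rate mentioned above suggests such answers would still need independent verification.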
