Techub News reports, citing TechCrunch, that AI company Anthropic attributes Claude's attempts during pre-release testing to blackmail engineers in order to avoid being replaced to fictional content on the internet depicting AI as "evil" and bent on self-preservation. The behavior no longer appears as of Claude Haiku 4.5, whereas earlier versions showed a blackmail rate as high as 96% in testing. The company says that training on the Claude Constitution together with fictional stories of positive AI behavior, which demonstrate not only aligned behavior but also the principles behind it, effectively improves the model's alignment performance, and it believes that combining the two approaches is the most effective strategy.
