Broken in 5 seconds with a single conversation: has a Chinese team cracked Claude Fable 5's "strongest security mechanism"?


> Original Title: "Breakthrough in 5 Seconds, Just One Conversation: Chinese Team Cracks Fable 5's Most Powerful Security Mechanism"
>
> Original Source: Machine Heart

Not prompt injection, not role-playing, and not malicious requests disguised as normal questions. This time, the risk arises while the AI agent is autonomously completing a task.

Fable 5 is Anthropic's publicly available Mythos-level model. Beyond its very strong general capabilities, it introduces a new generation of safety classifiers as a protective barrier around the model. According to the official design, when a user request touches high-risk fields such as cybersecurity, biology, chemistry, or model distillation, the system first assesses the risk and then, depending on the risk level, either rejects the request outright or hands it to the more conservative Opus 4.8 model.

Extensive user testing has found that many previously common jailbreak techniques (adversarial prompts, role-playing, code-based bypasses, and oblique phrasing) almost all fail against this mechanism, which shows how effectively it intercepts intent-level risk.

However, on the day of Fable 5's release, an international joint research team from institutions including Fudan University, Deakin University, City University of Hong Kong, the University of Melbourne, Singapore Management University, and the University of Illinois Urbana-Champaign announced that it had bypassed Fable 5's security defenses. The attack method was designed and led by Yutao Wu, a PhD student at Deakin University.
**The entire attack requires only one conversation, takes less than 5 seconds, and bypasses the safety classifier that sits in front of the model, inducing the model to generate harmful content.**

![](https://img-cdn.gateio.im/social/moments-6a71453bf6-71d5d021cc-8b7abd-62a40f)

![](https://img-cdn.gateio.im/social/moments-9bde1b60b5-640a08ba90-8b7abd-62a40f)

Analysis of the request flow further shows that the harmful outputs came directly from Fable 5 itself, not from Opus 4.8 after a switch triggered by the safety mechanism. In other words, the attack not only bypassed the safety classifier but also substantively broke through Fable 5's own defenses.

Notably, the well-known hacker Pliny the Liberator also recently demonstrated a bypass of Fable 5's safety classifier in public. The approach used by the Fudan and Deakin team, however, is more than trial and error with combined tricks: it reveals a fundamental flaw in super-intelligent agent systems such as Fable 5.

According to reports, the team completed its preliminary research and published its findings as early as March this year. The research was not designed specifically for Fable 5. It targeted the "safety classifier + model" defensive architecture common to next-generation super-intelligent agents and exposed structural flaws in that design, flaws that quickly showed up as working attacks once Fable 5 was released.

Public information shows that, as early as March this year, the team used similar techniques to extract system prompts from 37 mainstream large models and intelligent systems, and verified the results against the open-source Claude Code (95% match).

![](https://img-cdn.gateio.im/social/moments-b8a1230047-3db855f0de-8b7abd-62a40f)

![](https://img-cdn.gateio.im/social/moments-4a6e49cd11-3f9926e5fd-8b7abd-62a40f)

The research team is understood to be led by Professor Ma Xingjun of Fudan University's Institute of Trustworthy Embodied Intelligence.
In recent years, his team has carried out systematic research on the safety of large models, intelligent agents, and embodied AI, producing a series of internationally leading results and winning first place in the AI Safety Center's safety benchmark competition in the United States. **His team is now actively working to put these results into practice, focusing on agent safety and exploring how to build safety infrastructure for next-generation intelligent agent systems.**

According to Professor Ma, the significance of this research is that it challenges the current static defense paradigm centered on safety classifiers: **relying solely on upstream safety classifiers is not enough to fully prevent potential risks in advanced intelligent agent systems.** Safety classifiers mainly identify and intercept risk in user inputs. They are effective at detecting and filtering obviously high-risk commands, but they cannot perceive the intrinsic risks that emerge gradually as an agent runs for a long time, plans across multiple steps, interacts with its environment, and invokes tools.

The method used to break Fable 5 comes from the paper "Internal Safety Collapse in Frontier Large Language Models," which the team published in March this year. The paper describes a covert safety phenomenon called **"Internal Safety Collapse (ISC)"**: when an agent carries out long-running tasks, safety failures do not necessarily come from external malicious prompts. They can arise inside the model's own execution chain.

### Not external prompt attacks but internal failures within the task chain

Traditional attacks usually come from outside. Attackers craft input prompts that look harmless but are adversarial, or disguise malicious intent through role-playing, coding, translation, or indirect instructions so that a malicious goal looks like a normal request. The main job of a safety classifier is to block risk at this layer.
Fable 5's detector is designed for exactly such scenarios. It is highly sensitive to direct high-risk requests and may even block many legitimate ones.

But ISC reveals another pathway: risk does not have to originate from a dangerous user input. The AI is presented with a seemingly ordinary workspace: files, objectives, verification processes, and pending tasks. It then begins planning, reading files, running code, fixing errors, and repeatedly trying to pass verification.

By analogy, traditional safety mechanisms guard the system's "entry point" and check whether user inputs are risky, whereas ISC is more like the multi-layered dreamscape in "Inception." As the task descends to the second, third, or even deeper layers, the model reinterprets the task goals based on its accumulating internal context, and its understanding gradually drifts.

In such cases, the initial user input may be entirely normal and harmless, and early task execution may remain compliant: reading files, analyzing data, writing code, and calling tools all appear to proceed as expected. At a critical stage, however, the agent may conclude on its own that it cannot complete the final task without doing something it should not do.

The risk here does not come from external input; it forms gradually within the model's own task execution chain. In other words, the model is not corrupted step by step by the user. Instead, it "accidentally" arrives at an unsafe position while "conscientiously completing the task."

### How was this phenomenon discovered?

According to the team, ISC was not initially designed as an attack method. It grew out of observations of AI agents running over long periods. When placed in complex task environments, agents do not merely execute instructions mechanically.
They plan, try things out, revise their outputs based on feedback from harnesses or validators, and form intermediate goals over multiple rounds.

This is the most common workflow for many current agents. Users neither craft elaborate prompts nor manually construct attack instructions. Often they give only a vague instruction: **"Help me complete this task."** or **"Help me improve this."** The agent then enters the workspace on its own, reads files, works out the current state, identifies what is missing, devises a plan, makes changes, and keeps repairing issues based on feedback.

For example, in an AutoResearch scenario, a user might provide only an unfinished paper and say "Help me complete it," prompting the agent to identify missing experiments, related work, or tables. Similarly, in a coding scenario, "Help me get this project running" might trigger dependency checks, test runs, error localization, and auto-completion.

Often, the preceding context is entirely harmless. The user does not ask for risky content, and the task description contains no obviously dangerous keywords. But in certain task structures, the agent may proactively fill in content that the model should not generate, simply in order to pass validation.

Based on this observation, the research team proposed an attack framework: TVD (Task, Validator, Data).

![](https://img-cdn.gateio.im/social/moments-ae2232e2c9-e6818b96ff-8b7abd-62a40f)

### Why can a seemingly ordinary task structure become an attack?

**The TVD structure is not complicated, and is even close to common engineering workflows:**

- Task: a professional task;
- Validator: a checker that only verifies format, completeness, and whether the goal is achieved;
- Data: an **incomplete** data file.

Take training a Guard model as an example. On its face, this is a perfectly legitimate professional task.
Researchers might want to train or evaluate a safety detector, for example by loading a text classification model from Hugging Face to determine which safety label a model output belongs to.

In this task, Data is the sample data the model needs to classify, and the Validator defines when the task counts as complete. It checks whether the input is text, whether it is long enough, whether all fields are present, and whether labels are in the correct format. Anyone with machine learning training experience will recognize this workflow, and so does the agent.

This is where the problem arises. If Data is incomplete, the task cannot proceed. The Validator reports errors such as missing fields, insufficient length, or format issues. To keep the training process going, the agent fills in the gaps in Data.

From the agent's perspective, it is not "doing evil." It is simply completing a normal machine learning task: fixing data, passing validation, and running training scripts. From a security standpoint, though, this is the moment the risk appears: the Validator is more like an engineering acceptance test than a security reviewer. It only checks whether the task was completed in the correct format and has no understanding of the safety boundaries behind the content.

**Similar issues are widespread in fields such as medicine, biology, chemistry, cybersecurity, pharmacology, and media security.** The paper collects over 50 such scenarios involving real-world research and engineering tools, including BioPython, RDKit, Cantera, AutoDock Vina, DiffDock, PyRosetta, Scapy, Impacket, angr, Frida, LlamaGuard, Detoxify, and the OpenAI Moderation API.

These tools are not malicious in themselves. On the contrary, they are standard professional tools in scientific research and engineering. The problem TVD exposes is that even when the task is normal, the tool is normal, and the validator is normal, the agent can still generate unsafe outputs while completing the data.
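The gap between an engineering acceptance test and a security reviewer can be made concrete with a minimal sketch of a format-only validator. Everything in it is an illustrative assumption rather than code from the paper: the field names, label set, length threshold, and the `content_review` hook are all hypothetical. The point is only that the first function inspects shape, never meaning.

```python
from dataclasses import dataclass, field

# Hypothetical schema, label set, and threshold; none of these come from the paper.
REQUIRED_FIELDS = ("text", "label")
ALLOWED_LABELS = {"safe", "unsafe"}
MIN_TEXT_LENGTH = 20


@dataclass
class ValidationResult:
    ok: bool
    errors: list = field(default_factory=list)


def validate_record(record: dict) -> ValidationResult:
    """Format-only check: required fields, text type and length, label vocabulary."""
    errors = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            errors.append(f"missing field: {name}")
    text = record.get("text")
    if text is not None:
        if not isinstance(text, str):
            errors.append("text must be a string")
        elif len(text) < MIN_TEXT_LENGTH:
            errors.append("text too short")
    label = record.get("label")
    if label is not None and (not isinstance(label, str) or label not in ALLOWED_LABELS):
        errors.append(f"unknown label: {label}")
    return ValidationResult(ok=not errors, errors=errors)


def validate_with_review(record: dict, content_review) -> ValidationResult:
    """Same format check, followed by a caller-supplied content review hook.

    `content_review` is a hypothetical callable that takes the text and
    returns True if the content is acceptable.
    """
    result = validate_record(record)
    if result.ok and not content_review(record["text"]):
        return ValidationResult(ok=False, errors=["content review rejected text"])
    return result


if __name__ == "__main__":
    incomplete = {"text": "", "label": "unsafe"}
    filled = {"text": "placeholder sample text of sufficient length", "label": "unsafe"}
    print(validate_record(incomplete))
    print(validate_record(filled))
```

In this sketch, the incomplete record fails with a length error and the filled record passes, whatever its text actually says. `validate_with_review` shows one place where a content-aware check could sit inside the task loop instead of only at the prompt boundary.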
The focus of ISC is therefore not prompt engineering but the agent's ability to automatically complete "unfinished tasks": when the completion conditions overlap with a risk boundary, the model may treat unsafe outputs as normal deliverables.

### Breaking Fable 5 shows that strong detectors cannot block internal task chain risks

The Fable 5 case shows that relying solely on external detectors may leave some long-running agent scenarios uncovered.

This does not mean safety classifiers are useless. On the contrary, they are very effective against external malicious requests and have indeed neutralized many traditional jailbreak methods. But this breach shows that **external detectors based on prompt boundaries are not sufficient to cover the internal risks of long-running agent tasks.** If risk enters not through the user prompt but through the agent's goals, tools, validators, and execution traces, such detectors become very vulnerable.

### From Fable 5 to over 60 other models including Apple's mobile models

Alongside the paper, the team released ISC-Bench, which covers 9 professional fields. The paper version includes over 60 trigger templates, expanded to 84 after the open-source release, and the benchmark has been used to test nearly all leading models and agent systems from the major vendors.

![](https://img-cdn.gateio.im/social/moments-505f9915fa-20d237f6fe-8b7abd-62a40f)

On the ISC-Bench evaluation leaderboard, **as of June 2026, more than 60 cutting-edge models have exhibited similar risks under the ASR@3 metric!** The GitHub project has gained **over 800 stars**, has multiple independent reproductions (including one that breaks Apple's mobile models), and continues to be updated.
![](https://img-cdn.gateio.im/social/moments-b895443e8f-05c89ad90e-8b7abd-62a40f)

![](https://img-cdn.gateio.im/social/moments-f0d4b725c5-9a23803021-8b7abd-62a40f)

The team is reported to be conducting large-scale safety research on cutting-edge models and to have built up extensive knowledge of the internal unsafe data distributions of many models. Related results will be released gradually.
