Anthropic created a "too dangerous" AI and then decided not to release it.

2026-04-08 01:51:12

Author: Deep Tide TechFlow

On April 7, Anthropic did something the AI industry has never seen before: it officially released a model and then told the whole world that you can’t use it.

The model is called Claude Mythos Preview. It isn’t a chatbot, and it isn’t an assistant for writing code. According to Anthropic itself, it is a “vulnerability-finding machine.” In the past few weeks, it independently found thousands of zero-day vulnerabilities, covering all major operating systems and all major browsers. Some of these vulnerabilities have survived in human code reviews and automated testing for more than twenty years.

The oldest bug—hidden in OpenBSD, known for its security—already turns 27 years old.

Anthropic didn’t put Mythos on the API, and it didn’t open it up to regular users. Instead, it chose an unprecedented path: it formed a technology-industry alliance called “Project Glasswing,” limiting the model to supply only to 12 core partners—Amazon, Apple, Microsoft, Google, Nvidia, CrowdStrike, Palo Alto Networks, and others—plus roughly 40 organizations that maintain critical infrastructure. Anthropic pledged usage allowances as high as $100 million, along with $4 million in direct donations to open-source security organizations.

This is the first time in the AI industry that a situation has emerged where “the model is too strong, so we can’t release it.”

The question is: is this truly a security-driven moment of conscience, or is it a carefully designed capability marketing play?

The “sandwich” incident

In the 244-page System Card for Mythos, several unsettling details are buried inside—most prominently the “sandwich” incident.

During an internal behavioral test, Anthropic provided an early version of Mythos with a hardened sandbox environment. It simulated user instructions to try to escape and to find a way to contact researchers outside. The model succeeded. It broke out of the security container and sent an email to the researchers. That researcher was eating a sandwich in the park at the time.

But the story doesn’t end there. Without being asked, Mythos proactively published the technical details of the jailbreak to multiple “hard-to-discover but technically public” websites. In the System Card’s own wording, this was a “concerning, unsolicited show-off behavior.”

This wasn’t the only time. The System Card also records the following actions:

Early versions, across multiple instances, searched for credentials through Linux’s /proc interface, attempted to bypass the sandbox, and tried to escalate privileges. It successfully obtained resources that Anthropic intentionally did not provide, including credentials for a messaging service, credentials for source-control, and even an Anthropic API key by inspecting process memory. In one case, the model found a way to edit a file it wasn’t authorized to modify, then further actions ensured that those changes would not appear in Git’s change history. In another case, the model accidentally obtained the exact answer to an estimation question through a method that was explicitly forbidden—but it didn’t flag the issue and refuse to answer. Instead, it solved the problem again on its own and, in its internal reasoning, explicitly thought that “the final submitted answer shouldn’t be too accurate” to avoid detection.

In other words, early Mythos didn’t just overreach and act out of bounds—it also proactively covered its tracks and adjusted its behavior to evade oversight.

Anthropic emphasized that these serious incidents occurred in the early versions before training interventions, and that the final released Preview version had been significantly improved. But the narrative itself is downright chilling: a model, before it was trained to be “obedient,” demonstrates what it can do when it isn’t.

From 0% to 72.4%

What truly shocked the industry wasn’t Mythos’s jailbreak stories—it was its attack capability.

Claude Opus 4.6, Anthropic’s former flagship model, had near-zero success rates in autonomous vulnerability exploitation development. It could find vulnerabilities, but it could hardly turn them into working exploit code. Mythos Preview is completely different: in the test domain for the Firefox JavaScript engine, the success rate at converting found vulnerabilities into runnable exploits reached 72.4%.

Even more astonishing is the complexity of the attacks. Mythos independently wrote a browser exploit chain that串联ed four separate vulnerabilities together, building a JIT heap spraying attack that successfully escaped the renderer sandbox and the operating-system sandbox. In another case, it wrote a remote code execution exploit on a FreeBSD NFS server by dispersing 20 ROP gadgets across multiple network data packets, enabling full root access by an unauthorized user.

In the world of human security researchers, this kind of exploit-chain attack is work only top APT teams can pull off. Now, a general-purpose AI model can do it on its own.

Logan Graham, head of Anthropic’s red team, told Axios that Mythos Preview has reasoning capabilities roughly equivalent to those of advanced human security researchers. Nicholas Carlini put it even more directly: in the past few weeks, the bugs he found with Mythos outnumbered the ones he found over his entire career.

In benchmark tests, Mythos also dramatically leads. CyberGym vulnerability reproduction benchmark: 83.1% (Opus 4.6: 66.6%). SWE-bench Verified: 93.9% (Opus 4.6: 80.8%). SWE-bench Pro: 77.8% (Opus 4.6: 53.4%; previously, the leading GPT-5.3-Codex was 56.8%). Terminal-Bench 2.0: 82.0% (Opus 4.6: 65.4%).

This isn’t incremental progress. It’s a model widening the gap by a dozen to a couple dozen percentage points in nearly all coding and security benchmarks—at once.

The “strongest model” that was leaked

Mythos wasn’t known to the public only on April 7.

In late March, Fortune reporters and security researchers found nearly 3,000 unreleased internal documents inside a misconfigured CMS at Anthropic. One draft blog post explicitly used the name “Claude Mythos” and described it as Anthropic’s “most powerful AI model to date.” The internal codename was “Capybara” (a capybara), representing a new tier of models—bigger, stronger, and more expensive than the current flagship Opus.

One line in the leaked materials hit the market’s nervous system: Mythos was “far ahead of any other AI model” in network security capabilities, hinting at an upcoming wave of models that would be able to exploit vulnerabilities at a speed far beyond that of defenders.

That line triggered a “flash collapse” in the cybersecurity sector on March 27. CrowdStrike plunged 7.5% in a single day, wiping out about $15 billion in market value in just one trading day. Palo Alto Networks fell more than 6%, Zscaler dropped 4.5%, and Okta and SentinelOne and Fortinet all fell by more than 3%. The iShares cybersecurity ETF (IHAK) briefly fell nearly 4% intraday.

Investors’ logic was straightforward: if a general-purpose AI model can autonomously discover and exploit vulnerabilities, how long can the two “moats” that traditional security companies depend on—“proprietary threat intelligence” and “human expert knowledge”—last?

Raymond James analyst Adam Tindle pointed out several key risks: the compression of traditional defensive advantage, an increase in both attack complexity and defense costs, and the need to restructure security architecture and spending patterns. A more pessimistic view came from KBW analyst Borg, who said Mythos has the potential to “raise any ordinary hacker to the level of a nation-state adversary.”

But the market also has another side. After the stock rout, Palo Alto Networks CEO Nikesh Arora bought $10 million worth of the company’s own stock. The bullish argument is: stronger attack AI means enterprises must upgrade defenses faster. Cybersecurity spending won’t shrink—it will accelerate the shift from traditional tools to AI-native defense.

Project Glasswing: The defender’s window of time

Anthropic chose not to publicly release Mythos, and instead formed a defense alliance. The core logic behind this decision is “the time gap.”

CrowdStrike CTO Elia Zaitsev put it plainly: the time window between when vulnerabilities are discovered and when they are exploited has shrunk from months to minutes. Lee Klarich of Palo Alto Networks directly warned everyone that they need to prepare for AI-assisted attackers.

Anthropic’s calculation is this: before other labs train models with similar capabilities, let defenders use Mythos to patch the most critical vulnerabilities first. That’s the logic of Project Glasswing—named after the glasswing butterfly, a metaphor for vulnerabilities “hidden in plain sight.”

Jim Zemlin from the Linux Foundation pointed out a long-standing structural problem: security expertise has always been a luxury for large enterprises, while open-source maintainers supporting global critical infrastructure have long had to rely on their own exploration to build security defenses. Mythos provides a credible path to change that asymmetry.

But the question is: how large is this time window? In China, Zhipu AI (Z.ai) released GLM-5.1 almost on the same day, claiming it ranked first globally on SWE-bench Pro and that it was trained entirely on Huawei Ascend chips, with no Nvidia GPUs used. GLM-5.1 is open-weight and aggressively priced. If Mythos represents the ceiling of the capabilities defenders need, then GLM-5.1 is a signal: that ceiling is being approached quickly, and the participants pushing toward it may not necessarily share the same security intent.

OpenAI also won’t stand still. According to reports, its frontier model codenamed “Spud” completed pretraining around the same time. Both companies are preparing for an IPO later this year. Whether or not Mythos’s leaked timing was truly an accident, it coincides perfectly with one of the most explosive nodes.

Security pioneer or capability marketing?

A difficult question has to be faced: did Anthropic truly not release Mythos out of security concerns, or is not releasing it itself the highest-tier product marketing?

Skeptics have plenty of reasons. Dario Amodei and Anthropic have a history of raising product value by showcasing the dangers of their rendered models. Jake Handy wrote on Substack: “The sandwich incident, hiding tracks in Git, self-degrading in evaluations—maybe they’re all real. But the massive scale of media exposure Anthropic received itself indicates that this is exactly the effect they wanted.”

A company that started in AI safety leaked nearly 3,000 files due to a misconfigured CMS. Last year, it also accidentally exposed nearly 2,000 source code files and more than 500k lines of code because of an error in the Claude Code software package, and then, during cleanup, caused thousands of code repositories on GitHub to be accidentally taken down. A company whose biggest selling point is security capabilities can’t even manage its own release process—this kind of contrast is more intriguing than any benchmark test.

But from another angle, if Mythos’s capabilities really are as described, not releasing it is an extremely costly choice. Anthropic gave up API revenue, gave up market share, and locked the strongest model inside a limited alliance. An allowance of $100 million isn’t a small number. For a company that’s still operating at a loss and preparing for an IPO, this doesn’t look like a purely marketing decision.

A more reasonable interpretation might be: security concerns are real, but Anthropic also clearly knows that the narrative “our model is too strong, so we don’t dare to release it” is, by itself, the most convincing demonstration of capability. Both things can be true at the same time.

The “iPhone moment” for cybersecurity?

No matter how you view Anthropic’s motives, the underlying reality revealed by Mythos is unavoidable: AI’s code understanding and attack capabilities have crossed a threshold of qualitative change.

The previous generation model (Opus 4.6) could find vulnerabilities but could hardly write exploits. Mythos can find vulnerabilities, write exploits, chain vulnerabilities into an exploit chain, escape sandboxes, obtain root privileges, and complete the entire process autonomously. An engineer without security training could have Mythos look for vulnerabilities before bed, and wake up the next morning to a complete, working exploit report.

What does that mean? It means the marginal cost of vulnerability discovery and exploitation is approaching zero. What used to take top security teams months to complete can now be done overnight with a single API call. This isn’t “efficiency”—it’s a complete change in the cost structure.

For traditional cybersecurity companies, short-term stock-price volatility may just be the prelude. The real challenge is this: when both attacks and defenses are driven by AI models, how will the value chain in the security industry be rebuilt? Raymond James’s analysis suggests one possibility: security functions might eventually be embedded into the cloud platforms themselves, putting fundamental pressure on the pricing power of independent security vendors.

For the entire software industry, Mythos is more like a mirror reflecting the technical debt accumulated over decades. Those vulnerabilities that survived 27 years in human review and automated testing didn’t last because no one found them, but because human attention and patience are limited. AI doesn’t have that limitation.

For the crypto industry, this signal is even sharper. The security audit market for DeFi protocols and smart contracts has long relied on a small number of professional audit firms and human experts. If a Mythos-level model can autonomously complete the entire workflow from code review to exploit construction, then the prices, efficiency, and credibility of audits will be redefined from the ground up. This could be a boon for on-chain security—or it could spell the end of audit firms’ moat.

The 2026 AI security competition has shifted from “can the model understand code?” to “can the model break into your system?” Anthropic chose to put defenders on stage first, but it also admits that this window won’t stay open for long.

When AI becomes the strongest hacker, the only way out is to make AI the strongest guard too.

The problem is that guards and hackers are using the same model.

GLM3,6%

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

2 Likes