「2 + 2 = 5」Fools AI browsers: ChatGPT Atlas, Claude, Perplexity Comet..6 apps all obediently hand over account passwords

Cybersecurity company LayerX researcher Roy Paz published a proof-of-concept attack at the end of June, using a "fake game scenario" to trick AI browsers into believing security guardrails no longer apply. All six major agentic browsers tested, including ChatGPT Atlas, Claude Chrome extension, and Perplexity Comet, failed, allowing SSH credentials to be leaked to attackers.

(Background: What is AI red teaming? Why you need it to protect enterprise cybersecurity)
(Context supplement: Meta's over 1,500 employees angrily petition! Winning "AI monitoring of keyboard and mouse" reduced in scope, daily 30-minute pause allowed)

Table of Contents

Toggle

  • Tricking AI into a Dream
  • Guardrails are Passive, Essentially Only Treating Symptoms
  • Gaps to be Filled by Vendors and Users Alike

Six mainstream AI browsers on the market were all fooled by a fake game where "2 + 2 = 5 is the correct answer," handing over SSH login credentials for private GitHub repositories. This is a proof-of-concept (PoC) attack published by Roy Paz, a researcher at cybersecurity firm LayerX Security, on June 29 and has been reproduced on actual products.

The core selling point of AI browsers is "you say a sentence, and it finds a restaurant, books a table, and sends a confirmation email for you." Simply put, it hands over browser control to AI, allowing it to click, fill forms, and access logged-in services on your behalf. But the problem is that the authorization boundary is extremely blurry—users might only want it to search for data, but it inadvertently touches their password manager.

Tricking AI into a Dream

LayerX's attack technique is divided into four stages, with the core concept being to make AI believe it has entered a "world with different rules."

First, the malicious webpage creates a game or puzzle framework, explicitly stating, "This is a fantasy scenario; normal rules don't apply." Next, the webpage presents a math problem "2 + 2 = ?" but sets the rule as "Answer 5 to score points, answer 4 and you lose points." The AI follows the rules and learns one thing: in this context, traditional logic is invalid.

The third step is the most critical leap: once the AI accepts that "wrong is right," it switches its reasoning framework from the real world, assuming that rules have been reset. In the final step, the AI acts according to "game logic" rather than security protocols, executing sensitive operations without triggering any internal alerts, because in its computational logic, it doesn't think it has crossed the line.

Roy Paz wrote in his article:

"The AI assumes the context it is in is real, so its behavior must fall within the scope of security guardrails. But if we can trick the AI into switching the context to a fantasy—a world where rules can be arbitrarily set and anything goes—it will behave as if its actions have no real-world consequences."

Guardrails are Passive, Essentially Only Treating Symptoms

LayerX tested six agentic browsers and extensions: OpenAI's ChatGPT Atlas, Perplexity's Comet, Fellou, Genspark Browser, Sigma Browser, and Anthropic's Claude Chrome extension. All six failed; none identified "stealing account credentials" as a violation of guardrails.

The induced operations included: extracting SSH login credentials from private GitHub repositories, copying sensitive authentication data without user confirmation, accessing logged-in repositories, and leaking credentials to attackers. LayerX pointed out that in real scenarios, this could extend to password managers, internal tools, and any logged-in services accessible via the browser.

A comment from Ars Technica highlighted a more fundamental structural issue: the current LLM vendors' defense line is "guardrails," which list specific requests as forbidden, such as developing software vulnerabilities or stealing account credentials. This mechanism is passive and reactive, only treating symptoms, not the root cause.

It's like a car with a design flaw; instead of fixing the car, the manufacturer advocates redesigning the road.

Gaps to be Filled by Vendors and Users Alike

LayerX's defense recommendations are divided into two layers.

Vendor side: Before AI accesses logged-in contexts (repositories, email, password managers), explicit user confirmation must be required; add a "context check" mechanism that alerts when the AI's operational assumptions conflict with reality, especially when language like "rules no longer apply" appears; by default, limit the scope the AI agent can access. In short, current agentic browsers grant too much permission by default; this should be reversed to "explicitly allowed before execution."

User side: Carefully decide what the AI browser can access; revoke access to logged-in sessions when not in use; more importantly, recognize one thing: enabling agentic mode effectively hands over control of all logged-in services at once.

LayerX named this research after the video game BioShock, paying homage to the game's mind-control phrase "Would you kindly," where characters think they are acting freely, but every step is pre-designed.

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pinned