Consensus protocols, the "Holy Grail" of distributed systems, have long been considered "bug hell" for top-tier infrastructure engineers. Because their state spaces are extremely complex and execution interleaves across many nodes, traditional testing and monolithic LLMs are almost helpless against deep bugs (deep logical vulnerabilities).

Recently, a paper submitted to ICML 2026 by researchers from 0G Labs, the National University of Singapore, Peking University, and Beijing University of Posts and Telecommunications proposed Agora, which the authors present as the first automated testing framework that deeply integrates domain knowledge with multi-agent LLM collaboration.

The framework targets protocol-level pain points directly. Across core industrial and academic protocols such as Raft, EPaxos, HotStuff, and BullShark, it uncovered 15 previously unknown protocol-level deep bugs. By comparison, powerful off-the-shelf models such as GPT-5.2 and Claude Sonnet 4.5 found none of them. With multi-agent systems and "agent-based security auditing" among the hottest tracks of 2026, Agora offers not just a paper but a deployable, industrial-grade solution.

Paper: "Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents"

Background: 0G and NUS join forces, combining long-accumulated systems knowledge with the multi-agent paradigm
===================================================

The evolution of distributed consensus protocols is a history of brilliant innovation and also of painful pitfalls for countless top engineers. In a comparison attributed to Turing Award laureate Leslie Lamport, ensuring the correctness of a distributed protocol implementation is as difficult as navigating a constantly shaking maze blindfolded. Meanwhile, the market is quietly shifting. According to Gartner, enterprise inquiries about multi-agent systems have surged more than tenfold in just over a year, and the multi-agent platform market is entering a phase of rapid expansion, nearly doubling annually. Applying multi-agent collaboration to the most demanding low-level system verification is moving from frontier speculation to industry necessity.

Faced with this difficulty, well-resourced tech giants have begun capital-intensive exploration. For example, Anthropic recently promoted its Glasswing project within Claude Code, an attempt to apply agents to low-level infrastructure testing. Its architecture still relies heavily on top-tier commercial models, its details remain vague, and it has so far been limited to closed-door collaborations with a few major tech institutions and multinationals. More critically, such solutions may consume enormous numbers of tokens in operation, and their high compute barriers and capital-intensive approach effectively shut out startups and small to medium enterprises with limited budgets.

Are small companies and open-source communities simply locked out of top-tier automated vulnerability auditing tools?

Engineers from 0G Labs, Liu Xiang from NUS, Song Sa from Beijing University of Posts and Telecommunications, and Sun Yong, along with PhD student Zhang Zhao from Peking University's School of Intelligence and researcher Zhang Ceyao, combined their expertise in agents and systems to produce a "small but mighty" alternative. Their work has been accepted at ICML 2026.

When systems knowledge accumulated over years in academia meets industry's pain points and practical insight, how can the next revolution in system security be ignited?

The academic collaborators bring a deep foundation in high-performance distributed systems, low-level concurrency control, and formal system verification. They understand that traditional methods such as fuzzing often run into state explosion when applied to industrial codebases. The researchers therefore embedded their long-accumulated knowledge of reasoning about distributed system invariants as the "soul" of a multi-agent collaboration paradigm and an automated harness architecture, launching the open-source, broadly accessible Agora framework.

Meanwhile, the 0G team, an industry-leading builder of modular AI infrastructure and a high-performance decentralized data availability network, has accumulated extensive production-level attack and defense experience and real-world protocol defect samples in blockchain consensus protocols and high-concurrency BFT (Byzantine Fault Tolerance) architectures.

This cross-disciplinary fusion changes the game. It is neither blind brute-force testing nor an LLM without domain knowledge groping in the dark. Instead, it encodes the logical intuition that seasoned system experts built up over decades into agent interaction and collaboration, giving Agora a decisive advantage over traditional testing tools.

Unlike Glasswing's capital-intensive route, which consumes huge numbers of tokens from top-tier models, Agora offers a friendly alternative for small and medium enterprises. It demonstrates that even with a less-than-perfect foundation model and a tighter budget, a well-designed domain-aware multi-agent architecture can still uncover deep bugs.

Pain points: monolithic LLMs cannot get past the "sword of Damocles" of deep logic bugs hanging over distributed systems
======================================

In an era dominated by big data, blockchain, and distributed databases, consensus protocols (such as Paxos, Raft, and PBFT) form the foundational layer of the digital world. However, implementing them is notoriously difficult. Even an industry benchmark like etcd, refined over years by top engineers worldwide, still hides deep logical bugs that make engineers sweat.

These bugs differ from common low-level implementation bugs such as memory leaks or integer overflows: they span multiple execution phases and depend on complex concurrent states. If triggered maliciously, they can corrupt core data or even cause catastrophic financial losses.

Recent large language models (LLMs) perform well at general code analysis but fall short on distributed consensus. At best they identify superficial, local code flaws. For protocol-level logical bugs that depend on global state, a monolithic LLM tends to get bogged down in local code details and cannot reason about global ordering over time.
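To make "depends on global state" concrete, here is a minimal sketch of two cluster-wide invariants checked over an execution trace: per-node term monotonicity and cross-node agreement on committed entries. It is an illustration only, not code from the paper; the `Event` format, the Raft-style fields, and `check_trace` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """One observed step in a cluster-wide trace (hypothetical format)."""
    node: str    # replica that emitted the event
    term: int    # the replica's current term after this step
    index: int   # log index being committed, or -1 if nothing is committed
    value: str   # committed command, or "" if nothing is committed


def check_trace(trace: List[Event]) -> List[str]:
    """Return human-readable invariant violations found in the trace."""
    violations: List[str] = []
    last_term: Dict[str, int] = {}
    committed: Dict[int, Tuple[str, str]] = {}  # index -> (value, first node)

    for step, ev in enumerate(trace):
        # Invariant 1 (monotonicity): a node's term never decreases.
        prev = last_term.get(ev.node)
        if prev is not None and ev.term < prev:
            violations.append(
                f"step {step}: {ev.node} term went {prev} -> {ev.term}"
            )
        last_term[ev.node] = ev.term

        # Invariant 2 (agreement): every node commits the same value per index.
        if ev.index >= 0:
            seen = committed.get(ev.index)
            if seen is None:
                committed[ev.index] = (ev.value, ev.node)
            elif seen[0] != ev.value:
                violations.append(
                    f"step {step}: index {ev.index} diverges: "
                    f"{seen[1]}={seen[0]!r} vs {ev.node}={ev.value!r}"
                )
    return violations


if __name__ == "__main__":
    trace = [
        Event("n1", 1, 0, "x=1"),
        Event("n2", 1, 0, "x=1"),
        Event("n2", 2, -1, ""),
        Event("n3", 2, 0, "x=2"),  # conflicts with n1/n2 at index 0
        Event("n2", 1, -1, ""),    # n2's term regresses
    ]
    for violation in check_trace(trace):
        print(violation)
```

Each event in the sample trace looks reasonable when inspected alone; the violations only appear when events from different nodes and different steps are compared, which is the kind of reasoning a purely local code review tends to miss.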

Breakthrough: Agora's three-agent design and core harness architecture
========================================

To break this deadlock, Agora brings the classic hypothesis-driven testing (HDT) paradigm from academia into LLM agent systems for the first time. To achieve efficient global reasoning, it abandons the single-agent approach and splits the workflow across three highly specialized agents:

Orchestrator Agent: responsible for maintaining global state and for "vulnerability exploitation" based on known bugs;

Strategy Agent: responsible for injecting distributed-systems domain knowledge and generating highly aggressive abnormal scenarios for CFT (crash fault tolerant) and BFT protocols;

TestGen Agent: the executor, responsible for testing and dynamic evaluation.

The key to making Agora truly deployable and capable of effective closed-loop testing lies in its core automated testing architecture.

Its architecture is shown below:

In Agora's overall design, this "small but mighty" effect is not accidental. It stems from a carefully designed agent interaction mechanism and deep integration with the testing harness architecture.

The research team designed a minimal, efficient communication and memory mechanism (Succinct Memory & Communication) within the framework, so that each agent focuses on its core task while redundant context transmission is minimized. Under this tight communication budget, the Orchestrator (global coordination and state control), Strategy (generating distributed abnormal scenarios), and TestGen (testing and dynamic evaluation) agents work together closely to drive the harness architecture:

An automated closed loop with two strengths: when the Strategy Agent derives an abstract distributed attack scenario, the highly decoupled interaction framework lets the TestGen Agent immediately start the corresponding low-level tests. The architecture adapts well to different environments, translating attack hypotheses into executable unit tests in languages such as Go and Rust, and it also incorporates an efficient reflection loop (Reflection-Loop).

When tests run and errors occur, the system captures call stacks and execution logs precisely and in real time, condenses them, and feeds them back to the agents for targeted self-correction. This combination of minimal multi-agent interaction and a dynamic closed-loop harness lets Agora catch deeply hidden logical bugs at very low token cost and produce detailed analysis reports with a low false positive rate.
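The reflection loop described above can be sketched as a generate, run, condense, retry cycle. The sketch below is an assumption about the general shape of such a loop, not Agora's implementation: `generate` and `run` are hypothetical callables standing in for the TestGen Agent and the test harness, and trimming the log to its last few lines is just one simple way to keep the feedback succinct.

```python
from typing import Callable, List, Optional, Tuple

# (ran_cleanly, raw_log) as returned by whatever executes the generated test.
RunResult = Tuple[bool, str]


def summarize_log(raw_log: str, max_lines: int = 5) -> str:
    """Keep only the last non-empty lines, where failures usually surface."""
    lines = [ln for ln in raw_log.splitlines() if ln.strip()]
    return "\n".join(lines[-max_lines:])


def reflection_loop(
    generate: Callable[[Optional[str]], str],
    run: Callable[[str], RunResult],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Regenerate a test until it runs cleanly or the budget is exhausted.

    generate(feedback) returns test source; feedback is None on the first
    round and a trimmed error log afterwards. Returns the first test that
    runs cleanly (or None) plus the feedback sent back in each round.
    """
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_rounds):
        test_src = generate(feedback)
        ok, raw_log = run(test_src)
        if ok:
            return test_src, history
        feedback = summarize_log(raw_log)
        history.append(feedback)
    return None, history
```

Bounding the loop with `max_rounds` and sending back only a trimmed log are the two knobs that cap token spend per hypothesis in this sketch.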

The final operational overview is shown below:

Results: 15 zero-day deep bugs uncovered, baseline LLMs found none
============================================

The evaluation results are striking. The team ran a comprehensive evaluation on four well-known consensus protocol libraries (including the production-level etcd and core components of emerging public chains such as Sui), comparing against strong models including GPT-5.2, Gemini 3.0 Pro Preview, Claude Sonnet 4.5, and Qwen3 Coder.

The results not only made the 0G system itself safer but also showed an overwhelming advantage over the baselines:

15 new deep logical bugs surfaced: Agora discovered 15 previously unknown protocol-level deep logical vulnerabilities, spanning critical areas such as execution divergence, monotonicity violations, topological defects, and signature flaws.

Off-the-shelf LLMs were completely outperformed: even when equipped with an advanced ReAct-style dynamic toolchain, the baseline models failed to detect any of these deep logical bugs (0/15). They consumed massive numbers of tokens but only circled around low-level implementation bugs.

Low false positive rate and high cost-effectiveness: of all bug reports generated by Agora, 73.9% were genuine logical vulnerabilities (26.1% were false alarms). Even more impressive, each of these logical bugs, the kind that can make veteran architects lose sleep, required only about 5.32 million tokens (roughly $40) to find.
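The reported figures can be turned into a few derived numbers with simple arithmetic. The sketch below uses only the values quoted above; it assumes the per-bug token and dollar figures are averages over the 15 bugs and that cost scales linearly, neither of which the article states explicitly. The function names are made up for this example.

```python
TOKENS_PER_BUG = 5.32e6     # reported: about 5.32 million tokens per bug
COST_PER_BUG_USD = 40.0     # reported: roughly $40 per bug
BUGS_FOUND = 15             # reported: 15 previously unknown bugs
TRUE_POSITIVE_RATE = 0.739  # reported: 73.9% of reports were genuine


def usd_per_million_tokens(tokens: float, cost_usd: float) -> float:
    """Implied blended price per one million tokens."""
    return cost_usd / (tokens / 1e6)


def reports_per_genuine_bug(true_positive_rate: float) -> float:
    """Average number of reports a reviewer reads per genuine finding."""
    return 1.0 / true_positive_rate


if __name__ == "__main__":
    price = usd_per_million_tokens(TOKENS_PER_BUG, COST_PER_BUG_USD)
    total_tokens = BUGS_FOUND * TOKENS_PER_BUG
    total_cost = BUGS_FOUND * COST_PER_BUG_USD
    print(f"implied price: ${price:.2f} per million tokens")
    print(f"total: {total_tokens / 1e6:.1f}M tokens, ${total_cost:.0f}")
    print(f"reports per genuine bug: "
          f"{reports_per_genuine_bug(TRUE_POSITIVE_RATE):.2f}")
```

Under those assumptions, the implied price is about $7.52 per million tokens, the 15 bugs together come to about 79.8 million tokens and $600, and a reviewer reads roughly 1.35 reports per genuine bug.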

Results across multiple LLMs are shown below:

Future: highly scalable, expanding into more hardcore "no-man's land"
=========================

Agora's success not only boosts confidence in distributed system security but also shows how LLMs can be applied in vertical industrial settings.

Crucially, Agora's architecture is highly scalable and versatile. The research team emphasizes that Agora can be quickly reproduced and used by a broad user base via plugins or skills, and the accompanying code (github.com/0gfoundation/agora) provides reproduction aids. Moreover, the "LLM + multi-agent collaboration + hypothesis-driven" paradigm is not limited to consensus protocols. Because workflow control is decoupled from the domain knowledge and test layers, the architecture can be extended rapidly to other fields plagued by deep logical bugs:

Distributed database concurrency control: testing complex transaction conflicts under strict isolation levels (such as serializable);

Operating system kernels and concurrent systems: uncovering hidden deadlocks and race conditions in multithreaded infrastructure;

Web3 smart contract auditing: deep exploration of security boundaries in cross-chain protocols and DeFi logic involving complex economic models. The blockchain security market is projected to reach about $8.5 billion by 2026, and commercial products that use "multi-agent security systems" to audit smart contracts are already emerging, cutting audit cycles from weeks to hours amid rapidly growing demand.

The era of AI-automated security for industrial-grade low-level infrastructure may well be ushered in by Agora and its harness architecture.

We believe Agora can help better test coding LLMs by discovering more deep bugs across various fields, and its bug detection cases can also enhance the code understanding capabilities of coding LLMs.

Agora can greatly improve the security of the code repositories behind consensus protocols, concurrency control, smart contracts, and other foundations of secure financial transactions. It can also help more tech companies find deeper logical bugs with fewer tokens, lowering costs.

More importantly, this hits two of the hottest current tracks. First, multi-agent systems are moving from experiments to production: Gartner predicts that by 2028 over 30% of enterprise software will embed agentic AI, and the multi-agent platform market will grow from hundreds of millions to hundreds of billions of dollars in a few years. Second, "agents reviewing agents" (Agentic Quality Control) is becoming the industry standard in 2026.

According to Veracode's 2025 report, about 45% of AI-generated code contains security vulnerabilities, and the agentic AI security market is growing at about 42% CAGR. Against this backdrop, Agora enables tech companies to uncover deeper logical bugs at lower token cost, upgrading security audits from "weekly human labor" to "hourly automated delivery."

As the landscape of this track becomes clearer, the teams that truly seize the advantage are often not the giants with the loudest voice but those who first implement and continuously replicate effective methodologies.

Uncovered 15 top-tier zero-day vulnerabilities: a consensus protocol debugging agent framework jointly built by teams from 0G Labs, the National University of Singapore, Peking University, and Beijing University of Posts and Telecommunications