Slower but more accurate AI coding: multi-model PR review to minimize the chance of bugs

Former Microsoft senior engineer Nolan Lawson has three tools (Claude, Codex, and Cursor Bugbot) review each PR in parallel, cross-checking their findings to push false positives close to zero.

The well-known advantage of AI coding is “rapidly generating large amounts of code,” but its accuracy remains questionable. Former Microsoft and Salesforce senior engineer Nolan Lawson recently documented a new workflow on his blog: he has multiple large language models review each pull request (PR: in simple terms, a request to merge new code into a project). The goal is to cross-verify the reviews and identify real bugs rather than to produce more code faster.

This process doesn’t increase his code output, but significantly improves code quality.

LLMs are inherently good at finding bugs

Anthropic’s Glasswing project, launched this year (a public update to the Mythos system), provides direct data to support this claim.

The system has LLM agents scan real open-source code at scale. After scanning over 1,000 open-source projects, it estimates it found 23,019 vulnerabilities in total (including low-severity ones), of which 6,202 were high severity or critical. Of the 1,752 vulnerabilities checked individually by independent security firms, 90.6% were confirmed as real issues and 62.4% were classified as high severity or critical.
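To put those percentages in absolute terms, the short calculation below converts them to counts. It is only a sanity check on the quoted figures, and it assumes both percentages are taken over the 1,752 verified vulnerabilities, which the report summary does not spell out.

```python
# Figures quoted in the article.
total_found = 23_019      # estimated vulnerabilities, all severities
severe_found = 6_202      # estimated high-severity or critical
verified = 1_752          # individually checked by independent firms
confirmed_rate = 0.906    # share confirmed as real issues
severe_rate = 0.624       # share rated high severity or critical

# Assumption: both rates use the 1,752 verified findings as the denominator.
severe_share = severe_found / total_found
confirmed_count = round(verified * confirmed_rate)
severe_count = round(verified * severe_rate)

print(f"{severe_share:.1%} of estimated findings are high severity or critical")
print(f"about {confirmed_count} of {verified} verified findings were confirmed")
print(f"about {severe_count} of {verified} verified findings were high or critical")
```

Under that assumption, roughly 27% of the estimated findings are high severity or critical, and the verified sample works out to about 1,587 confirmed issues and about 1,093 rated high or critical.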

These numbers indicate a fundamental shift: finding bugs is no longer the bottleneck; verification and fixing are.

Anthropic explicitly states in their report: “Progress in software security, once limited by the speed of vulnerability discovery, is now limited by the speed of verification, disclosure, and patching.” In other words, AI has shifted the bottleneck from “discovery” to “handling capacity.”

Cross-verification logic of multi-model review

Lawson’s core approach is to run multiple models from different providers simultaneously for PR review, rather than relying on a single model.

His toolkit includes Claude Code, OpenAI’s Codex, and Cursor Bugbot. Each reviews the same pull request independently and in parallel; the results are then aggregated and sorted by severity: critical, high, medium, low.
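The aggregation step can be sketched as follows. This is a minimal illustration, not Lawson’s actual tooling: the `Finding` record, its fields, and the sample reports are all hypothetical, and the real tools each emit their own output formats.

```python
from dataclasses import dataclass

# Severity levels named in the article, most severe first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    """Hypothetical normalized record for one reviewer's finding."""
    reviewer: str
    file: str
    line: int
    severity: str
    message: str


def aggregate(reports):
    """Flatten per-reviewer reports into one list, most severe first."""
    merged = [finding for report in reports for finding in report]
    return sorted(
        merged,
        key=lambda f: (SEVERITY_ORDER[f.severity], f.file, f.line),
    )


# Made-up sample data: one report per reviewer.
reports = [
    [Finding("claude", "auth.py", 42, "high", "token not validated")],
    [Finding("codex", "auth.py", 42, "critical", "token not validated"),
     Finding("codex", "util.py", 7, "low", "unused variable")],
    [Finding("bugbot", "db.py", 13, "medium", "connection not closed")],
]

ranked = aggregate(reports)
for f in ranked:
    print(f"[{f.severity}] {f.file}:{f.line} ({f.reviewer}) {f.message}")
```

Sorting on a fixed severity table rather than on the raw string keeps the order critical, high, medium, low, which alphabetical sorting would not.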

The key idea behind this cross-verification design is that a single model is prone to false positives, but when multiple models with different training data and architectures point to the same issue, the false positive rate drops sharply. Because each model also catches issues the others miss, coverage increases at the same time. As Lawson puts it: “False positive rate approaches zero, and bug coverage is very high.”
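A back-of-the-envelope calculation shows why agreement helps. The sketch below assumes each reviewer’s false positives are statistically independent and uses a made-up 20% rate per reviewer; neither figure comes from Lawson. Real models share training data, so their errors are correlated and the true combined rate would be higher than this idealized number.

```python
import math


def all_agree_false_positive(rates):
    """Chance that every reviewer raises the same false alarm.

    Assumes the reviewers err independently (an idealization).
    """
    return math.prod(rates)


single_rate = 0.2  # hypothetical per-reviewer false positive rate
combined = all_agree_false_positive([single_rate] * 3)

print(f"one reviewer:    {single_rate:.1%}")
print(f"three agreeing:  {combined:.1%}")
```

With these illustrative inputs, three independent reviewers would all raise the same false alarm less than 1% of the time, compared with 20% for one reviewer alone.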

His decision process is clear. All critical and high issues must be fixed first. Medium and low issues are evaluated individually by weighing “repair cost” against “actual impact”; those not worth fixing are skipped to save development resources. If a PR has too many critical issues, the entire PR is abandoned and redone rather than patching over fundamental problems.
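These rules can be expressed as a small triage function. This is a sketch under stated assumptions: the article gives no numeric threshold for “too many” critical issues and no formula for cost versus impact, so `max_critical`, `repair_cost`, and `impact` are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    severity: str       # "critical", "high", "medium", or "low"
    repair_cost: float  # hypothetical estimate, e.g. hours to fix
    impact: float       # hypothetical score on the same scale


def triage(issues, max_critical=3):
    """Return ("redo", []) or ("fix", issues_to_fix).

    max_critical is an assumed cutoff for "too many critical issues".
    """
    critical_count = sum(1 for i in issues if i.severity == "critical")
    if critical_count > max_critical:
        return "redo", []  # abandon the PR and start over

    # Critical and high issues are always fixed.
    must_fix = [i for i in issues if i.severity in ("critical", "high")]
    # Medium and low issues are fixed only when impact outweighs cost.
    worth_fixing = [
        i for i in issues
        if i.severity in ("medium", "low") and i.impact > i.repair_cost
    ]
    return "fix", must_fix + worth_fixing


issues = [
    Issue("critical", repair_cost=5.0, impact=9.0),
    Issue("high", repair_cost=3.0, impact=1.0),
    Issue("medium", repair_cost=4.0, impact=1.0),
    Issue("low", repair_cost=0.5, impact=2.0),
]
decision, to_fix = triage(issues)
print(decision, [i.severity for i in to_fix])
```

In this example the expensive, low-impact medium issue is skipped, while the cheap low-severity one is fixed because its impact exceeds its cost.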

Lawson’s review technique rests on an observation about multi-model performance in code review: the more diverse the models, the more accurate the final report. The underlying principle is “multi-model bias reduction”: models with different training backgrounds bring different biases to the same code, so majority voting can effectively filter out the blind spots of any single model.
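Majority voting over findings can be sketched like this. The article does not describe how findings from different models are matched to each other, so keying on a (file, line) pair is a simplifying assumption, and the function and sample data are hypothetical.

```python
from collections import defaultdict


def majority_findings(findings, reviewer_count):
    """Keep locations flagged by more than half of the reviewers.

    findings: iterable of (reviewer, (file, line)) pairs.
    Matching findings by exact location is an assumption for illustration.
    """
    votes = defaultdict(set)
    for reviewer, location in findings:
        votes[location].add(reviewer)  # a set counts each reviewer once

    quorum = reviewer_count // 2 + 1
    return sorted(loc for loc, who in votes.items() if len(who) >= quorum)


# Made-up findings from three reviewers.
findings = [
    ("claude", ("auth.py", 42)),
    ("codex", ("auth.py", 42)),
    ("bugbot", ("auth.py", 42)),
    ("codex", ("util.py", 7)),
    ("claude", ("db.py", 13)),
    ("bugbot", ("db.py", 13)),
]

kept = majority_findings(findings, reviewer_count=3)
print(kept)
```

Here the issue in `util.py`, flagged by only one reviewer, is dropped, while the two locations flagged by at least two of the three reviewers are kept. A stricter filter would also discard real bugs that only one model noticed, so the quorum trades false positives against coverage.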

Speed decreases, quality improves

After adopting this workflow, Lawson’s code output (in lines) did not increase. Instead, the reviews often uncovered pre-existing bugs, which forced him to write unit tests (automated tests that verify small functions individually). Fixing old issues often takes longer than developing new features.

This isn’t the outcome he expected, but from another perspective, it signals that the codebase’s health is being systematically strengthened.

Lawson calls this working style “more textured vibe coding”: cautious, methodical, quality-oriented.

While developer tools often emphasize “speed” as the main selling point, the real problem engineers need to solve is never just speed. Every line of code carries maintenance costs and some probability of defects. Using AI this way may slow down coding, but it makes each line more durable and less likely to cause problems over time.
