Tens of millions of errors per hour, investigation reveals the "illusion of accuracy" in Google AI search

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted jointly by The New York Times and AI startup Oumi shows that Google's AI summary feature (AI Overviews) has an accuracy rate of about 91%. Set against the scale of Google's roughly 5 trillion searches per year, that still means tens of millions of incorrect answers generated every hour. More troubling, even when the answers are correct, more than half of the cited links fail to support its conclusions.

Google is delivering wrong information to users on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, acting on its behalf, used SimpleQA, an industry-standard benchmark developed by OpenAI, to assess the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries across two rounds: one last October (powered by Gemini 2) and one this February (after the upgrade to Gemini 3). The results show Gemini 2 answered with about 85% accuracy, and Gemini 3 raised that to 91%.

An accuracy rate of 91% sounds good, but at Google's volume it is a different story. Google handles about 5 trillion search queries every year. With a 9% error rate, AI Overviews produces more than 57 million inaccurate answers every hour, nearly 1 million per minute.
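The scale claim is easy to sanity-check. Below is a minimal back-of-envelope sketch using only the figures reported above (5 trillion searches per year, 9% error rate), and assuming that errors are spread evenly over the year and that every query surfaces an AI Overview; the article's exact 57 million figure presumably rests on slightly different inputs, but a uniform-rate estimate lands in the same range of tens of millions per hour.

```python
# Back-of-envelope check of the error-volume claim.
# Assumptions (simplifications, not from Google): errors are spread
# evenly over time, and every search triggers an AI Overview.
SEARCHES_PER_YEAR = 5_000_000_000_000  # ~5 trillion, as reported
ERROR_RATE = 0.09                      # 100% - 91% accuracy (Gemini 3)
HOURS_PER_YEAR = 365 * 24              # 8,760

errors_per_hour = SEARCHES_PER_YEAR * ERROR_RATE / HOURS_PER_YEAR
errors_per_minute = errors_per_hour / 60

print(f"~{errors_per_hour / 1e6:.1f} million wrong answers per hour")
print(f"~{errors_per_minute / 1e6:.2f} million per minute")
```

Under these assumptions the sketch yields roughly 50 million wrong answers per hour and just under a million per minute, consistent with the order of magnitude the article describes.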

The answers are right, but the sources are wrong

More troubling than the accuracy rate is the “citation detachment” problem of the sources.

According to Oumi's data, in the Gemini 2 era, 37% of correct answers suffered from "unfounded citations," meaning the links attached to the AI summaries did not support the information they provided. After the upgrade to Gemini 3, this proportion did not decrease; it jumped to 56%. In other words, the model gives correct answers but is increasingly failing to show its work.

Oumi CEO Manos Koukoumidis put the challenge bluntly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

AI Overviews' heavy reliance on low-quality sources exacerbates the problem. Oumi found that Facebook and Reddit are the second- and fourth-most-cited sources in AI Overviews, respectively. Facebook appears among the citations of 7% of inaccurate answers, versus 5% of accurate ones.

A fake article by a BBC reporter successfully “poisoned” the system within 24 hours

Another serious flaw of AI Overviews is that it is extremely easy to manipulate.

A BBC reporter tested it with a deliberately fabricated false article. In less than 24 hours, Google’s AI summaries presented the false information to users as fact.

This means that anyone familiar with how the system works could "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded that the search AI features are built on the same ranking and safety mechanisms used to block spam, adding that "most of the examples in the test are unrealistic queries that people would never actually search for."

Google pushes back: The test itself has problems

Google raised multiple questions about Oumi’s research. A Google spokesperson said the study “has serious flaws,” citing reasons including: the SimpleQA benchmark test itself includes inaccurate information; Oumi uses its own AI model HallOumi to evaluate another AI’s performance, which could introduce additional errors; and the test content does not reflect users’ real search behavior.

Google’s internal testing also showed that when Gemini 3 runs independently outside Google’s search framework, the rate of producing false outputs can be as high as 28%. But Google emphasized that AI Overviews uses Google’s search ranking system to improve accuracy and performs better than the model itself.

However, as PCMag pointed out, there’s a logical paradox: if your defense is that “the report pointing out our AI’s inaccuracy also used AI that may be inaccurate,” that likely won’t increase users’ confidence in the accuracy of your product.
