Predicting the World Cup Knockout Rounds: Do AI Models Really Differ This Much?

Original | Odaily Planet Daily (@OdailyChina)

Author | Asher (@Asher_0210)

Every World Cup match day, I ask AI to predict the outcome beforehand, and almost every model sounds confident and full of details.

Some talk about each team’s market value, some break down group-stage data, some analyze injuries and tactics, and some even directly provide scripts for the scoreline, extra time, and penalty shootouts. At first glance, ChatGPT, Grok, Qianwen, DeepSeek, Gemini, and Claude all seem to “get” football.

But as a prediction market user, what I truly care about isn’t which model is more comprehensive—it’s which one is more worth referring to.

Since the World Cup entered the knockout stage, Odaily Planet Daily has been running the same test from the very first match: before each game, it asks different AI models the same questions as consistently as possible, and after each game, it compares the answers with the real results to identify which models merely sounded convincing and which truly caught the match’s direction in advance.

So far, in the knockout matches that have already finished, Canada edged out South Africa 1-0 in a sudden-death win, Brazil narrowly beat Japan 2-1, Germany were eliminated after being dragged into a penalty shootout by Paraguay, and the Netherlands also went out to Morocco on penalties. Then in Belgium vs Senegal, the uncertainty was turned up to the max: the match finished 2-2, and the result was decided by a comeback in extra time.

DeepSeek and Gemini made their name by calling Morocco

For now, the most memorable predictions still come from DeepSeek and Gemini for the Netherlands vs Morocco match. Before that game, it was actually easy to pick the wrong side—on paper, the Netherlands were stronger and their lineup was more complete. Many models knew Morocco would be difficult to play against, but in the end, most still trusted the Netherlands to get through.

What’s impressive about DeepSeek and Gemini is that they didn’t stop at “this match will be tight”; they also wrote the script that followed. Before the match, Gemini directly predicted 1-1 in regular time and a Morocco win in the penalty shootout. The result: the game really did end 1-1, and Morocco eliminated the Netherlands 3-2 on penalties. This wasn’t just getting the direction right. It was also basically spot-on about how the match would be dragged into penalties and who would have the last laugh.

Gemini’s prediction for Netherlands vs Morocco

DeepSeek was also very close. It judged that regular time would most likely be 1-1 or 0-0, that the match could drag into extra time and even penalties, and it leaned toward Morocco advancing as an upset through defense and counterattacks.

DeepSeek’s prediction for Netherlands vs Morocco

After this match, DeepSeek and Gemini immediately stood out from the pack. Gemini in particular felt less like it was making a pre-match prediction and more like it had already read the match script in advance.

Grok and Qianwen hit exact scores repeatedly, proving steadier than expected

Besides the standout performance by DeepSeek and Gemini in the Morocco match, Grok and Qianwen also made their mark. Their most notable strength is that in matches where the likely winner is relatively clear, they not only got the advancing team right but also predicted scorelines close to, or exactly matching, the final results.

South Africa vs Canada is one example. Before the match, most AI models favored Canada, but the disagreement was whether Canada would win comfortably. Grok predicted Canada to win 1-0, and Qianwen also predicted a one-goal margin. In the end, Canada really did advance with just 1 goal, not the kind of big win people might have imagined.

Qianwen’s prediction for South Africa vs Canada

Brazil vs Japan was similar. Most AI models thought Brazil was stronger, but whether Japan could keep the match tight was the key to this game. Both Grok and Qianwen predicted a 2-1 scoreline, and the match did indeed finish with Brazil escaping with a 2-1 win. What they got right wasn’t simply “Brazil will win”—it was that Japan could create enough trouble for Brazil.

In Côte d’Ivoire vs Norway, both were also pretty accurate. Norway has Haaland, so it’s not hard to understand the advancing direction, but Côte d’Ivoire’s physicality and wide-channel attacks wouldn’t make it one-sided. Both Grok and Qianwen predicted Norway to win 2-1, and the final score landed exactly within that “script.”

Grok’s prediction for Côte d’Ivoire vs Norway

The advantage of Grok and Qianwen is that they read matches with a clear favorite in finer detail. They didn’t pre-write a big script like Morocco eliminating the Netherlands, but in the matches involving Canada, Brazil, Norway, and France, they gave relatively solid predictions for both the winner and where the score would land. In other words, they may not be the best at spotting upsets, but they’re good at judging whether a favorite will cruise through or scrape by.

ChatGPT didn’t have many perfect “exact-score” picks, but its analysis of the match process was fairly accurate

ChatGPT didn’t predict ahead of time, as Gemini did, that Morocco would eliminate the Netherlands on penalties, nor did it repeatedly hit specific scorelines like Grok and Qianwen. Its strength was different: before many matches that appeared to favor the stronger team, ChatGPT was more likely to warn explicitly that the game might not be as easy as it looked.

Brazil vs Japan is a good example. ChatGPT predicted Brazil would advance, but it didn’t write the match as a comfortable Brazil rout. Instead, it pointed out that Japan’s pressure, running, and discipline would make life difficult for Brazil, and even suggested Japan might score first or equalize. Côte d’Ivoire vs Norway was similar: ChatGPT predicted Norway would advance but said in advance that it wouldn’t be an easy match, citing Côte d’Ivoire’s physicality, attacks down the flanks, and ability in transition as factors that could cause trouble.

Also, in the knockout match between England and the Democratic Republic of the Congo, ChatGPT didn’t simply say England would win big. It believed the match might be relatively dull, with the Democratic Republic of the Congo using low blocks to slow the tempo. In the end, England did advance, but they didn’t win comfortably.

ChatGPT’s prediction for England vs the Democratic Republic of the Congo

ChatGPT’s strength lies not in getting the score exactly right every time, but in often being able to say in advance where the resistance in the match will be. It’s great for understanding the match, but less suitable if all you want is a single final score. It can describe the process fairly accurately, but when it comes to writing out a major upset, it still lacks a bit of decisiveness.

Germany’s elimination: a collective “AI model crash”

If earlier matches still made it possible to see each model’s individual highlights, then Germany vs Paraguay was a full collective failure.

Before the match, all AI models sided with Germany. ChatGPT, Grok, Qianwen, Gemini, and Claude all picked Germany, and their score predictions mostly clustered around 2-0, 3-0, or 3-1. The reasoning was also consistent: everyone believed Germany had stronger “on-paper” strength, better squad depth, and more attacking firepower.

But the match went wrong. The AI models underestimated Paraguay’s ability to drag the game into a muddy, uncomfortable situation. Germany couldn’t settle the match during regular time, couldn’t break the deadlock in extra time either, and in the end they were dragged into a penalty shootout—only to be eliminated by Paraguay.

Who’s the most accurate so far?

Looking at the knockout matches that have finished so far, the different models’ characteristics are beginning to show.

DeepSeek and Gemini have the most standout moments. They don’t only predict the advancement of popular teams like Brazil and France; they also provide highly valuable answers in tougher-to-judge upset matches. In the Netherlands vs Morocco match, their most critical advantage was that they dared to write Morocco’s upset and the penalty shootout into the script ahead of time. Gemini, which directly predicted Morocco advancing on penalties, was especially impressive.

Grok and Qianwen are more like “score-type players.” They hit a number of specific scorelines, performing especially well in the matches involving Canada, Brazil, Norway, and France. But the problem is that when facing traditional powerhouses like Germany and the Netherlands, they still leaned toward the favorites.

ChatGPT and Claude are more like “analysis-type players.” Their reasoning is more complete, their directions are mostly not off, and they can also flag some extra-time risks. But the problem is that they can often tell a match won’t be easy, yet they’re not bold enough to write the conclusion on the upset side. The Netherlands vs Morocco match was like that: they clearly saw the risks of extra time and penalties, yet they ultimately still trusted the Netherlands.

So instead of rushing to ask which model understands football best, it’s better to look at what scenarios each model is best suited for.
