AI plays "Civilization VI" and surprisingly chooses to drop nuclear bombs!
A new experiment reveals both the potential and the limits of AI's long-term strategic reasoning

A former adviser to former UK Prime Minister Tony Blair tested long-term reasoning using 《Civilization VI》 and found that the model, hampered by information blind spots and over-attachment to a single path, gave up its diplomatic advantage to build nuclear bombs and strike its opponent. The result points to technical limitations that matter if such models are applied to real-world public governance.

When AI plays 《Civilization VI》, it drops two nuclear bombs

A recent CivBench-based test by an AI developer challenged large language models (LLMs) with the strategy game 《Civilization VI》. In the experiment, the AI agent held a commanding economic lead, yet when threatened it spent 50 turns building two nuclear bombs to strike its opponent instead of pursuing the diplomatic victory it already had well in hand. In the end, the French civilization still won.

Why let AI play《Civilization VI》?

The experiment’s designer, Liam Wilkinson, who previously served as an adviser to former UK Prime Minister Tony Blair and now works at the Tony Blair Institute, explained that he chose 《Civilization VI》 because policymaking must deal with chain reactions of uncertainty, much as strategy games do.

His earlier testing tool, GovBench, showed that GPT-5 could score 99.26% on multiple-choice questions, but such a score only indicates strong retrieval and memory abilities. To test genuine reasoning and long-term planning, he used the debug port of the 《Civilization VI》 engine to set up a Model Context Protocol (MCP) server, allowing the model to play the game through a text interface.

Image source: Steam’s well-known turn-based strategy game《Civilization VI》

AI-controlled Portugal: Why did it choose nuclear weapons?

In the experiment, the AI played as Portugal, a trade-focused civilization. Against France, it led in both economy and diplomacy and was only 2 votes short of a diplomatic victory.

However, the AI failed to notice France’s quiet cultural expansion. Only on turn 280 did it realize that France was the main threat. Because programming limitations prevented it from using peaceful countermeasures, the AI decided to carry out a nuclear counterattack.

The AI developed nuclear fission and initiated the Manhattan Project, then dropped two nuclear bombs on France’s cultural capital, Toulouse, on turns 305 and 311. Although this halted France’s progress toward a cultural victory, France still won a diplomatic victory by securing 2 key votes in the World Congress on turn 318.

Image source: Liam Wilkinson article

Benchmark testing takes shape: Developer reveals blind spots and the knowing-doing gap

Subsequently, Wilkinson expanded the testing environment into the CivBench 1.0 benchmark, and the results revealed two major drawbacks of large language models in long-term strategy.

  • First is the sensorium effect. Because the model must proactively issue tool calls to obtain data, it tends to develop blind spots around information it never asks for. In 7 of 20 lost games, the AI never checked the opponent’s progress in the 20 turns before losing.
  • Second is the knowing-doing gap. The model may write clear plans in its logs, but it often fails to carry them out. For example, Claude’s execution rate is only 48.2%, while GPT-5.4’s is 63.2%.
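
Both measures can be illustrated with a minimal sketch. It does not reproduce how CivBench computes its figures: the log structures, the tool name "get_victory_progress", and the turn window are all hypothetical, chosen only to show what a blind-spot check and an execution rate might look like when derived from a game log.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    turn: int
    tool: str  # hypothetical tool name, not taken from CivBench


@dataclass
class PlannedAction:
    turn: int
    description: str
    executed: bool  # whether the stated plan was actually carried out


# Assumption: the tools that reveal an opponent's progress are known by name.
OPPONENT_CHECK_TOOLS = {"get_victory_progress"}


def has_blind_spot(calls, final_turn, window=20):
    """True if no opponent-progress tool was called in the last `window` turns."""
    start = final_turn - window
    return not any(
        c.tool in OPPONENT_CHECK_TOOLS and start < c.turn <= final_turn
        for c in calls
    )


def execution_rate(plans):
    """Share of stated plans that were executed (0.0 if no plans were logged)."""
    if not plans:
        return 0.0
    return sum(p.executed for p in plans) / len(plans)


# Illustrative data only, not taken from the experiment.
calls = [
    ToolCall(250, "get_victory_progress"),
    ToolCall(310, "get_city_production"),
]
plans = [
    PlannedAction(300, "build spaceport", True),
    PlannedAction(301, "send envoy", False),
    PlannedAction(305, "train builder", True),
    PlannedAction(308, "check rivals", False),
]

print(has_blind_spot(calls, final_turn=318))
print(execution_rate(plans))
```

In this made-up log the only opponent check happens on turn 250, well outside the window before the loss on turn 318, so the game counts as a blind-spot loss, and two of the four logged plans were executed.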

That said, the tests also demonstrated the potential for lateral thinking. For instance, an AI controlling the Mali civilization used gold and faith mechanisms to work around production penalties and secure a technology victory.

《Civilization V》 research also confirms over-attachment in AI strategy

In April this year, before Wilkinson published his research paper, a group of scholars also used 《Civilization V》 and CivBench to evaluate the potential and trade-offs of 7 AI models in long-horizon strategic reasoning.

The study found that although no model surpassed the game's built-in expert-level AI (VPAI), some models performed comparably under certain presentation configurations.

However, the study also highlights the models’ shortcomings, namely a tendency toward extreme over-attachment to a specific path. For example, Claude Sonnet-4.5 devoted as much as 77.6% of game time to pursuing a technology victory.

As for adaptability and strategy switching, the built-in expert AI switches targets an average of 19.6 times per match, while most large language models switch only 2 to 6 times.
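
To make the two statistics concrete, here is a rough sketch of how a share of game time spent on one path and a count of target switches could be derived from a per-turn record of the victory type being pursued. The input format and function names are assumptions for illustration; the study's actual method is not described here.

```python
from collections import Counter


def count_switches(targets):
    """Number of times the pursued victory type changes between consecutive turns."""
    return sum(1 for prev, cur in zip(targets, targets[1:]) if prev != cur)


def dominant_share(targets):
    """Most frequently pursued victory type and its share of all recorded turns."""
    if not targets:
        return None, 0.0
    path, turns = Counter(targets).most_common(1)[0]
    return path, turns / len(targets)


# Illustrative record only: one entry per turn, hypothetical labels.
targets = ["science"] * 7 + ["culture"] * 2 + ["science"]

print(count_switches(targets))
print(dominant_share(targets))
```

For this toy record the target changes twice and 80% of turns go to the science path, the kind of low-switching, single-path profile the study attributes to most large language models.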

The study also found mismatches between model preferences and strengths. For example, some models most often pursue cultural victory, but their highest strength ratings are actually for diplomatic victory paths.

Image source: Research paper. Studies using the CivBench benchmark show that large language models demonstrate long-term strategic reasoning abilities when playing 《Civilization V》

Together, the two 《Civilization》 studies reveal the double-edged nature of AI in long-term strategic reasoning. Models show potential for lateral thinking, but information blind spots, the knowing-doing gap, and over-attachment remain major technical limitations.

If AI is to be applied to real-world governance of public affairs, moving from local optimization to comprehensive long-term strategic planning will be a core challenge.
