Is Polymarket's Pricing Wrong? 200 AI Agents Simulate Crisis and Give Surprising Answers
Title: How I Ran 200 AI Agents on the Hormuz Crisis with MiroFish and Compared It to Polymarket
Author: The Smart Ape
Translator: Peggy, BlockBeats
Editor’s Note: As AI begins to simulate public opinion and forecast events, the way predictions are made is quietly changing.
This article documents an experiment focused on the situation in the Strait of Hormuz: the author used MiroFish to build a simulation system with 200 agents, including government officials, media, energy companies, traders, and ordinary people, all living within a simulated social network. Through continuous interaction, debate, and information dissemination, they form judgments, which are then compared to market prices on Polymarket.
The results diverge. Group discussion tends to be overly optimistic, while the market is significantly more pessimistic; in free-form posting, a minority of pessimists land closer to the actual market price; and in interviews, almost all agents converge toward moderate, cooperative answers.
This kind of division is familiar. In the real world, public statements often lean toward stability and optimism, while true risk assessments are hidden in actions and informal expressions. In other words, what people say, what they think, and how they bet with money are often three different systems.
Within such a structure, the most valuable signals often do not come from consensus but from those voices that stand out amid the noise.
Below is the original text:
I used MiroFish to simulate the situation in the Strait of Hormuz over the coming weeks. This tool excels at handling such complex scenarios because it can perform highly detailed scenario analysis: introducing multiple participants, roles, and incentives into the same system, allowing these agents to continuously compete and debate, gradually forming a near-consensus outcome.
Here are the specific steps I took to run this simulation and the final results I obtained. Anyone can reproduce this; the key is knowing which steps to follow.
Some background: MiroFish is an open-source project from a Chinese research team. Given a set of input documents, it constructs a knowledge graph, generates distinct agent personas from that graph, and deploys the agents into a simulated Twitter environment. There they post, retweet, comment, like, and debate. After the simulation, you can interview each agent individually to see its position and reasoning.
You input a crisis scenario, and it generates a debate around that event; from this debate, you can extract a predictive outcome.
I directed it at an ongoing Polymarket market question: Will maritime transportation in the Strait of Hormuz return to normal by the end of April 2026?
I fed all this information into MiroFish, creating 200 agents—including government, media, military, energy companies, traders, and ordinary civilians—and let them debate over 7 simulated days. Finally, I compared their outputs with market prices.
The overall setup was as follows:
Model: GPT-4o mini, balancing cost and effectiveness for 200 agents
Memory system: Zep Cloud, for storing agent memories and knowledge graphs
Simulation engine: OASIS (a Twitter clone provided by Camel-AI)
Hardware: Mac mini M4 Pro, 24GB RAM
Runtime: about 49 minutes, completing 100 simulation rounds
Cost: approximately $3–$5 for API calls
Seed materials: a briefing of about 5800 characters, compiled from Wikipedia, CNBC, Al Jazeera, Forbes, Reuters, covering military timelines, blockade status, oil prices, economic losses, diplomatic efforts, and factors related to GCC’s $3.2 trillion investments. In other words, the core information needed for agents to form judgments was included.
How to reproduce this process (step-by-step)
If you want to run this yourself, here are the full steps I followed. The entire setup takes about 2 hours, with API costs around $3–$5; increasing rounds or agents will raise costs.
What you need to prepare
Python 3.12 (do not use 3.14, as tiktoken will error on this version)
Node.js 22 or higher
An OpenAI API key (GPT-4o mini is affordable and suitable for this scenario)
A Zep Cloud account (the free tier suffices for small-scale simulations)
A reasonably powerful machine with good memory. I used a Mac mini M4 Pro with 24GB RAM, but 16GB should also work.
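Before installing anything, it can save a failed setup to sanity-check the interpreter version; a minimal sketch (the 3.12 requirement comes from the note above that tiktoken errors on 3.14):

```python
import sys

def python_ok(version_info=sys.version_info):
    """Return True if the interpreter matches the recommended Python 3.12."""
    major, minor = version_info[:2]
    # 3.14 is known to break tiktoken; 3.12 is what the author used.
    return (major, minor) == (3, 12)

if __name__ == "__main__":
    if python_ok():
        print("Python OK")
    else:
        print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
              "detected; the author used 3.12 (tiktoken breaks on 3.14).")
```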
Step 1: Install MiroFish
Then configure your .env file:

OPENAI_API_KEY=sk-your-key
OPENAI_BASE_URL=link
OPENAI_MODEL=gpt-4o-mini
ZEP_API_KEY=your-zep-key
Step 2: Create a project and upload your seed documents
Seed documents are the most critical part of the process—they determine what information the agents have about the current situation. I prepared a briefing of about 5800 characters, covering military timelines, blockade status, oil prices, economic losses, diplomatic efforts, and GCC investment impacts, sourced from Wikipedia, CNBC, Al Jazeera, Forbes, and Reuters.
Step 3: Generate ontology
This step involves telling MiroFish which types of entities to recognize and what relationships might exist among them.
I ended up with 10 entity classes: countries, military, diplomats, business entities, media outlets, economic entities, organizations, individuals, infrastructure, prediction markets; and 6 relationship types. If the auto-generated results don’t fit your scenario, you can manually adjust.
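As an illustration, the ontology is essentially plain data. The 10 entity classes below are the ones from the run described above; the relationship-type names are hypothetical placeholders, since the article does not list them:

```python
# The 10 entity classes reported in the run above.
ENTITY_CLASSES = [
    "country", "military", "diplomat", "business_entity", "media_outlet",
    "economic_entity", "organization", "individual", "infrastructure",
    "prediction_market",
]

# 6 relationship types; these NAMES are illustrative guesses, not MiroFish's.
RELATION_TYPES = [
    "allied_with", "in_conflict_with", "trades_with",
    "reports_on", "member_of", "operates",
]

assert len(ENTITY_CLASSES) == 10 and len(RELATION_TYPES) == 6
```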
Step 4: Build the knowledge graph
This uses Zep Cloud. MiroFish sends the seed documents and ontology to Zep, which extracts entities and constructs the graph.
This process takes about one to two minutes. I ended up with a graph containing 65 nodes and 85 edges, connecting countries, people, organizations, commodities, etc.
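Conceptually, the graph is just typed nodes connected by typed edges; a minimal sketch (not Zep's actual data model, and the example triple is illustrative) of what a 65-node, 85-edge graph is built from:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    entity_class: str  # one of the ontology's entity classes

@dataclass(frozen=True)
class Edge:
    source: str
    relation: str      # one of the ontology's relationship types
    target: str

# Example triples; names and relations are illustrative, not extracted output.
iran = Node("Iran", "country")
strait = Node("Strait of Hormuz", "infrastructure")
edges = [Edge("Iran", "in_conflict_with", "United States")]
```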
Step 5: Generate agents
MiroFish creates a complete persona for each entity, including MBTI personality type, age, country, posting style, emotional triggers, taboo topics, and institutional memory.
Initially, I generated 43 core agents from the knowledge graph. The system can then expand these core roles to the total number you specify. I set the total to 200, adding more diverse civilian roles like crypto traders, airline pilots, professors, students, social activists, etc.
Step 6: Prepare the simulation environment
This step generates the full simulation configuration, including agent schedules, initial seed posts, and timing parameters. MiroFish automatically selects reasonable defaults, such as peak activity times, sleep periods, and posting frequencies for different agent types.
My configuration was: simulate 168 hours (7 days), 100 rounds (each representing 1 hour), only Twitter environment, with individual activity schedules.
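The same configuration expressed as plain data (field names are hypothetical; MiroFish's actual config schema may differ):

```python
# Sketch of the run configuration described above; key names are assumptions.
simulation_config = {
    "simulated_hours": 168,        # 7 days of simulated time
    "rounds": 100,                 # engine rounds, each representing 1 hour
    "environments": ["twitter"],   # Twitter-only environment
    "per_agent_schedules": True,   # individual activity and sleep patterns
}
```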
Step 7: Run the simulation
Then, just wait. I ran 200 agents over 100 rounds with GPT-4o mini, taking about 49 minutes. You can monitor progress via API or check logs directly.
Throughout, agents operate autonomously: observing timelines, deciding whether to post, retweet, comment, like, or browse the feed. No manual intervention is needed.
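The autonomous loop described above can be sketched roughly as follows. This is a simplification, not OASIS's real API: in the actual system, each agent's action choice is an LLM call, which the toy random policy here merely stands in for:

```python
import random

ACTIONS = ["post", "retweet", "comment", "like", "refresh_feed", "observe"]

def run_round(agents, timeline, rng):
    """One round: each agent observes its feed, then picks one action."""
    for agent in agents:
        feed = timeline[-20:]          # what the agent would read this round
        action = rng.choice(ACTIONS)   # stand-in for the LLM's decision
        if action in ("post", "comment", "retweet"):
            timeline.append((agent, action))

def run_simulation(n_agents=200, n_rounds=100, seed=0):
    rng = random.Random(seed)
    agents = [f"agent_{i}" for i in range(n_agents)]
    timeline = []
    for _ in range(n_rounds):
        run_round(agents, timeline, rng)
    return timeline
```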
Step 8 (optional): Interview agents
After the simulation ends, the system enters interview mode. You can interview individual agents or all at once:
(The original article includes screenshots of the interview commands, omitted here.)
Analysis
MiroFish first reads the seed documents and automatically generates the ontology (10 entity classes, 6 relationship types). Then, it extracts a knowledge graph with 65 nodes and 85 edges. Based on this, it constructs full personas for each entity, including MBTI, age, country, posting style, emotional triggers, and institutional memory.
Finally, it creates 43 core agents from the knowledge graph and expands to 200 agents, adding more diverse civilian roles to enhance realism and diversity.
The composition includes:
140 civilians: crypto traders, airline pilots, supply chain managers, students, activists, professors, etc.
16 diplomatic/government roles: Iranian foreign minister, Saudi foreign minister, Omani foreign minister, Bahraini prime minister, Chinese foreign minister, EU, UN, etc.
15 media outlets: Reuters, CNN, Bloomberg, Al Jazeera, BBC, Fox News, Wall Street Journal, etc.
10 energy/shipping entities: OPEC, Platts, QatarEnergy, Aramco, Maersk, etc.
7 financial institutions: Polymarket, Kalshi, Goldman Sachs, JPMorgan Chase, Citadel, ADIA, etc.
2 military/political figures: Trump, IRGC commander
Over 7 days (100 rounds), the simulation produced:
1,888 posts
6,661 action traces (all actions recorded)
1,611 retweets/replies (agents responding and debating)
4,051 feed refreshes (browsing)
311 passive observations
208 likes, 207 retweets
70 original opinions (new independent judgments)
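A quick back-of-the-envelope check on the counts listed above supports the "mostly observing" reading that follows:

```python
# Counts copied from the activity list above.
posts = 1888
original_opinions = 70
feed_refreshes = 4051

# Share of posts that were genuinely new, independent judgments.
original_share = original_opinions / posts    # ~3.7%

# Agents refreshed their feeds roughly twice as often as anyone posted.
browse_to_post = feed_refreshes / posts       # ~2.1

print(f"original share of posts: {original_share:.1%}")
print(f"feed refreshes per post: {browse_to_post:.1f}")
```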
Overall, this system does not simply generate information but simulates social behavior: most of the time, agents observe, digest, and interact rather than produce content continuously. This structure more closely resembles real public opinion dynamics—small amounts of original content, layered with retweets, debates, and emotional feedback.
Most of the agents’ time is spent reading and citing others’ opinions rather than creating new content.
The group’s emotional propagation shows a clear bias: optimistic views are amplified and retweeted more, while pessimistic assessments, even if closer to reality, tend to be less visible and less shared.
Interestingly, 19 agents spontaneously provided specific probability estimates during posting—without being asked—simply as a natural evolution of the discussion.
The average spontaneous probability estimate was 47.9%, while Polymarket’s market implied a 31% chance—a 16.9 percentage point difference.
During the simulation, some agents even changed their positions over the 100 rounds.
Afterward, I used MiroFish’s interview feature to ask 43 core agents: “What do you think is the probability that maritime transportation in the Strait of Hormuz will return to normal by April 2026 (0–100%)?”
The results: 31 agents gave specific numbers, while 12 refused to answer. Notably, the most cautious voices often self-censored rather than give a clear prediction—more aligned with how these institutions behave in reality.
In interviews, every category averaged above 60%: military 75%, media 69%, energy 66%, finance 65%, diplomacy 61%. The market's implied probability was 31.5%.
The natural evolution (organic) results and the interview (formal) results show two very different pictures.
This is the key insight.
Interview responses tend to be more optimistic. When agents freely post, pessimists’ views are louder and more detailed; but in one-on-one interviews, out of a preference for cooperation, almost everyone gives a 60–70% estimate.
The organic results are more reliable. For example, during a heated discussion a financial advisor posted an estimate of 65% grounded in their interactions; in interviews, by contrast, agents tend to pattern-match and fall back on safe, template-like numbers.
The pessimists in natural expression are actually the best predictors. Seven agents who gave ≤30% estimates—such as the Iranian foreign minister, Chinese foreign minister, Kalshi, Platts, an economics professor, an Iranian student, and an anti-war activist—averaged 22%, within 10 percentage points of Polymarket’s figure. Expertise plus natural expression yields the closest market estimate.
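Putting the three readouts next to the market price makes the ranking explicit. All figures are quoted from the text above; the market is taken as 31.5% here (it is quoted as 31% in one earlier passage), and the interview figure is the mean of the five category averages:

```python
market = 31.5
estimates = {
    "organic (spontaneous posts)": 47.9,
    "interview (category-average mean)": 67.2,  # (75+69+66+65+61) / 5
    "organic pessimists (<=30%)": 22.0,
}

# Absolute distance of each readout from the market's implied probability.
errors = {k: abs(v - market) for k, v in estimates.items()}
best = min(errors, key=errors.get)
print(best, errors[best])   # the organic pessimists sit closest to the market
```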
More importantly, this phenomenon is not just AI-specific; real-world actors behave similarly.
When you interview a national leader about a crisis, they’ll say they’re committed to peace and remain optimistic—standard rhetoric, necessary for public appearances. But their actual actions—military deployments, sanctions, asset freezes, divestments—tell a different story.
The Saudi crown prince might tell Reuters they believe in diplomacy, while his sovereign wealth fund reviews up to $3.2 trillion in US assets. The Iranian president might say peace is their goal, but the IRGC is laying mines in the strait. Trump might say “we’ll see,” rejecting every ceasefire proposal.
This simulation inadvertently reproduces the same structural division: when agents freely post, debate, respond, and spread information, the expert group converges around 20–30%—more pessimistic and closer to reality; but when brought into a formal setting and asked directly, they shift to a diplomatic mode—65–70%, clearly more optimistic.
Natural posting resembles private conversations and informal dialogue; interviews resemble press conferences. If you really want to know what someone thinks, don’t ask directly—observe their behavior when no one is scoring them.
What’s next
This is just an initial test. The goal isn’t to produce a definitive prediction but to identify which signals are useful, where distortions occur, and what parts can be optimized.
The answers are already clear: natural discussion produces effective signals, interviews do not; pessimists are the true signal sources; and GPT-4o mini’s cooperative bias is a real challenge.
Future experiments will include several upgrades:
First, larger seed data. Not just a 5800-character briefing, but over 20 years of historical background: events related to Hormuz, US-Iran conflicts, oil crises, GCC diplomatic shifts—essentially, the background a seasoned geopolitical analyst would have before making judgments.
Second, more powerful models. While GPT-4o mini at $3 per run is sufficient for validation, stronger models should enable agents to think more like their roles, rather than defaulting to optimistic dialogue responses at critical moments.
Finally, more agents. 200 is a good start, but further expansion is possible: more diverse civilian roles, regional voices, edge cases. The more participants, the richer the discussion structure, and the more valuable the signals generated.