Disaster! AI traders collectively flop, losing a third of their capital in two weeks. Would retail investors still dare entrust their money to machines?

Artificial intelligence is knocking on Wall Street's door, but its first report card looks like the scene of a car crash.

Early data from a series of public trading competitions show that mainstream large language models generally perform poorly at autonomous trading: most systems lose money, trade at absurd frequencies, and make completely different decisions when given the same instructions.

The most telling case comes from the Alpha Arena competition run by tech startup Nof1. It pitted eight cutting-edge AI systems against each other—Anthropic's Claude, Google's Gemini, OpenAI's ChatGPT, xAI's Grok, and others—across four rounds. Before each round, every model was given $10,000 to trade US tech stocks autonomously for two weeks.

And the result? The combined portfolio lost about a third of its value. Of 32 model-round outcomes, only 6 were profitable. Nof1 founder Jay Azhang put it bluntly: "Now, handing money directly to large models to trade on their own just doesn't work."

The data expose multiple flaws in how current AI handles trading. Given the same prompt, Alibaba's Qwen executed 1,418 trades in a single round, while the best performer, Grok, made only 158. Notably, Grok's best result came in the round in which it could observe its competitors' performance.

The AI blog Flat Circle tracked 11 market-related arenas and found that every arena had at least one profitable model, but in only two arenas did the median model post a positive return. In other words, most models can't beat the market.

The divergence in decisions across models is an even bigger headache. Azhang explained that in the latest Alpha Arena test, Claude leaned toward long positions, Gemini was entirely comfortable shorting, and Qwen favored high-leverage bets.

Doug Clinton, head of Intelligent Alpha, which manages LLM-driven funds, said, "They each have their own personality; managing them is almost like managing a human analyst." Telling a model about its biases, however, can improve results to some extent.

Azhang pointed out that large models have advantages in research and tool use, but their trade execution is clearly lacking: they can't weigh variables such as analyst ratings, insider transactions, or sentiment shifts, so they tend to buy high and sell low, and they manage positions poorly.

Intelligent Alpha's benchmark tests offer a more positive reference point. The firm gave 10 AI models financial filings, analyst forecasts, earnings-call transcripts, macroeconomic data, and web search capabilities, asking them only to call the direction of earnings forecasts. In Q4 2025, ChatGPT's prediction accuracy reached 68%, a record. Clinton said that each new version release generally improves the models' performance.

Evaluating AI trading ability faces a fundamental methodological obstacle: traditional quantitative strategies rely on backtesting, but backtests largely fail for large models. An AI asked in 2026 how it would have traded the March 2020 market already "knows" how history played out. This lookahead bias forces researchers to rely on live trading assessments, which is why these competitions have sprung up so quickly.
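The problem is easy to demonstrate in miniature. The sketch below (using simulated prices, not any data from the article) compares an "honest" momentum strategy that sees only past prices against a strategy allowed to peek one day ahead—the same edge an LLM gets when backtested on a period inside its training data. The peeking strategy always wins, which is exactly why its backtest numbers mean nothing:

```python
import random

random.seed(0)

# Hypothetical daily prices standing in for a historical period the model
# "already knows" (e.g. March 2020, which sits inside an LLM's training data).
prices = [100.0]
for _ in range(250):
    prices.append(prices[-1] * (1 + random.gauss(0, 0.02)))

def backtest(signal):
    """Compound the return of going long (True) or staying flat (False) each day."""
    equity = 1.0
    for t in range(len(prices) - 1):
        if signal(t):
            equity *= prices[t + 1] / prices[t]
    return equity

# Lookahead strategy: peeks at tomorrow's price. It captures every up day
# and dodges every down day, so its backtest looks spectacular by construction.
lookahead = backtest(lambda t: prices[t + 1] > prices[t])

# Honest strategy: uses only information available at time t.
momentum = backtest(lambda t: t > 0 and prices[t] > prices[t - 1])

print(f"lookahead equity: {lookahead:.2f}")
print(f"momentum equity:  {momentum:.2f}")
```

No amount of historical data can distinguish memorization from skill here, which is why the arenas run live, out-of-sample trading instead.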

Jim Moran, author of the Flat Circle blog and co-founder of alternative data firm YipitData, believes most current public experiments are too short and too noisy to support definitive conclusions. The arenas also carry built-in handicaps, such as a lack of proprietary stock research and poor execution quality. He said, "If you transplant one of these AI agents from the arena directly into a top hedge fund, its performance should be better."

Alexander Izydorczyk, former head of data science at Coatue Management and now at NX1 Capital, recently wrote that none of the AI trading bots he tracks show a sustained ability to generate alpha. He believes the arenas' core limitation is that the practical quantitative techniques used by secretive trading firms are absent from the models' training data.

But he left a thought-provoking judgment: “Beginners sometimes see things that veterans miss.” He wrote on his personal blog, “When large model trading strategies really start to work, you won’t hear about it right away.”

Nof1 is preparing Season 2 of Alpha Arena, planning to give each AI model web search, longer thinking time, more data sources, and multi-step execution abilities. But the company's core business is providing retail traders with tools to build AI trading agents—not deploying AI directly into trading seats.

That positioning may itself be the most pragmatic footnote to AI's current trading abilities.

