The ARC Prize Foundation has released the ARC-AGI-3 human performance dataset, which covers test results from 458 participants across 135 abstract reasoning environments, none of which come with gameplay instructions. Humans solved every environment, which the foundation cites as evidence that AGI has not yet been achieved. The foundation has also adjusted its scoring rules, a change that slightly raises both human and AI scores.

MeNews

2026-05-06 20:21:18

ME News Report, April 15 (UTC+8), according to Beating Monitoring, the ARC Prize Foundation has announced the human performance dataset for ARC-AGI-3. With 458 participants, it is the largest human testing study in the ARC-AGI series to date. The dataset includes 342 complete replays of human play covering the 25 public environments, all of them open-sourced. ARC-AGI-3 contains 135 abstract reasoning environments in which testers receive no gameplay instructions and must explore, infer the rules, and develop strategies on their own. The tests were conducted in person at a testing center in San Francisco. Each session lasted 90 minutes, and participants earned about $130 in base pay plus $5 for each environment they passed. All tests were run under first-attempt conditions: each person saw a given environment only once and attempted it only once, so the results measure the ability to learn and adapt when facing an entirely new problem. Humans and AI are given exactly the same information, with no information gap between them.
Core conclusion: every environment in ARC-AGI-3 was passed by humans. At least two independent participants completed each environment, and most environments were passed by more than five people. The ARC Prize Foundation states, “We have not yet achieved AGI; this dataset is evidence.”
Since the ARC-AGI-3 preview launched, the open environments have received nearly one million AI evaluation submissions. Based on this data, the foundation also announced two adjustments to its scoring rules. First, the human benchmark for each level changes from the “second-best player” to the “median player,” which reduces the influence of luck on scores. Second, the maximum score per level rises from 100% to 115%, which prevents poor performance on a single level from dragging down the overall result. The net effect of the two adjustments is a slight increase of about 0.5 percentage points in both human and AI scores. (Source: BlockBeats)
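To make the two rule changes concrete, here is a minimal Python sketch. The report does not give the foundation's actual scoring formula, so everything in it is an assumption made for illustration: that a level is scored as the ratio of the human baseline's action count to the agent's action count, that fewer actions is better, that "second-best" means the second-lowest action count, and that the overall score is a plain average across levels. The function names and the sample numbers are hypothetical and are not meant to reproduce the reported 0.5-point shift.

```python
from statistics import median


def human_baseline(action_counts, rule="median"):
    """Pick the per-level human baseline (in actions) under a given rule.

    Assumption: fewer actions is better, so the "second-best player"
    is the one with the second-lowest action count.
    """
    ordered = sorted(action_counts)
    if rule == "second_best":
        return ordered[1] if len(ordered) > 1 else ordered[0]
    if rule == "median":
        return median(ordered)
    raise ValueError(f"unknown rule: {rule}")


def level_score(agent_actions, baseline, cap=1.15):
    """Hypothetical efficiency score: baseline / agent actions, capped."""
    return min(baseline / agent_actions, cap)


def overall_score(levels, rule, cap):
    """Average the capped per-level scores (assumed aggregation)."""
    scores = [
        level_score(agent, human_baseline(humans, rule), cap)
        for agent, humans in levels
    ]
    return sum(scores) / len(scores)


# Made-up data: (agent action count, human action counts) for each level.
LEVELS = [
    (50, [40, 55, 60, 80, 120]),   # agent beats the typical human
    (100, [30, 35, 50, 70, 90]),   # agent is far less efficient
]

old_rules = overall_score(LEVELS, rule="second_best", cap=1.00)
new_rules = overall_score(LEVELS, rule="median", cap=1.15)
print(f"old rules: {old_rules:.3f}, new rules: {new_rules:.3f}")
```

In this toy data the median baseline is easier to match than the second-best one, and the 115% cap lets the strong first level offset part of the weak second level, which is the mechanism the report describes.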
