ARC-AGI-3
Explained

The first interactive reasoning benchmark — and the hardest reality check for AGI claims

Launched March 2026 · 135 novel environments · Humans: 100% · Best AI: 0.37%

The test where every frontier model scores under 1%.

ARC-AGI-3 is the third generation of the Abstraction and Reasoning Corpus, created by François Chollet and Mike Knoop's ARC Prize Foundation. It's the first fully interactive AI benchmark — instead of solving static grid puzzles, AI agents are dropped into turn-based game environments with zero instructions, zero stated goals, and no description of the rules. The agent must explore, figure out what winning looks like, build a mental model of the environment, and execute a strategy. All from scratch.

Think of it like handing someone a game controller with no tutorial. A five-year-old figures it out in minutes. The most expensive AI systems on the planet cannot. That gap — between human adaptability and AI rigidity — is exactly what ARC-AGI-3 was built to measure.

The Problem

Old benchmarks are saturated

ARC-AGI-1 is essentially solved (Gemini 3.1 Pro hits 98%). ARC-AGI-2 is approaching saturation at 84.6%. Worse, evidence suggests frontier models may be implicitly trained on ARC data — their reasoning chains reference ARC-specific colour mappings without being told. The signal was dying.

The Shift

Static puzzles → interactive environments

ARC-AGI-3 replaces image-in/image-out grids with 135 handcrafted turn-based games across 1,000+ levels. Each game has 8–10 levels that progressively introduce new mechanics. Agents observe a 64×64 grid with 16 colours, take actions, and must learn on the fly.

The Result

A measurable intelligence gap

For the first time, there's a direct, quantitative comparison of human vs. AI learning efficiency on novel tasks. Not memorisation. Not pattern matching. Actual adaptive reasoning measured across time. The gap is enormous — and that's the point.

👁️ Perceive: see the grid state
🔍 Explore: try actions, observe results
🧠 Model: build a world model and goals
🎯 Execute: plan and act efficiently
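The perceive–explore–model–execute loop above can be sketched as a minimal agent skeleton. The environment interface below (`reset`, `step`, a small discrete action set) is hypothetical — the official ARC-AGI-3 API may differ — but it captures the shape of the problem: a 64×64 grid of 16 colour indices in, an action out, and no reward signal or goal description.

```python
import random

GRID = 64             # observation is a 64x64 grid...
COLOURS = 16          # ...of integer colour indices 0-15
ACTIONS = range(6)    # hypothetical discrete action set

class ToyEnv:
    """Stand-in environment with the same observation/action shape."""
    def reset(self):
        self.grid = [[0] * GRID for _ in range(GRID)]
        return self.grid

    def step(self, action):
        # A real environment would mutate the grid by its hidden rules;
        # here we just cycle one cell so the agent sees *something* change.
        cell = self.grid[0][action % GRID]
        self.grid[0][action % GRID] = (cell + 1) % COLOURS
        return self.grid, False  # (next observation, level_complete)

def explore(env, budget=50):
    """Perceive -> explore -> (toy) model: count which actions change the grid."""
    obs = env.reset()
    effects = {a: 0 for a in ACTIONS}
    for _ in range(budget):
        action = random.choice(list(ACTIONS))
        before = [row[:] for row in obs]   # snapshot before acting
        obs, done = env.step(action)
        if obs != before:
            effects[action] += 1           # crude world-model signal
        if done:
            break
    return effects

print(explore(ToyEnv()))
```

The point of the sketch is what's missing: there is no `reward`, no goal string, no rules documentation. Everything the agent learns has to come from comparing observations before and after its own actions.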

The AGI reality check the industry needed.

In the same week that Jensen Huang declared AGI "already achieved" on Lex Fridman's podcast, ARC-AGI-3 dropped — and every frontier model scored below 1%. The benchmark arrives at a moment when the gap between AGI claims and demonstrable capability has never been wider. Labs are racing to declare victory while the simplest test of genuine adaptability exposes how far there is to go.

Human solve rate (no instructions): 100%
Every frontier model tested: <1%
Total prize pool (ARC Prize 2026): $2M
Human players in preview testing: 1,200+

But isn't this just moving the goalposts?

Fair question. ARC-AGI-1 was released in 2019 and took five years to approach saturation. ARC-AGI-2 saw scores jump from 3% to 84.6% in under a year. These weren't goalpost shifts — they were benchmarks doing their job: tracking real capability gains, then getting replaced when they lost signal. ARC-AGI-3 is the next iteration, designed around the specific gap that remains: interactive, adaptive reasoning in truly novel environments. It measures the thing nobody has cracked yet.

Four pillars of agentic intelligence.

ARC-AGI-3 tests four capabilities that the technical paper identifies as necessary for general intelligence. These aren't arbitrary — they map directly to what humans do naturally when encountering something unfamiliar.

Pillar 01

Modelling

Turning raw observations into a generalisable world model that can predict future states and outcomes. Inherited from ARC-AGI-1 and 2, but now applied dynamically — the model must update as the environment reveals new mechanics across levels.

Pillar 02

Goal-Setting

Identifying desirable future states without explicit instructions. There are no win conditions displayed. The agent must independently determine what to target based on environmental cues and its own evolving understanding of the game.

Pillar 03

Planning & Execution

Mapping an action path from the current state to the identified goal, with the ability to course-correct based on environmental feedback. This requires both initial strategic accuracy and the agility to adapt when things don't go as expected.

Pillar 04

Efficiency (RHAE Scoring)

Intelligence is measured as action efficiency, not just task completion. The formula: (human actions / AI actions)². An AI taking 10× as many steps as a human scores just 1%. This squared penalty makes brute-force approaches mathematically unviable and directly measures learning speed for the first time.

Why square the ratio?

The squared penalty is deliberate. If a human solves a game in 10 steps and an AI takes 100, a linear metric would give the AI 10%. The squared formula gives it 1%. At 200 steps: 0.25%. At 500 steps: 0.04%. This design choice forces systems to actually learn the environment's logic rather than systematically trying every possible action. Being faster than the human earns no bonus — the per-level score caps at 1.0. Later levels carry more weight because they require deeper understanding.
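The per-level formula above can be written down directly. The function name here is illustrative (and the level weighting mentioned in the text is omitted, since the exact weights aren't given), but the numbers match the worked examples:

```python
def level_score(human_actions: int, ai_actions: int) -> float:
    """Per-level efficiency score: squared human/AI action ratio, capped at 1.0."""
    return min(1.0, (human_actions / ai_actions) ** 2)

# Worked examples from the text (human baseline: 10 actions)
print(f"{level_score(10, 100):.2%}")  # 1.00%
print(f"{level_score(10, 200):.2%}")  # 0.25%
print(f"{level_score(10, 500):.2%}")  # 0.04%
print(f"{level_score(10, 5):.2%}")    # 100.00% — no bonus for beating the human
```

The cap at 1.0 is what makes the metric about learning rather than speed: an agent that solves a level in fewer actions than the human baseline gains nothing extra, while every wasted exploratory action is punished quadratically.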

Frontier models vs. humans: the numbers.

The official leaderboard tests models via API with an identical system prompt — no task-specific scaffolding, no custom tooling. This is deliberate. If you need human-engineered workarounds for every new task, that's not general intelligence.

OFFICIAL BENCHMARK SCORES — MARCH 2026

Humans (no instructions) 100%
StochasticGoose — CNN + RL (preview winner) 12.58%
Gemini 3.1 Pro Preview 0.37%
GPT-5.4 High 0.26%
Claude Opus 4.6 0.25%
Grok-4.20 0.00%

Frontier LLM scores are from official API evaluation with identical system prompts. The preview winner used CNN + reinforcement learning, not an LLM.

The scaffolding problem

Duke University built a custom harness that pushed Opus 4.6 to 97.1% on one known environment (TR87). On an unfamiliar environment (BP35): 0%. The intelligence lived in the human-built scaffolding, not the model. Chollet's argument: if the model needs task-specific human engineering to function, you're measuring the human's intelligence, not the AI's.

Non-LLM approaches win

The top three preview entries were all non-LLM solutions — CNN-based, rule-based state graph exploration, and frame graph search without any training. A simple CNN agent outperformed GPT-5.4 by more than 12 percentage points. This suggests the path to ARC-AGI-3 runs through novel algorithmic ideas, not bigger language models.

What people are actually arguing about.

ARC-AGI-3 has landed in a highly polarised discourse. Here are the core fault lines — and where both sides have a point.

Methodology

"The scoring is designed to produce low numbers"

Critics point out the squared efficiency penalty, the human baseline calibration, and the exclusion of extended-thinking models. The counter: if you need 10× more actions than a human who has never seen the task, efficiency isn't an unfair metric — it's the entire point.

Scaffolding

"Models just need better prompts and tools"

The 97.1% → 0% scaffolding result is the strongest evidence against this. Task-specific engineering works on known tasks but fails completely on novel ones. Chollet's position: if the 'G' in AGI means anything, the system shouldn't need a human hand-holding it through every new problem.

Architecture

"Can transformers ever do this?"

NYU professor Saining Xie argues LLMs are inherently limited because they learn entirely from human-generated text rather than from raw experience. The models that crack ARC-AGI-3 may need to be a fundamentally different kind of system — one that learns by doing, not by reading.

Human Baseline

"100% is misleading"

Sceptics note the baseline uses the second-best of ten first-time players — near the top of the distribution, not the average person. If the baseline were set to the median, the gap would still be large, but the headline framing would be more honest. The ARC team chose a strong baseline deliberately.

Data Contamination

"Old benchmarks were polluted"

Evidence from Gemini 3's reasoning chain — correctly referencing ARC's integer-to-colour mapping without being told — suggests models have been implicitly trained on ARC data. ARC-AGI-3's interactive format makes contamination structurally harder. You can't pre-train on an environment you explore in real time.

Pace of Progress

"Give it six months"

Labs pushed ARC-AGI-2 from 3% to 84.6% in under a year. The counterargument: those gains were largely from scaffolding and scale, which ARC-AGI-3 is specifically designed to neutralise. Whether frontier labs can climb this ladder the same way is genuinely open.

From grid puzzles to interactive worlds.

ARC has tracked — and accurately predicted — every major shift in AI reasoning capability since 2019. Here's the trajectory.

2019

ARC-AGI-1 launches

Chollet releases the original Abstraction and Reasoning Corpus alongside "On the Measure of Intelligence." Static grid puzzles, image-in/image-out. Early AI systems score near 0%. The first Kaggle competition (2020) draws 913 teams; the winner hits ~20% using brute-force program search.

2024

ARC Prize 2024 & the reasoning breakthrough

ARC Prize Foundation launches with $1M+ in prizes and 1,430 teams. Test-time training reaches 53.5% on ARC-AGI-1. OpenAI's o3 demonstrates that large reasoning models can exhibit genuine fluid intelligence on the benchmark — not just pattern matching.

2025

ARC-AGI-2 and rapid saturation

Version 2 launches with harder compositional puzzles. NVIDIA's NVARC team wins first place at 24%. But scaffolding and scale push scores to 84.6% within months. Evidence of data contamination emerges. The benchmark begins losing scientific signal.

MAR 2026

ARC-AGI-3 resets the scoreboard

First fully interactive benchmark. 135 handcrafted environments, 1,000+ levels. In-house game studio. $2M prize pool across two tracks. Launched at Y Combinator with a fireside chat between Chollet and Sam Altman. Every frontier model scores under 1%. A CNN-based agent leads at 12.58%.

Should you care about ARC-AGI-3?

ARC-AGI-3 isn't relevant to every AI project. Here's an honest breakdown of who should be paying attention — and who can safely ignore it.

✓ Pay attention if

You're building autonomous agents that need to handle genuinely novel situations without human handholding — warehouse robotics, adaptive game AI, exploration systems.

You're evaluating AI vendor claims about "AGI" capabilities and need a grounded reference point for what's actually been demonstrated.

You're doing research in reinforcement learning, world models, or program synthesis — ARC-AGI-3's format maps directly to open problems in your field.

You care about the architectural question: whether transformers/LRMs can achieve genuine adaptability or whether fundamentally new approaches are needed.

✗ Safely ignore if

You're shipping products that use AI for well-defined tasks — coding assistants, summarisation, content generation. LLMs are excellent at these. ARC-AGI-3 doesn't change that.

You're using AI as a tool within human-designed workflows. The scaffolding critique doesn't invalidate productive human–AI collaboration — it clarifies what counts as autonomy vs. automation.

Your timeline is this quarter, not this decade. ARC-AGI-3 is about long-term capability trajectories. Current AI is already delivering massive value in narrow, well-defined domains.

You're conflating "useful" with "intelligent." AI doesn't need to be AGI to transform your business. Most of the value comes from applied intelligence, not general intelligence.

How it fits. What to consider.

Key considerations

ARC-AGI-3 is the most important AI benchmark of 2026 — not because it humbles frontier models, but because it clarifies the conversation. The era of arguing whether AI is smart is over. The question is what kind of smart, and whether current architectures can get to genuine adaptability. For businesses, the practical takeaway is simpler: AI is extraordinarily capable within structured workflows. Don't wait for AGI to capture that value. But also don't confuse product-market fit with scientific progress. They're different races.

Enterprise

Pressure-test vendor claims

When a vendor says "AGI-level," ask what they score on ARC-AGI-3. Not because sub-1% means their product is useless — it almost certainly isn't — but because it tells you whether you're buying narrow capability dressed up as general intelligence. That distinction matters for long-term architecture decisions.

Studio

Design for the scaffolding reality

The 97.1% → 0% scaffolding result is a direct lesson for product teams. Good scaffolding ships useful systems today — that's not the problem. The problem is assuming it will generalise. Build for graceful degradation when the model hits something genuinely novel.
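One way to read "graceful degradation" in code is a wrapper that routes likely-novel inputs to a safe fallback instead of letting the model improvise. Everything here is hypothetical — the confidence threshold is a stand-in for whatever novelty signal a real system has (retrieval distance, self-reported uncertainty, validation failures) — but it shows the shape of the pattern:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelResult:
    answer: str
    confidence: float  # stand-in for any novelty/uncertainty signal

def with_fallback(model: Callable[[str], ModelResult],
                  fallback: Callable[[str], str],
                  threshold: float = 0.7) -> Callable[[str], str]:
    """Route low-confidence (likely-novel) tasks to a safe fallback path."""
    def handle(task: str) -> str:
        result = model(task)
        if result.confidence >= threshold:
            return result.answer
        return fallback(task)  # e.g. escalate to a human or a rule-based path
    return handle

# Toy usage: a 'model' that is only confident on tasks it has seen before.
seen = {"resize image", "summarise text"}
model = lambda t: ModelResult(f"auto:{t}", 0.9 if t in seen else 0.2)
handle = with_fallback(model, fallback=lambda t: f"escalate:{t}")
print(handle("summarise text"))  # auto:summarise text
print(handle("novel task"))      # escalate:novel task
```

The design choice is the explicit fallback path: the system's behaviour on novel inputs is specified by humans, rather than left to a model operating outside the distribution its scaffolding was tuned for.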

Dojo

Play the games yourself

Seriously. The fastest way to understand what AI can and can't do right now is to play the public ARC-AGI-3 environments and then watch the replay of a frontier model attempting the same thing. The gap becomes visceral in a way no benchmark number can convey.

Go deeper. Play the games.

Sources & References

ARC Prize Foundation — Launch Blog · ARC-AGI-3 Technical Report · RHAE Scoring Methodology · ARC Prize 2025 Results · The Decoder — ARC-AGI-3 Analysis · Fast Company — ARC-AGI-3 Coverage · Decrypt — Benchmark Analysis

Content validated March 2026. ARC-AGI is a project of the ARC Prize Foundation, co-founded by François Chollet and Mike Knoop. This is an independent educational explainer by Imbila.AI.