The first interactive reasoning benchmark — and the hardest reality check for AGI claims
ARC-AGI-3 is the third generation of the Abstraction and Reasoning Corpus, created by François Chollet and Mike Knoop's ARC Prize Foundation. It's the first fully interactive AI benchmark — instead of solving static grid puzzles, AI agents are dropped into turn-based game environments with zero instructions, zero stated goals, and no description of the rules. The agent must explore, figure out what winning looks like, build a mental model of the environment, and execute a strategy. All from scratch.
Think of it like handing someone a game controller with no tutorial. A five-year-old figures it out in minutes. The most expensive AI systems on the planet cannot. That gap — between human adaptability and AI rigidity — is exactly what ARC-AGI-3 was built to measure.
ARC-AGI-1 is essentially solved (Gemini 3.1 Pro hits 98%). ARC-AGI-2 is approaching saturation at 84.6%. Worse, evidence suggests frontier models may be implicitly trained on ARC data — their reasoning chains reference ARC-specific colour mappings without being told. The signal was dying.
ARC-AGI-3 replaces image-in/image-out grids with 135 handcrafted turn-based games across 1,000+ levels. Each game has 8–10 levels that progressively introduce new mechanics. Agents observe a 64×64 grid with 16 colours, take actions, and must learn on the fly.
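The interaction loop is simple to picture, even if the games are not. Below is a minimal sketch in Python, assuming a hypothetical environment interface: `ToyEnv`, `reset`, `step`, and the action names are illustrative stand-ins, not the official ARC-AGI-3 API.

```python
import random

# Hypothetical discrete action space; the real benchmark exposes a
# small fixed set of actions (names here are assumptions).
ACTIONS = ["up", "down", "left", "right", "select"]

class ToyEnv:
    """Stand-in for an ARC-AGI-3 game: a 64x64 grid of 16 colours.

    This toy 'game' ends when the agent happens to press 'select';
    real games hide far richer mechanics behind the same interface.
    """
    def reset(self):
        self.grid = [[0] * 64 for _ in range(64)]  # the only observation
        return self.grid

    def step(self, action):
        done = action == "select"   # hidden win condition, never stated
        return self.grid, done

class RandomAgent:
    """Floor baseline: no world model, no goal, pure guessing."""
    def choose(self, observation):
        # observation is a 64x64 grid of ints in range(16); ignored here
        return random.choice(ACTIONS)

def play_level(env, agent, max_actions=1000):
    """Run one level; return the number of actions used, or None."""
    obs = env.reset()               # no instructions, no stated goal
    for n in range(1, max_actions + 1):
        obs, done = env.step(agent.choose(obs))
        if done:                    # the agent must infer what winning was
            return n
    return None
```

The point of the sketch is the information flow: the only feedback an agent ever receives is the next frame and a done flag. Goals, rules, and win conditions all have to be inferred.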
For the first time, there's a direct, quantitative comparison of human vs. AI learning efficiency on novel tasks. Not memorisation. Not pattern matching. Actual adaptive reasoning, measured over time. The gap is enormous — and that's the point.
In the same week that Jensen Huang declared AGI "already achieved" on Lex Fridman's podcast, ARC-AGI-3 dropped — and every frontier model scored below 1%. The benchmark arrives at a moment when the gap between AGI claims and demonstrable capability has never been wider. Labs are racing to declare victory while the simplest test of genuine adaptability exposes how far there is to go.
Fair question. ARC-AGI-1 was released in 2019 and took five years to approach saturation. ARC-AGI-2 saw scores jump from 3% to 84.6% in under a year. These weren't goalpost shifts — they were benchmarks doing their job: tracking real capability gains, then getting replaced when they lost signal. ARC-AGI-3 is the next iteration, designed around the specific gap that remains: interactive, adaptive reasoning in truly novel environments. It measures the thing nobody has cracked yet.
ARC-AGI-3 tests four capabilities that the technical paper identifies as necessary for general intelligence. These aren't arbitrary — they map directly to what humans do naturally when encountering something unfamiliar.
Turning raw observations into a generalisable world model that can predict future states and outcomes. Inherited from ARC-AGI-1 and 2, but now applied dynamically — the model must update as the environment reveals new mechanics across levels.
Identifying desirable future states without explicit instructions. There are no win conditions displayed. The agent must independently determine what to target based on environmental cues and its own evolving understanding of the game.
Mapping an action path from the current state to the identified goal, with the ability to course-correct based on environmental feedback. This requires both initial strategic accuracy and the agility to adapt when things don't go as expected.
Intelligence is measured as action efficiency, not just task completion. The formula: (human actions / AI actions)². An AI taking 10× as many steps as a human scores just 1%. This squared penalty makes brute-force approaches mathematically unviable and directly measures learning speed for the first time.
The squared penalty is deliberate. If a human solves a game in 10 steps and an AI takes 100, a linear metric would give the AI 10%. The squared formula gives it 1%. At 200 steps: 0.25%. At 500 steps: 0.04%. This design choice forces systems to actually learn the environment's logic rather than systematically trying every possible action. Being faster than the human earns no bonus — the per-level score caps at 1.0. Later levels carry more weight because they require deeper understanding.
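The arithmetic above can be checked directly. Here is a minimal sketch of the per-level score as described; the cross-level weighting in `game_score` is an assumption standing in for the official RHAE weighting, which gives later levels more weight.

```python
def level_score(human_actions, ai_actions):
    """Per-level efficiency score: (human / AI)^2, capped at 1.0.

    Beating the human baseline earns no bonus, and the squared
    penalty collapses brute-force play toward zero.
    """
    if ai_actions <= 0:
        raise ValueError("ai_actions must be positive")
    return min((human_actions / ai_actions) ** 2, 1.0)

# The worked examples from the text (rounded to dodge float noise):
assert level_score(10, 10) == 1.0                 # match the human
assert round(level_score(10, 100), 6) == 0.01     # 10x more actions: 1%
assert round(level_score(10, 200), 6) == 0.0025   # 0.25%
assert round(level_score(10, 500), 6) == 0.0004   # 0.04%
assert level_score(10, 5) == 1.0                  # faster than human: capped

def game_score(level_scores, weights=None):
    """Weighted mean across a game's levels.

    Linearly increasing weights are an assumed stand-in for the
    official scheme, which weights later levels more heavily.
    """
    weights = weights or list(range(1, len(level_scores) + 1))
    return sum(w * s for w, s in zip(weights, level_scores)) / sum(weights)
```

The cap at 1.0 encodes "being faster than the human earns no bonus", and the square is what turns a 10x action gap into a 100x score gap.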
The official leaderboard tests models via API with an identical system prompt — no task-specific scaffolding, no custom tooling. This is deliberate. If you need human-engineered workarounds for every new task, that's not general intelligence.
Frontier LLM scores are from official API evaluation with identical system prompts. Preview winner used CNN + reinforcement learning, not an LLM. Score bars for sub-1% models are exaggerated for visibility.
Duke University built a custom harness that pushed Opus 4.6 to 97.1% on one known environment (TR87). On an unfamiliar environment (BP35): 0%. The intelligence lived in the human-built scaffolding, not the model. Chollet's argument: if the model needs task-specific human engineering to function, you're measuring the human's intelligence, not the AI's.
The top three preview entries were all non-LLM solutions — CNN-based, rule-based state graph exploration, and frame graph search without any training. A simple CNN agent outperformed GPT-5.4 by more than 12 percentage points. This suggests the path to cracking ARC-AGI-3 runs through novel algorithmic ideas rather than bigger language models.
ARC-AGI-3 has landed in a highly polarised discourse. Here are the core fault lines — and where both sides have a point.
Critics point to the squared efficiency penalty, the calibration of the human baseline, and the exclusion of extended-thinking models as design flaws. The counter: if you need 10× more actions than a human who has never seen the task, efficiency isn't an unfair metric; it's the entire point.
The 97.1% → 0% scaffolding result is the strongest evidence against this. Task-specific engineering works on known tasks but fails completely on novel ones. Chollet's position: if the 'G' in AGI means anything, the system shouldn't need a human hand-holding it through every new problem.
NYU professor Saining Xie argues LLMs are inherently limited because they learn entirely from human-generated text rather than from raw experience. The models that crack ARC-AGI-3 may need to be a fundamentally different kind of system — one that learns by doing, not by reading.
Sceptics note the baseline uses the second-best of ten first-time players — near the top of the distribution, not the average person. If the baseline were set to the median, the gap would still be large, but the headline framing would be more honest. The ARC team chose a strong baseline deliberately.
Evidence from Gemini 3's reasoning chain — correctly referencing ARC's integer-to-colour mapping without being told — suggests models have been implicitly trained on ARC data. ARC-AGI-3's interactive format makes contamination structurally harder. You can't pre-train on an environment you explore in real time.
Labs pushed ARC-AGI-2 from 3% to 84.6% in under a year. The counterargument: those gains were largely from scaffolding and scale, which ARC-AGI-3 is specifically designed to neutralise. Whether frontier labs can climb this ladder the same way is genuinely open.
ARC has tracked — and accurately predicted — every major shift in AI reasoning capability since 2019. Here's the trajectory.
Chollet releases the original Abstraction and Reasoning Corpus alongside "On the Measure of Intelligence." Static grid puzzles, image-in/image-out. Early AI systems score near 0%. The first Kaggle competition (2020) draws 913 teams; the winner hits ~20% using brute-force program search.
ARC Prize Foundation launches with $1M+ in prizes and 1,430 teams. Test-time training reaches 53.5% on ARC-AGI-1. OpenAI's o3 demonstrates that large reasoning models can exhibit genuine fluid intelligence on the benchmark — not just pattern matching.
Version 2 launches with harder compositional puzzles. NVIDIA's NVARC team wins first place at 24%. But scaffolding and scale push scores to 84.6% within months. Evidence of data contamination emerges. The benchmark begins losing scientific signal.
First fully interactive benchmark. 135 handcrafted environments, 1,000+ levels. In-house game studio. $2M prize pool across two tracks. Launched at Y Combinator with a fireside chat between Chollet and Sam Altman. Every frontier model scores under 1%. A CNN-based agent leads at 12.58%.
ARC-AGI-3 isn't relevant to every AI project. Here's an honest breakdown of who should be paying attention — and who can safely ignore it.
You're building autonomous agents that need to handle genuinely novel situations without human handholding — warehouse robotics, adaptive game AI, exploration systems.
You're evaluating AI vendor claims about "AGI" capabilities and need a grounded reference point for what's actually been demonstrated.
You're doing research in reinforcement learning, world models, or program synthesis — ARC-AGI-3's format maps directly to open problems in your field.
You care about the architectural question: whether transformers/LRMs can achieve genuine adaptability or whether fundamentally new approaches are needed.
You're shipping products that use AI for well-defined tasks — coding assistants, summarisation, content generation. LLMs are excellent at these. ARC-AGI-3 doesn't change that.
You're using AI as a tool within human-designed workflows. The scaffolding critique doesn't invalidate productive human–AI collaboration — it clarifies what counts as autonomy vs. automation.
Your timeline is this quarter, not this decade. ARC-AGI-3 is about long-term capability trajectories. Current AI is already delivering massive value in narrow, well-defined domains.
You're conflating "useful" with "intelligent." AI doesn't need to be AGI to transform your business. Most of the value comes from applied intelligence, not general intelligence.
ARC-AGI-3 is the most important AI benchmark of 2026 — not because it humbles frontier models, but because it clarifies the conversation. The era of arguing whether AI is smart is over. The question is what kind of smart, and whether current architectures can get to genuine adaptability. For businesses, the practical takeaway is simpler: AI is extraordinarily capable within structured workflows. Don't wait for AGI to capture that value. But also don't confuse product-market fit with scientific progress. They're different races.
When a vendor says "AGI-level," ask what they score on ARC-AGI-3. Not because sub-1% means their product is useless — it almost certainly isn't — but because it tells you whether you're buying narrow capability dressed up as general intelligence. That distinction matters for long-term architecture decisions.
The 97.1% → 0% scaffolding result is a direct lesson for product teams. Good scaffolding ships useful systems today — that's not the problem. The problem is assuming it will generalise. Build for graceful degradation when the model hits something genuinely novel.
Seriously. The fastest way to understand what AI can and can't do right now is to play the public ARC-AGI-3 environments ↗ and then watch the replay of a frontier model attempting the same thing. The gap becomes visceral in a way no benchmark number can convey.
ARC-AGI-3 — Play the games yourself ↗
Launch announcement — Official blog post ↗
Technical paper — Full methodology and results ↗
RHAE scoring docs — How scores are calculated ↗
Leaderboard — Live scores and cost analysis ↗
Kaggle competition — $2M prize pool ↗
ARC Prize 2025 results — Context for v3 ↗
ARC Prize Foundation — Launch Blog · ARC-AGI-3 Technical Report · RHAE Scoring Methodology · ARC Prize 2025 Results · The Decoder — ARC-AGI-3 Analysis · Fast Company — ARC-AGI-3 Coverage · Decrypt — Benchmark Analysis
Content validated March 2026. ARC-AGI is a project of the ARC Prize Foundation, co-founded by François Chollet and Mike Knoop. This is an independent educational explainer by Imbila.AI.