
Gemini 3.1 Pro and the Downfall of Benchmarks

Gemini 3.1 Pro is, by many measures, the most capable AI model ever released. It tops 13 out of 16 industry benchmarks. It more than doubled its predecessor's score on ARC-AGI-2. It dominates coding, science, and agentic tasks.

And yet, if you ask people which model is "best," you'll get five different answers. That's not because people are confused. It's because benchmarks are broken.

The Benchmark Crisis

Here's what changed: post-training now eats roughly 80% of total compute. The pre-training phase—where a model ingests the internet and learns general patterns—used to be the whole game. Now it's just the foundation. The real differentiation happens after, when models get fine-tuned, RLHF'd, and optimized for specific domains.

This means a model can be surgically optimized to crush a particular benchmark without becoming generally smarter. Excelling at coding doesn't mean excelling at creative writing. Dominating science questions doesn't mean dominating real-world enterprise tasks. The correlation between "high benchmark score" and "better model" has quietly decoupled.

The chess example makes this visceral: Opus 4.6 scores around 10% on chess puzzles, while the older Sonnet 4.5 hits 12%. Nobody in their right mind would argue Sonnet 4.5 is a better model than Opus 4.6. But if you judged purely by that benchmark, you'd reach exactly that conclusion. General reasoning ability is no longer uniform across domains—it's lumpy, uneven, and shaped by where training compute was allocated.

Gemini 3.1 Pro: Where It Shines

Credit where it's due—Gemini 3.1 Pro is genuinely impressive in specific areas.

| Benchmark          | Gemini 3.1 Pro | Opus 4.6 | GPT-5.2 |
|--------------------|----------------|----------|---------|
| ARC-AGI-2          | 77.1%          | 37.6%    | 54.2%   |
| GPQA Diamond       | 94.3%          | 91.3%    | 92.4%   |
| SWE-Bench Verified | 80.6%          | 72.6%    | -       |
| APEX-Agents        | 33.5%          | 29.8%    | 23.0%   |
| HLE (no tools)     | 44.4%          | 41.2%    | 34.5%   |

That ARC-AGI-2 jump is staggering—more than double its predecessor's 31.1%. The SWE-Bench score puts it ahead of every competitor on real-world GitHub issue resolution. And it achieves all of this at the same $2/M input token price as Gemini 3 Pro. Performance-per-dollar, nothing else comes close.
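The per-token pricing math is worth making concrete. A quick sketch, where the $2/M input price comes from the post but the benchmark size and token counts are invented purely for illustration:

```python
# Cost of an eval run at a given input-token price.
# The $2/M figure is Gemini 3.1 Pro's quoted input price;
# the token counts below are made up for illustration.

def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` input tokens at `price_per_million` USD per million."""
    return tokens / 1_000_000 * price_per_million

# e.g. a hypothetical 500-question benchmark averaging 2,000 input tokens each:
total_tokens = 500 * 2_000              # 1,000,000 tokens
print(f"${input_cost_usd(total_tokens, 2.0):.2f}")  # $2.00
```

At that price, running a million input tokens of your own evaluation costs about two dollars, which is part of why the price/performance comparison is so lopsided.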

Where It Doesn't

Now look at what Google didn't lead with.

| Benchmark          | Gemini 3.1 Pro | Best Competitor        |
|--------------------|----------------|------------------------|
| HLE (with tools)   | 51.4%          | 53.1% (Opus 4.6)       |
| GDPval-AA          | 1317           | 1633 (Sonnet 4.6)      |
| Terminal-Bench 2.0 | 68.5%          | 77.3% (GPT-5.3-Codex)  |

The GDPval-AA gap is striking: it measures broad enterprise task performance, and Sonnet 4.6 beats Gemini 3.1 Pro by more than 300 points. When you give models actual tools to use (HLE with tools), Opus 4.6 edges ahead. And on terminal-based coding tasks, GPT-5.3-Codex still leads comfortably.

Then there's the hallucination problem. Gemini 3.1 Pro has improved dramatically, dropping from an 88% hallucination rate (Gemini 3 Pro) to about 50%. But that still means half of its wrong answers are delivered with full confidence rather than a hedge: no "I'm not sure," just falsehoods stated with conviction. Claude and GLM-5 are notably better at expressing uncertainty when they don't know something.

Welcome to the Vibe Era

So which model is best? That's the wrong question now.

We've entered what you might call the vibe era of AI evaluation. Every major model excels at something and falls flat at something else. Gemini 3.1 Pro is the best reasoning model. Opus 4.6 is the best tool-using model. Sonnet 4.6 dominates enterprise workflows. GPT-5.3-Codex leads terminal coding. No single benchmark captures any of this.

The old framework was simple: bigger model + more data = better at everything. That framework is dead. Post-training specialization has shattered the idea that a single leaderboard can rank models meaningfully. A model that tops 13 of 16 benchmarks can still be the wrong choice for your specific use case.

What should replace benchmarks? Probably nothing universal. The honest answer is: try the models on your actual tasks. The era of trusting a leaderboard number to tell you which AI to use is over. We're going to have to vibe it out.
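"Try the models on your actual tasks" can be as simple as a small harness over your own workload. A minimal sketch: the tasks and the `stub_model` function here are toy stand-ins, and no real vendor SDK is assumed; in practice you'd swap in your API client of choice.

```python
# Minimal personal-eval harness: run each candidate model over YOUR tasks
# and tally pass rates. `stub_model` is a stand-in for a real API call.

from typing import Callable

Task = tuple[str, Callable[[str], bool]]  # (prompt, checker for the output)

def evaluate(model_fn: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose model output passes its checker."""
    passed = sum(1 for prompt, check in tasks if check(model_fn(prompt)))
    return passed / len(tasks)

# Toy tasks with programmatic checkers (replace with your real workload):
tasks: list[Task] = [
    ("Reverse the string 'abc'", lambda out: "cba" in out),
    ("What is 17 * 3?", lambda out: "51" in out),
]

# Stub "model" so the sketch runs end to end:
def stub_model(prompt: str) -> str:
    return "cba" if "Reverse" in prompt else "The answer is 51."

print(evaluate(stub_model, tasks))  # 1.0
```

A few dozen tasks drawn from your real workflow will tell you more about which model to use than any leaderboard position.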
