
Gemini 3 Flash: Perfect Long Context Scores

Fiction.LiveBench is a benchmark that tests genuine comprehension of long narratives—not just simple retrieval. Based on stories from Fiction.live, it evaluates whether models can understand character motivations, track event sequences, and make inferences from implicit information across varying context lengths.

Looking at the latest results, Gemini 3 Flash Preview achieves 100% accuracy across ALL context lengths tested—from 0 to 192k tokens.

This is remarkable. Most models show degradation as context length increases. Even GPT-5.2 dips to 96.9% at 192k, and Claude Opus 4.5 drops to 80% at the longest contexts.

The Chart

[Line chart: Fiction.LiveBench accuracy (75% to 100%) versus context length for gemini-3-flash-preview, gpt-5.2, claude-opus-4-5, and kimi-k2.5]

Full Results

Accuracy (%) by context length (tokens):

Model             0      1k     4k     8k     16k    32k    60k    120k   192k
gemini-3-flash    100    100    100    100    100    100    100    100    100
gpt-5.2           100    100    100    97.2   100    97.2   97.2   100    96.9
gemini-3-pro      100    100    100    97.2   96.6   94.4   100    96.9   96.9
claude-opus-4-5   87.5   94.4   97.2   91.7   94.4   97.2   93.8   80     -
kimi-k2.5         100    100    100    88.9   86.1   88.9   89.8   78.1   87.5
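
To put numbers on the degradation, here's a quick sketch in Python. The scores are transcribed from the table above; the missing claude-opus-4-5 entry at 192k is recorded as None.

```python
# Accuracy (%) at context lengths 0, 1k, 4k, 8k, 16k, 32k, 60k, 120k, 192k,
# transcribed from the table above; the missing entry is None.
results = {
    "gemini-3-flash":  [100, 100, 100, 100, 100, 100, 100, 100, 100],
    "gpt-5.2":         [100, 100, 100, 97.2, 100, 97.2, 97.2, 100, 96.9],
    "gemini-3-pro":    [100, 100, 100, 97.2, 96.6, 94.4, 100, 96.9, 96.9],
    "claude-opus-4-5": [87.5, 94.4, 97.2, 91.7, 94.4, 97.2, 93.8, 80, None],
    "kimi-k2.5":       [100, 100, 100, 88.9, 86.1, 88.9, 89.8, 78.1, 87.5],
}

for model, scores in results.items():
    valid = [s for s in scores if s is not None]
    # Worst score across all lengths, and the spread between best and worst.
    print(f"{model:<16} min={min(valid):5.1f}  spread={max(valid) - min(valid):4.1f}")
```

Only gemini-3-flash has zero spread; kimi-k2.5 swings by more than 20 points across lengths.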

Why It Matters

Unlike simple needle-in-a-haystack retrieval tests, Fiction.LiveBench requires genuine comprehension. It evaluates:

  • Theory of mind for characters
  • Understanding of event chronology
  • Inferences from implicitly stated information

Perfect scores across all context lengths mean Gemini 3 Flash maintains full comprehension even when processing massive documents. For agentic workflows with long context, this is a significant differentiator.

The challenge with long context is the quadratic scaling of attention: naively, doubling the context length quadruples both the compute and the size of the attention score matrix. That Gemini 3 Flash maintains perfect comprehension while presumably optimizing around this constraint is impressive engineering.
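
For a sense of scale, here's a minimal sketch of the naive cost, assuming fp16 scores and an implementation that materializes the full score matrix (production kernels such as FlashAttention avoid storing it, though the quadratic compute remains):

```python
def score_matrix_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    """Memory to materialize one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * dtype_bytes

# Doubling the context length quadruples the score-matrix memory.
for n in [16_000, 32_000, 64_000, 128_000, 192_000]:
    gib = score_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:7.2f} GiB per head, per layer")
```

At 192k tokens that's roughly 69 GiB for a single head in a single layer, which is why nobody ships the naive version.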

Personal Note

I'm curious how far this 100% comprehension can go. The current benchmark tops out at 192k tokens. Will Flash maintain perfect scores at 500k? 1M? At some point every architecture has limits, and I'd love to see where Flash finally drops off.


Source: Fiction.LiveBench (Jan 30, 2026)