Fiction.LiveBench is a benchmark that tests genuine comprehension of long narratives—not just simple retrieval. Based on stories from Fiction.live, it evaluates whether models can understand character motivations, track event sequences, and make inferences from implicit information across varying context lengths.
Looking at the latest results, Gemini 3 Flash Preview achieves 100% accuracy across ALL context lengths tested—from 0 to 192k tokens.
This is remarkable. Most models degrade as context length increases. Even GPT-5.2 dips to 96.9% at 192k, and Claude Opus 4.5 drops to 80% at 120k, with no score reported at 192k.
Full Results
| Model | 0 | 1k | 4k | 8k | 16k | 32k | 60k | 120k | 192k |
|---|---|---|---|---|---|---|---|---|---|
| gemini-3-flash | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| gpt-5.2 | 100 | 100 | 100 | 97.2 | 100 | 97.2 | 97.2 | 100 | 96.9 |
| gemini-3-pro | 100 | 100 | 100 | 97.2 | 96.6 | 94.4 | 100 | 96.9 | 96.9 |
| claude-opus-4-5 | 87.5 | 94.4 | 97.2 | 91.7 | 94.4 | 97.2 | 93.8 | 80 | - |
| kimi-k2.5 | 100 | 100 | 100 | 88.9 | 86.1 | 88.9 | 89.8 | 78.1 | 87.5 |
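To make the degradation easier to compare, here is a minimal sketch that finds each model's lowest score and where it occurs. The numbers are copied from the table above. The `summarize` helper and the choice of a plain mean over reported entries are my own illustration, not part of the benchmark's methodology.

```python
# Context-length labels exactly as they appear in the results table.
CONTEXTS = ["0", "1k", "4k", "8k", "16k", "32k", "60k", "120k", "192k"]

# Scores copied from the table; None marks the entry shown as "-".
SCORES = {
    "gemini-3-flash": [100, 100, 100, 100, 100, 100, 100, 100, 100],
    "gpt-5.2": [100, 100, 100, 97.2, 100, 97.2, 97.2, 100, 96.9],
    "gemini-3-pro": [100, 100, 100, 97.2, 96.6, 94.4, 100, 96.9, 96.9],
    "claude-opus-4-5": [87.5, 94.4, 97.2, 91.7, 94.4, 97.2, 93.8, 80, None],
    "kimi-k2.5": [100, 100, 100, 88.9, 86.1, 88.9, 89.8, 78.1, 87.5],
}


def summarize(scores):
    """Return (lowest score, its context label, mean), skipping missing entries.

    Hypothetical helper for illustration; ties resolve to the shortest context.
    """
    reported = [(s, c) for s, c in zip(scores, CONTEXTS) if s is not None]
    low, low_ctx = min(reported, key=lambda pair: pair[0])
    mean = sum(s for s, _ in reported) / len(reported)
    return low, low_ctx, round(mean, 1)


for model, scores in SCORES.items():
    low, ctx, mean = summarize(scores)
    print(f"{model:16s} min={low:5.1f} at {ctx:>4s}  mean={mean:5.1f}")
```

Note that the mean for claude-opus-4-5 covers only eight context lengths, so it is not strictly comparable with the other rows.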
Why It Matters
Fiction.LiveBench tests genuine comprehension, not just retrieval. It evaluates:
- Theory of mind for characters
- Understanding of event chronology
- Inferences from implicitly stated information
Perfect scores across all tested context lengths mean Gemini 3 Flash maintains full comprehension on this benchmark even at 192k tokens. For agentic workflows with long context, this is a significant differentiator.
The challenge with long context is the quadratic scaling of standard self-attention: doubling context length roughly quadruples the number of attention scores to compute, and quadruples memory too if the full score matrix is materialized. That Gemini 3 Flash maintains perfect comprehension while presumably optimizing for this constraint is impressive engineering.
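A quick back-of-the-envelope sketch of that quadratic growth is below. It assumes a naive attention implementation that stores the full n x n score matrix in 16-bit floats, for a single head in a single layer, and it treats the benchmark's "k" as 1,000 tokens. None of this describes Gemini's actual architecture, which is not public. The function name `score_matrix_bytes` is made up for illustration.

```python
def score_matrix_bytes(context_tokens, bytes_per_value=2):
    """Bytes needed for one n x n attention score matrix.

    Assumes naive attention (full matrix materialized), one head, one layer,
    and 2 bytes per value (fp16/bf16). Real systems often avoid storing this.
    """
    return context_tokens * context_tokens * bytes_per_value


# Context lengths from the benchmark table, read as multiples of 1,000 tokens.
for n in (1_000, 4_000, 8_000, 16_000, 32_000, 60_000, 120_000, 192_000):
    gib = score_matrix_bytes(n) / 2**30
    print(f"{n:>7,} tokens -> {gib:10.3f} GiB per head per layer")

# Doubling the context length quadruples the matrix size.
ratio = score_matrix_bytes(32_000) / score_matrix_bytes(16_000)
print(f"32k vs 16k: {ratio:.0f}x")
```

Under these assumptions a single 192k-token score matrix would take roughly 69 GiB, which is why production systems rely on techniques that never materialize it in full. Which of those techniques Flash uses, if any, is a guess on my part.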
Personal Note
I'm curious how far this 100% accuracy can go. The current benchmark tops out at 192k tokens. Will Flash maintain perfect scores at 500k? 1M? At some point, every architecture has limits, and I'd love to see where Flash finally drops off.
Source: Fiction.LiveBench (Jan 30, 2026)