LMF Blog

Fiction.LiveBench is a benchmark that tests genuine comprehension of long narratives—not just simple retrieval. Based on stories from Fiction.live, it evaluates whether models can understand character motivations, track event sequences, and make inferences from implicit information across varying context lengths.

Looking at the latest results, Gemini 3 Flash Preview achieves 100% accuracy across ALL context lengths tested—from 0 to 192k tokens.

This is remarkable. Most models show degradation as context length increases. Even GPT-5.2 dips to 96.9% at 192k, and Claude Opus 4.5 drops to 80% at 120k, the longest context for which it has a reported score.

The Chart

[Chart comparing gemini-3-flash-preview, gpt-5.2, claude-opus-4-5, and kimi-k2.5]

Full Results (accuracy in %, by context length in tokens)

Model	0	1k	4k	8k	16k	32k	60k	120k	192k
gemini-3-flash	100	100	100	100	100	100	100	100	100
gpt-5.2	100	100	100	97.2	100	97.2	97.2	100	96.9
gemini-3-pro	100	100	100	97.2	96.6	94.4	100	96.9	96.9
claude-opus-4-5	87.5	94.4	97.2	91.7	94.4	97.2	93.8	80	-
kimi-k2.5	100	100	100	88.9	86.1	88.9	89.8	78.1	87.5
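
To make the degradation easier to compare, here is a minimal sketch that summarizes the table above: the mean score per model and the context length where each model scores lowest. The numbers are copied from the table; the `summarize` helper and the choice of statistics are my own illustration, not part of the benchmark. The missing 192k entry for claude-opus-4-5 is skipped, so its mean covers one fewer context length than the others.

```python
from statistics import mean

CONTEXT_LENGTHS = ["0", "1k", "4k", "8k", "16k", "32k", "60k", "120k", "192k"]

# Scores copied from the results table; None marks the unreported "-" cell.
RESULTS = {
    "gemini-3-flash": [100, 100, 100, 100, 100, 100, 100, 100, 100],
    "gpt-5.2": [100, 100, 100, 97.2, 100, 97.2, 97.2, 100, 96.9],
    "gemini-3-pro": [100, 100, 100, 97.2, 96.6, 94.4, 100, 96.9, 96.9],
    "claude-opus-4-5": [87.5, 94.4, 97.2, 91.7, 94.4, 97.2, 93.8, 80, None],
    "kimi-k2.5": [100, 100, 100, 88.9, 86.1, 88.9, 89.8, 78.1, 87.5],
}


def summarize(scores):
    """Hypothetical helper: mean and lowest score over reported lengths only."""
    reported = [
        (length, score)
        for length, score in zip(CONTEXT_LENGTHS, scores)
        if score is not None
    ]
    # On ties, min() keeps the first (shortest) context length.
    worst_length, worst_score = min(reported, key=lambda pair: pair[1])
    return {
        "mean": round(mean(score for _, score in reported), 1),
        "min": worst_score,
        "min_at": worst_length,
    }


for model, scores in RESULTS.items():
    print(model, summarize(scores))
```

A single mean hides where the drop happens, which is why the sketch also reports the lowest-scoring context length.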

Why It Matters

Fiction.LiveBench tests genuine comprehension, not just retrieval. It evaluates:

Theory of mind for characters
Understanding of event chronology
Inferences from implicitly stated information

Perfect scores at every tested context length mean Gemini 3 Flash's comprehension holds up even on inputs as long as 192k tokens. For agentic workflows with long context, this is a significant differentiator.

The challenge with long context is that standard self-attention scales quadratically with sequence length: doubling the context roughly quadruples the attention computation and, in a naive implementation, the memory for the attention score matrix. That Gemini 3 Flash maintains perfect comprehension while presumably optimizing around this constraint is impressive engineering.
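
For a rough sense of that quadratic growth, here is a back-of-the-envelope sketch. It assumes a naive implementation that stores one full n x n score matrix per head per layer in 16-bit floats; the function names and the 2-byte entry size are my assumptions. It says nothing about how Gemini (or any production model) actually implements attention, and memory-efficient kernels avoid materializing this matrix at all.

```python
def attention_score_entries(context_len: int) -> int:
    """Entries in one full (unmasked) n x n attention score matrix."""
    return context_len * context_len


def naive_score_matrix_bytes(context_len: int, bytes_per_entry: int = 2) -> int:
    """Bytes to store that matrix, assuming 2-byte (fp16/bf16) entries."""
    return attention_score_entries(context_len) * bytes_per_entry


# Doubling the context quadruples the number of score entries.
ratio = attention_score_entries(2_000) / attention_score_entries(1_000)
print(f"2x context -> {ratio:.0f}x score entries")

for n in (1_000, 32_000, 192_000):
    gib = naive_score_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:.3f} GiB per head per layer (naive)")
```

The exact figures matter less than the shape of the curve: at the benchmark's longest contexts, the naive approach is clearly impractical, so some form of optimization is needed.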

Personal Note

I'm curious how far this 100% accuracy can go. The current benchmark tops out at 192k tokens. Will Flash maintain perfect scores at 500k? 1M? At some point, every architecture has limits, and I'd love to see where Flash finally drops off.

Source: Fiction.LiveBench (Jan 30, 2026)