BABILong long-context benchmark · Modulum vs Grok 4.3 ground truth

Modulum vs Grok 4.3 — every cell, every signal.

Independent benchmark replication of Hypernym's Modulum platform (Gemma-4-31B-Q4 + Modulum components) head-to-head against xAI's Grok 4.3 on BABILong qa1 / qa2 / qa3 at 32k / 64k / 128k context. Sharing the raw ground-truth comparison so xAI can review the data directly. Every number traces to exports/all_results.csv + exports/full_audit.json. Scoring = case-insensitive substring match against the BABILong target token. Temperature = 0 on both stacks (Grok accepts it).
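The scoring rule stated above can be sketched in a few lines. This is a minimal illustration, not the harness behind exports/all_results.csv; the function names `is_correct` and `accuracy` are hypothetical.

```python
def is_correct(model_output: str, target: str) -> bool:
    """Case-insensitive substring match against the target token."""
    return target.strip().lower() in model_output.lower()


def accuracy(outputs: list[str], targets: list[str]) -> float:
    """Fraction of outputs that contain their target token."""
    hits = sum(is_correct(o, t) for o, t in zip(outputs, targets))
    return hits / len(targets)


print(is_correct("Mary is in the Kitchen.", "kitchen"))  # True
print(is_correct("It is not mentioned.", "garden"))      # False
```

One property of substring scoring worth keeping in mind: an output that lists several locations, including the target, is still scored correct.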

2026-05-19 · v3 auto-regen 2026-05-18 19:31 UTC · Modulum extension fully landed (N=200 on 5 cells) · Modulum N = 200–500 per cell · Grok N = 50 (qa3 128k completed 42/50 after 8 backend errors) · Dataset: RMT-team/babilong-1k-samples

Top-line

On the 9-cell BABILong matrix, Modulum and Grok 4.3 are statistically indistinguishable in 7 of 9 cells. Modulum holds a statistically significant lead on qa1 128k (+41.5 pp, p<0.001) and qa2 128k (+21.5 pp, p<0.001). Everything else, including qa3 128k after Grok's cell filled in, is within noise at the current sample sizes.

The pattern shows in the decay slope: Grok 4.3 loses 25 pp per doubling of context on qa1 retrieval; Modulum loses 8.75 pp. The same ordering holds on qa2 and qa3: Grok degrades fast with length; Modulum holds.

2026-05-20 statistical update, paired retention finding: on samples that both Modulum and Grok-class models got right at 32k, Modulum retains 3.5× more of those answers at 128k than vanilla Gemma-4 on the same base (McNemar's p=0.0003). The corresponding Grok comparison: Modulum's qa1 64k→128k retention is 73.9 % vs Grok's qa1 retention of 38.0 % (computed from cross-stack canonical data). Modulum's First-Not-Lost claim now rests on this paired-retention result, not the earlier OLS slope.
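For readers who want to recompute a paired-retention figure from per-sample results, a minimal sketch follows. It assumes each sample carries a right/wrong flag at the short and the long context, and it uses the exact (binomial) form of McNemar's test; whether the reported p-value used the exact or the chi-square form is not stated here. The function names and the counts in the usage lines are hypothetical, not taken from the exports.

```python
from math import comb


def retention(right_short: list[bool], right_long: list[bool]) -> float:
    """Share of samples correct at the short context that stay correct at the long one."""
    base = sum(1 for s in right_short if s)
    kept = sum(1 for s, l in zip(right_short, right_long) if s and l)
    return kept / base if base else float("nan")


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b = pairs where only stack A is right, c = pairs where only stack B is right.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
print(retention([True, True, True, False], [True, False, True, True]))
print(mcnemar_exact(10, 2))
```

Only the discordant pairs enter the test, which is why it suits a same-samples, two-stack comparison.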

On the production-safety axis, Grok 4.3 refuses far more readily than Modulum when it is wrong: Grok refuses on 31.4 % of wrong qa1 128k answers; Modulum refuses on 10.5 %. On the needle-NOT-in-haystack probe (in flight), Grok refused 50/50 at 32k and 43/45 at 64k. Refusal is a real Grok strength.

01 · Per-cell accuracy

The 9-cell head-to-head matrix.

Both stacks received the same BABILong prompts through their respective APIs: Modulum via gemma4.hypernym.ai, Grok via api.x.ai/v1/chat/completions at grok-4.3. Wilson 95 % half-widths shown in brackets.

Task	Length	Modulum acc	Grok 4.3 acc	Δ Mod − Grok	Wald z	p-value	Sig
qa1	32k	90.0 % [±4.2] · N=200	80.0 % [±10.9] · N=50	+10.00 pp	+1.66	0.098	near-sig
qa1	64k	77.5 % [±5.8] · N=200	76.0 % [±11.6] · N=50	+1.50 pp	+0.22	0.823	ns
qa1	128k	71.5 % [±6.2] · N=200	30.0 % [±12.3] · N=50, 1 err	+41.50 pp	+5.74	<0.001	***
qa2	32k	54.0 % [±6.8] · N=200	58.0 % [±13.2] · N=50	−4.00 pp	−0.51	0.609	ns
qa2	64k	41.0 % [±6.8] · N=200	36.0 % [±12.9] · N=50, 2 err	+5.00 pp	+0.66	0.512	ns
qa2	128k	39.5 % [±6.7] · N=200	18.0 % [±10.5] · N=50, 1 err	+21.50 pp	+3.34	<0.001	***
qa3	32k	31.5 % [±6.4] · N=200	32.0 % [±12.5] · N=50	−0.50 pp	−0.07	0.946	ns
qa3	64k	32.0 % [±6.4] · N=200	20.0 % [±10.9] · N=50, 3 err	+12.00 pp	+1.83	0.067	near-sig
qa3	128k	27.0 % [±3.9] · N=500	22.0 % [±11.2] · N=50, 8 err	+5.00 pp	+0.81	0.419	ns

Significance via two-sample Wald test for difference of proportions. Modulum sample sizes vary by cell (N=200 on eight cells, N=500 on qa3 128k) due to phased extension runs. Grok is N=50 across all cells; qa3 128k completed 42 of 50 attempts (8 backend errors). At 32k–64k no cell reaches significance at p<0.05. At 128k Modulum holds a significant lead on qa1 and qa2; qa3 128k is +5 pp in Modulum's favor but within noise (p=0.42).
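The bracketed half-widths and the z / p columns can be recomputed from counts alone. The sketch below assumes z = 1.96 for the Wilson interval and the unpooled standard error for the Wald test, which appears to reproduce the table's z column. The counts 143/200 and 15/50 are derived from the qa1 128k percentages and N, not read from the exports. Function names are hypothetical.

```python
from math import erfc, sqrt

Z95 = 1.96  # normal quantile for a 95 % interval (assumed)


def wilson_half_width(k: int, n: int, z: float = Z95) -> float:
    """Half-width of the Wilson score interval for k successes out of n."""
    p = k / n
    return z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)


def wald_two_sample(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Unpooled two-sample Wald z and two-sided p-value for p1 - p2."""
    p1, p2 = k1 / n1, k2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))


# qa1 128k: Modulum 143/200 (71.5 %), Grok 15/50 (30.0 %)
z, p = wald_two_sample(143, 200, 15, 50)
print(f"z = {z:+.2f}, p = {p:.2g}")                    # z = +5.74, p well below 0.001
print(f"±{100 * wilson_half_width(15, 50):.1f} pp")    # ±12.3 pp
```

With N=50 on the Grok side, the Grok term dominates the standard error in every cell, so enlarging the Modulum sample alone narrows the comparison only slightly.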

02 · Decay slope

Modulum's headline advantage: it loses less accuracy per doubling of context.

OLS linear fit of accuracy vs log₂(context tokens) across 32k–128k. Negative slope = accuracy decays as context grows. Modulum's slope is flatter than Grok's on every task — most extremely on qa1 retrieval.

Task	Modulum slope (pp / doubling)	Grok 4.3 slope (pp / doubling)	Δ Mod − Grok
qa1	−8.75 pp	−25.00 pp	+16.25 pp
qa2	−6.75 pp	−20.00 pp	+13.25 pp
qa3	−2.50 pp	−8.19 pp	+5.69 pp

Read: Modulum's qa3 slope of −2.5 pp / doubling is the flattest in the full 7-stack comparison (we also tested Opus 4.6 at −4.0, GPT-5.5 at −9.0, Gemini 3.1 Pro at −7.6). On qa1 retrieval Modulum decays at less than half Grok's rate.
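The slope fit described in this section is an ordinary least-squares line through three points. A minimal sketch, assuming nominal token counts of 32,000 / 64,000 / 128,000 (the slope is the same for any three lengths that double each time); `ols_slope` is a hypothetical helper. Fed the qa1 accuracies printed in the section 01 table, it returns −25.00 pp per doubling for Grok and −9.25 pp for Modulum. The Modulum figure is close to, but not identical with, the −8.75 pp quoted above, which would be expected if the quoted slopes were fit on an earlier snapshot of the runs.

```python
from math import log2


def ols_slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


CONTEXT_TOKENS = [32_000, 64_000, 128_000]  # nominal lengths (assumed)
xs = [log2(t) for t in CONTEXT_TOKENS]

# qa1 accuracies in percent, copied from the per-cell table
grok_qa1 = [80.0, 76.0, 30.0]
modulum_qa1 = [90.0, 77.5, 71.5]

print(f"Grok    qa1: {ols_slope(xs, grok_qa1):+.2f} pp / doubling")
print(f"Modulum qa1: {ols_slope(xs, modulum_qa1):+.2f} pp / doubling")
```

With three equally spaced points the OLS slope reduces to (last − first) / 2, so the middle cell does not affect it. That is worth remembering when reading a slope as a summary of a curve whose drop is concentrated in one interval.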

03 · Behavior on wrong answers @ 128k

Where Grok wins: refusal behavior.

When the model gets a BABILong qa question wrong, what does it emit? We classify each wrong answer as a refusal (output contains "not mentioned" / "is unknown" / "I don't know" / etc.), a pure hallucination (output names no valid location and doesn't refuse), or a committed wrong answer (names a wrong but valid location).

Task @ 128k	Stack	Wrong N	Wrong-output median chars	Wrong-output max chars	Refusal %	Pure hallucination %
qa1	Modulum	57	47	500	10.5 %	12.3 %
qa1	Grok 4.3	35	45	67	31.4 %	14.3 %
qa2	Modulum	121	32	73	0.0 %	31.4 %
qa2	Grok 4.3	41	27	48	0.0 %	14.6 %
qa3	Modulum	365	48	64	0.0 %	0.0 %
qa3	Grok 4.3	27	48	51	0.0 %	0.0 %

Grok's refusal behavior on qa1 128k is the strongest in the panel: 11 of 35 wrong answers (31.4 %) refuse instead of committing. Modulum, by contrast, limits fabrication through structural output enforcement (no long PG19-distractor narratives; its longest wrong output is 500 chars, against Grok's tighter maximum of 67) but adds no refusal mechanism. On qa2 Grok's pure-hallucination rate is roughly half of Modulum's (14.6 % vs 31.4 %). On qa3 both stacks show zero refusals and zero pure hallucination: every wrong answer commits to a valid but wrong location.
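The three-way classification of wrong answers can be sketched as below. The marker list covers only the three phrases named in this section (the full list is elided with "etc." and is not reproduced here), the six location names are the standard bAbI rooms and are an assumption about the candidate set, and checking refusal before location is also an assumption. `classify_wrong_answer` is a hypothetical name.

```python
# Phrases named in the text; the complete marker list is not given.
REFUSAL_MARKERS = ("not mentioned", "is unknown", "i don't know")
# Assumed candidate set: the six bAbI locations.
VALID_LOCATIONS = ("bathroom", "bedroom", "garden", "hallway", "kitchen", "office")


def classify_wrong_answer(output: str) -> str:
    """Label an already-wrong output as refusal, committed_wrong, or pure_hallucination."""
    text = output.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    if any(location in text for location in VALID_LOCATIONS):
        return "committed_wrong"
    return "pure_hallucination"


print(classify_wrong_answer("The location is not mentioned in the text."))  # refusal
print(classify_wrong_answer("Mary is in the garden."))                      # committed_wrong
print(classify_wrong_answer("Mary sailed to the lighthouse."))              # pure_hallucination
```

An output that both refuses and names a location lands in "refusal" under this ordering; flipping the order would move such cases into "committed_wrong", so the reported percentages depend on that choice.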

04 · Needle-NOT-in-haystack probe (Grok partial · Modulum queued)

Direct hallucination probe — when the answer isn't in context.

We modify each BABILong prompt to ask about an entity that NEVER appears in the context ("Where is Zelda?" when only John and Mary are in the story). Correct behavior is refusal; commitment is hallucination. Grok 4.3 partial results below — Modulum probe queued behind ongoing N=200 ablation runs; full comparison ships in v5 update.

Stack	Context	N	Refused	Hallucinated	Ambiguous	Refusal rate
Grok 4.3	32k	50	50	0	0	100.0 %
Grok 4.3	64k	45	43	2	0	95.6 %
Grok 4.3	128k	running	—	—	—	—
Modulum	32k–128k	queued	—	—	—	—

Grok 4.3 refuses 100 % at 32k and 95.6 % at 64k on never-seen-entity questions. This is the strongest production-safety signal in the entire 7-stack comparison so far. Grok's refusal calibration is excellent.
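A probe question of this kind needs an entity that is verifiably absent from the context. The sketch below shows one way to build it, assuming a caller-supplied list of candidate names and a whole-word, case-insensitive check; it is an illustration, not the probe generator used for these runs, and `absent_entity_question` is a hypothetical name.

```python
import re


def absent_entity_question(context: str, candidate_names: list[str]) -> str:
    """Return a 'Where is X?' question for the first candidate that never appears in context."""
    lowered = context.lower()
    for name in candidate_names:
        pattern = rf"\b{re.escape(name.lower())}\b"
        if not re.search(pattern, lowered):
            return f"Where is {name}?"
    raise ValueError("every candidate name appears in the context")


story = "John went to the kitchen. Mary moved to the garden."
print(absent_entity_question(story, ["Mary", "Zelda"]))  # Where is Zelda?
```

At 128k the distractor text is long enough that a hand-picked name may appear by coincidence, so the absence check matters more as context grows.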

05 · Backend errors

Reliability under sustained load.

Stack	Cell	Errors / N	Failure mode
Grok 4.3	qa1 128k	1 / 50	API error during sustained 128k sequential run
Grok 4.3	qa2 64k	2 / 50	same
Grok 4.3	qa2 128k	1 / 50	same
Grok 4.3	qa3 64k	3 / 50	same
Grok 4.3	qa3 128k	8 / 50	completed only 42/50 due to API errors
Modulum	qa1 128k	3 / 200	HTTP 503 "backend busy"; recovered via retry+backoff

Grok has 15 backend errors across 5 cells (out of ~450 total Grok calls). Modulum has 3 errors, all on the qa1 128k cell (the 503-storm), which a retry layer absorbed. Neither stack is fault-free; Grok has the higher long-context error rate in this run.

06 · Open questions for xAI

What this dataset suggests is worth looking at.

Long-context decay rate. Grok 4.3 loses 25 pp per doubling on qa1 (80.0 % at 32k to 30.0 % at 128k). That's the steepest qa1 slope in the 7-stack panel (Modulum −8.75, Opus 4.6 −0, GPT-5.5 −2). Most of the drop falls between 64k and 128k (76.0 % → 30.0 %), which may point to a sparse-attention threshold in that range.
qa3 at 128k sits near chance. Grok's qa3 128k accuracy is 22.0 % [±11.2] at N=50, and the random-guess floor is ~17 % (6 location candidates), so the cell is not distinguishable from chance. If you have an internal hypothesis for why qa3 specifically collapses, this dataset can validate it cell-by-cell.
Backend reliability. 8 of 50 qa3 128k requests failed, a 16 % error rate on long-context completions. The cause may be a model-side timeout or an exhausted reasoning-token budget.
Refusal calibration is exceptional. 100 % refusal at 32k on the needle-not-in-haystack probe is the strongest production-safety signal in our entire dataset. Grok's not advertising this loudly; it's a real differentiator.