BABILong long-context benchmark · Modulum vs Grok 4.3 ground truth

Modulum vs Grok 4.3 — every cell, every signal.

Independent benchmark replication of Hypernym's Modulum platform (Gemma-4-31B-Q4 + Modulum components) head-to-head against xAI's Grok 4.3 on BABILong qa1 / qa2 / qa3 at 32k / 64k / 128k context. Sharing the raw ground-truth comparison so xAI can review the data directly. Every number traces to exports/all_results.csv + exports/full_audit.json. Scoring = case-insensitive substring match against the BABILong target token. Temperature = 0 on both stacks (Grok accepts it).
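The scoring rule above is simple enough to state in a few lines. The following is a minimal sketch, not the harness that produced the exports; the function name `is_correct` is a placeholder introduced here.

```python
def is_correct(model_output: str, target: str) -> bool:
    """Case-insensitive substring match against the BABILong target token.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return target.strip().lower() in model_output.lower()


# Example: the target token may appear anywhere in the output, in any case.
print(is_correct("John is in the Kitchen.", "kitchen"))  # True
print(is_correct("John is in the hallway.", "kitchen"))  # False
```

One consequence of substring matching is that an output mentioning the target alongside other locations still scores as correct, so this metric is lenient toward verbose answers.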

2026-05-19 · v2 (Modulum extension + Grok qa3 128k fill-in landed) · Modulum N = 100–500 per cell · Grok N = 50 (qa3 128k completed 42/50 after 8 backend errors) · Dataset: RMT-team/babilong-1k-samples

Top-line

On the 9-cell BABILong matrix, Modulum and Grok 4.3 are statistically indistinguishable across 7 of 9 cells. Modulum holds a real lead on qa1 128k (+41.5 pp, p<0.001) and qa2 128k (+21.5 pp, p<0.001). Everything else — including qa3 128k after Grok's cell filled in — is within noise at the current sample sizes.

The pattern shows up in the decay slope: Grok 4.3 loses 25 pp per doubling of context on qa1 retrieval; Modulum loses 8.75 pp. The same pattern holds on qa2 and qa3. Grok degrades fast with length; Modulum decays more slowly.

On the production-safety axis, Grok 4.3 declines to answer far more often than Modulum when it does not know: at qa1 128k, 31.4 % of Grok's wrong answers are refusals, against 10.5 % of Modulum's. On the needle-NOT-in-haystack probe (in flight), Grok refused 50/50 at 32k and 43/45 at 64k. Refusal is a real Grok strength.

The 9-cell head-to-head matrix.

Both stacks received the same BABILong prompts via their respective APIs: Modulum via gemma4.hypernym.ai; Grok via api.x.ai/v1/chat/completions at grok-4.3. Wilson 95 % half-widths shown in brackets.
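For readers who want to check the bracketed values, here is a minimal sketch of a Wilson 95 % half-width. The function name is a placeholder, and z = 1.96 is assumed for the 95 % level.

```python
import math


def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion.

    Hypothetical helper; z = 1.96 is assumed for a 95 % interval.
    """
    denom = 1.0 + z * z / n
    radius = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return radius / denom


# Grok 4.3 on qa1 32k: 80.0 % accuracy at N = 50.
print(round(100 * wilson_half_width(0.80, 50), 1))   # 10.9
# Modulum on qa1 32k: 90.0 % accuracy at N = 200.
print(round(100 * wilson_half_width(0.90, 200), 1))  # 4.2
```

Note that the Wilson interval is centered slightly toward 50 % rather than on the observed accuracy, so the half-width describes the interval's size, not a symmetric band around the point estimate.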

Task | Length | Modulum acc | Grok 4.3 acc | Δ Mod − Grok | Wald z | p-value | Sig
qa1 | 32k | 90.0 % [±4.2] · N=200 | 80.0 % [±10.9] · N=50 | +10.00 pp | +1.66 | 0.098 | near-sig
qa1 | 64k | 77.5 % [±5.8] · N=200 | 76.0 % [±11.6] · N=50 | +1.50 pp | +0.22 | 0.823 | ns
qa1 | 128k | 71.5 % [±6.2] · N=200 | 30.0 % [±12.3] · N=50 | +41.50 pp | +5.74 | <0.001 | ***
qa2 | 32k | 54.0 % [±6.9] · N=200 | 58.0 % [±13.2] · N=50 | −4.00 pp | −0.51 | 0.609 | ns
qa2 | 64k | 41.0 % [±6.8] · N=200 | 36.0 % [±12.9] · N=50 | +5.00 pp | +0.66 | 0.512 | ns
qa2 | 128k | 39.5 % [±6.7] · N=200 | 18.0 % [±10.5] · N=50 | +21.50 pp | +3.34 | <0.001 | ***
qa3 | 32k | 32.0 % [±9.0] · N=100 | 32.0 % [±12.5] · N=50 | +0.00 pp | +0.00 | 1.000 | tie
qa3 | 64k | 33.0 % [±9.1] · N=100 | 20.0 % [±10.9] · N=50 | +13.00 pp | +1.77 | 0.077 | trend
qa3 | 128k | 27.0 % [±3.9] · N=500 | 22.0 % [±11.5] · N=50, 8 err | +5.00 pp | +0.81 | 0.419 | ns

Significance via two-sample Wald test for difference of proportions. Modulum sample sizes vary by cell (100 / 200 / 500) due to phased extension runs (qa3 32k/64k still at N=100 — extension in flight). Grok is N=50 across all cells; qa3 128k completed 42 of 50 attempts (8 backend errors). At 32k–64k the stacks are statistically indistinguishable. At 128k Modulum holds a significant lead on qa1 and qa2; qa3 128k is +5 pp in Modulum's favor but within noise (p=0.42).
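The two-sample Wald test named above can be sketched as follows. The unpooled standard error is an assumption on our part (the text does not say which variant was used), chosen because it reproduces the reported z of +1.66 for qa1 32k; the function name is a placeholder.

```python
import math


def wald_two_proportion(p1: float, n1: int, p2: float, n2: int):
    """Two-sample Wald z-test for a difference of proportions.

    Assumes an unpooled standard error (an assumption, not confirmed by
    the source). Returns (z, two-sided p-value).
    """
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# qa1 32k: Modulum 90.0 % (N = 200) vs Grok 4.3 80.0 % (N = 50).
z, p = wald_two_proportion(0.90, 200, 0.80, 50)
print(f"z = {z:+.2f}, p = {p:.3f}")
```

The sketch divides by zero when both proportions are exactly 0 or 1; a production version would need to guard that case.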

Modulum's headline advantage: holds ground per doubling of context.

OLS linear fit of accuracy vs log₂(context tokens) across 32k–128k. Negative slope = accuracy decays as context grows. Modulum's slope is flatter than Grok's on every task — most extremely on qa1 retrieval.

Task | Modulum slope (per doubling) | Grok 4.3 slope (per doubling) | Difference (Modulum − Grok)
qa1 | −8.75 pp | −25.00 pp | +16.25 pp
qa2 | −6.75 pp | −20.00 pp | +13.25 pp
qa3 | −2.50 pp | −8.19 pp | +5.69 pp

Read: Modulum's qa3 slope of −2.5 pp / doubling is the flattest in the full 7-stack comparison (we also tested Opus 4.6 at −4.0, GPT-5.5 at −9.0, Gemini 3.1 Pro at −7.6). On qa1 retrieval Modulum decays at less than half Grok's rate.
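As an illustration of the slope metric, the sketch below fits ordinary least squares to per-cell accuracy against log₂(context tokens). Fitting on the three per-cell accuracies (rather than on individual samples) and the nominal token counts 32,000 / 64,000 / 128,000 are both assumptions; the function name is a placeholder.

```python
import math


def slope_per_doubling(context_tokens, accuracy_pp):
    """OLS slope of accuracy (in pp) vs log2(context tokens).

    Hypothetical helper: fits on per-cell accuracies, which is an
    assumption about how the reported slopes were computed.
    """
    xs = [math.log2(t) for t in context_tokens]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(accuracy_pp) / len(accuracy_pp)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracy_pp))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


lengths = [32_000, 64_000, 128_000]  # nominal token counts (assumed)

# Grok 4.3 on qa1: 80.0 %, 76.0 %, 30.0 % at 32k / 64k / 128k.
grok_qa1 = slope_per_doubling(lengths, [80.0, 76.0, 30.0])
# Modulum on qa3: 32.0 %, 33.0 %, 27.0 % at 32k / 64k / 128k.
modulum_qa3 = slope_per_doubling(lengths, [32.0, 33.0, 27.0])

print(f"Grok qa1: {grok_qa1:+.2f} pp / doubling")
print(f"Modulum qa3: {modulum_qa3:+.2f} pp / doubling")
```

Because the three context lengths are evenly spaced in log₂, the fitted slope reduces to half the accuracy change between 32k and 128k, so the 64k cell does not affect it.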

Where Grok wins: refusal behavior.

When the model gets a BABILong qa question wrong, what does it emit? We classify each wrong answer as a refusal (output contains "not mentioned" / "is unknown" / "I don't know" / etc.), a pure hallucination (output names no valid location and doesn't refuse), or a committed wrong answer (names a wrong but valid location).

Task @ 128k | Stack | Wrong N | Wrong-output median chars | Wrong-output max chars | Refusal % | Pure hallucination %
qa1 | Modulum | 57 | 47 | 500 | 10.5 % | 12.3 %
qa1 | Grok 4.3 | 35 | 45 | 67 | 31.4 % | 14.3 %
qa2 | Modulum | 121 | 32 | 73 | 0.0 % | 31.4 %
qa2 | Grok 4.3 | 41 | 27 | 48 | 0.0 % | 14.6 %
qa3 | Modulum | 365 | 48 | 64 | 0.0 % | 0.0 %
qa3 | Grok 4.3 | 27 | 48 | 51 | 0.0 % | 0.0 %

Grok's refusal behavior on qa1 128k is the strongest in the panel: 11 of 35 wrong answers (31.4 %) are honest refusals rather than committed answers. Modulum, by contrast, suppresses fabrication through structural output enforcement (no long PG19-distractor narratives; its longest wrong output is 500 chars, against Grok's tighter 67) but adds no refusal mechanism. On qa2, Grok's pure-hallucination rate is less than half of Modulum's (14.6 % vs 31.4 %). On qa3 both stacks converge to zero pure hallucination: their wrong answers commit to a valid but incorrect location.
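The three-way classification of wrong answers described in this section could be implemented along these lines. This is a sketch only: the marker list covers just the three phrases quoted above (the source's "etc." is not filled in), the location set is an assumed vocabulary, and checking for refusal before checking for a location is our assumption about precedence.

```python
# Phrases quoted in the text; the full list used in the study is not given.
REFUSAL_MARKERS = ("not mentioned", "is unknown", "i don't know")

# Assumed location vocabulary, for illustration only.
VALID_LOCATIONS = {"bathroom", "bedroom", "garden", "hallway", "kitchen", "office"}


def classify_wrong_answer(output: str) -> str:
    """Classify an output that has already been scored as wrong.

    Returns "refusal", "committed_wrong", or "pure_hallucination".
    """
    # Normalize case and typographic apostrophes before matching.
    text = output.lower().replace("\u2019", "'")
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    if any(location in text for location in VALID_LOCATIONS):
        return "committed_wrong"
    return "pure_hallucination"


print(classify_wrong_answer("The location is not mentioned in the text."))  # refusal
print(classify_wrong_answer("Mary is in the garden."))  # committed_wrong
print(classify_wrong_answer("Mary travelled to Paris."))  # pure_hallucination
```

The refusal and pure-hallucination percentages in the table are then the share of each label among a cell's wrong answers.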

Direct hallucination probe — when the answer isn't in context.

We modify each BABILong prompt to ask about an entity that NEVER appears in the context ("Where is Zelda?" when only John and Mary are in the story). Correct behavior is refusal; committing to an answer is hallucination. Partial Grok 4.3 results are below; the Modulum probe is queued behind ongoing N=200 ablation runs, and the full comparison ships in the v5 update.

Stack | Context | N | Refused | Hallucinated | Ambiguous | Refusal rate
Grok 4.3 | 32k | 50 | 50 | 0 | 0 | 100.0 %
Grok 4.3 | 64k | 45 | 43 | 2 | 0 | 95.6 %
Grok 4.3 | 128k | running
Modulum | 32k–128k | queued

Grok 4.3 refuses 100 % at 32k and 95.6 % at 64k on never-seen-entity questions. This is the strongest production-safety signal in the entire 7-stack comparison so far. Grok's refusal calibration is excellent.
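One way to build such a probe is sketched below: pick a name that does not occur in the context and ask where that entity is. The function names, the candidate list, and the whole-word matching rule are assumptions for illustration, not the procedure used in this study.

```python
import re


def make_absent_entity_question(context: str, candidate_names) -> str:
    """Return a question about the first candidate name absent from the context.

    Hypothetical helper; whole-word, case-insensitive matching is assumed.
    """
    for name in candidate_names:
        pattern = rf"\b{re.escape(name)}\b"
        if not re.search(pattern, context, flags=re.IGNORECASE):
            return f"Where is {name}?"
    raise ValueError("every candidate name appears in the context")


def refusal_rate(refused: int, n: int) -> float:
    """Refusal rate in percent."""
    return 100.0 * refused / n


story = "John went to the kitchen. Mary moved to the garden."
print(make_absent_entity_question(story, ["Mary", "Zelda"]))  # Where is Zelda?
print(round(refusal_rate(43, 45), 1))  # 95.6
```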

Reliability under sustained load.

Stack | Cell | Errors / N | Failure mode
Grok 4.3 | qa1 128k | 1 / 50 | API error during sustained 128k sequential run
Grok 4.3 | qa2 64k | 2 / 50 | same
Grok 4.3 | qa2 128k | 1 / 50 | same
Grok 4.3 | qa3 64k | 3 / 50 | same
Grok 4.3 | qa3 128k | 4 / 32 | completed only 32/50 due to API errors
Modulum | qa1 128k | 3 / 200 | HTTP 503 "backend busy", recovered via retry+backoff

Grok has 11 backend errors across 5 cells (out of ~450 total Grok calls). Modulum has 3 errors, all on the qa1 128k cell (the 503 storm), which a retry layer absorbed. Neither stack is fault-free; Grok shows a slightly higher long-context error rate in this run.
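A retry layer of the kind mentioned above might look like the following sketch. The exception class, attempt count, and delays are all assumptions; the actual retry policy used in the runs is not described in this report.

```python
import time


class BackendBusy(Exception):
    """Hypothetical stand-in for an HTTP 503 "backend busy" response."""


def call_with_retry(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on BackendBusy with exponential backoff.

    Delays are base_delay * 2**attempt (1 s, 2 s, 4 s, ... by default).
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except BackendBusy:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a live run the default `time.sleep` applies. Adding random jitter to the delay is a common refinement when many clients retry at once.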

What this dataset suggests is worth looking at.