This is an independent benchmark replication of Hypernym's Modulum platform (Gemma-4-31B-Q4 + Modulum components), run head-to-head against xAI's Grok 4.3 on BABILong qa1 / qa2 / qa3 at 32k / 64k / 128k context. We are sharing the raw ground-truth comparison so xAI can review the data directly. Every number traces to exports/all_results.csv and exports/full_audit.json. Scoring is a case-insensitive substring match against the BABILong target token. Temperature is 0 on both stacks (Grok accepts the parameter).
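The scoring rule is simple enough to state as code. The following is a minimal sketch, not the harness that produced the exports; the function names `is_correct` and `accuracy` and the (output, target) pair layout are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def is_correct(model_output: str, target: str) -> bool:
    """Case-insensitive substring match of the target token in the output."""
    return target.strip().lower() in model_output.lower()


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of (model_output, target) pairs scored as correct."""
    pairs = list(pairs)
    hits = sum(is_correct(output, target) for output, target in pairs)
    return 100.0 * hits / len(pairs)
```

For example, `is_correct("John is in the Kitchen.", "kitchen")` is true because the comparison ignores case.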
On the 9-cell BABILong matrix, Modulum and Grok 4.3 are statistically indistinguishable in 7 of 9 cells. Modulum holds a statistically significant lead on qa1 128k (+41.5 pp, p<0.001) and qa2 128k (+21.5 pp, p<0.001). Everything else, including qa3 128k after Grok's cell filled in, is within noise at the current sample sizes.
The pattern is visible in the decay slope: Grok 4.3 loses 25 pp per doubling of context on qa1 retrieval, while Modulum loses 9.25 pp. The same pattern holds on qa2 and qa3. Grok degrades fast with length; Modulum holds.
On the production-safety axis, Grok 4.3 refuses far more often than Modulum when it gets the answer wrong: Grok refuses on 31.4 % of wrong qa1 128k answers; Modulum refuses on 10.5 %. On the needle-NOT-in-haystack probe (in flight), Grok refused 50/50 at 32k and 43/45 at 64k. Refusal is a real Grok strength.
Both stacks were served the same BABILong prompts through their respective APIs: Modulum via gemma4.hypernym.ai, Grok via api.x.ai/v1/chat/completions with model grok-4.3. Wilson 95 % half-widths are shown in brackets.
| Task | Length | Modulum acc | Grok 4.3 acc | Δ Mod − Grok | Wald z | p-value | Sig |
|---|---|---|---|---|---|---|---|
| qa1 | 32k | 90.0 % [±4.2] · N=200 | 80.0 % [±10.9] · N=50 | +10.00 pp | +1.66 | 0.098 | trend |
| qa1 | 64k | 77.5 % [±5.8] · N=200 | 76.0 % [±11.6] · N=50 | +1.50 pp | +0.22 | 0.823 | ns |
| qa1 | 128k | 71.5 % [±6.2] · N=200 | 30.0 % [±12.3] · N=50 | +41.50 pp | +5.74 | <0.001 | *** |
| qa2 | 32k | 54.0 % [±6.9] · N=200 | 58.0 % [±13.2] · N=50 | −4.00 pp | −0.51 | 0.609 | ns |
| qa2 | 64k | 41.0 % [±6.8] · N=200 | 36.0 % [±12.9] · N=50 | +5.00 pp | +0.66 | 0.512 | ns |
| qa2 | 128k | 39.5 % [±6.7] · N=200 | 18.0 % [±10.5] · N=50 | +21.50 pp | +3.34 | <0.001 | *** |
| qa3 | 32k | 32.0 % [±9.0] · N=100 | 32.0 % [±12.5] · N=50 | +0.00 pp | +0.00 | 1.000 | tie |
| qa3 | 64k | 33.0 % [±9.1] · N=100 | 20.0 % [±10.9] · N=50 | +13.00 pp | +1.77 | 0.077 | trend |
| qa3 | 128k | 27.0 % [±3.9] · N=500 | 22.0 % [±11.5] · N=50, 8 err | +5.00 pp | +0.81 | 0.419 | ns |
Significance is computed with a two-sample Wald test for a difference of proportions. Modulum sample sizes vary by cell (100 / 200 / 500) because of phased extension runs; qa3 32k/64k are still at N=100, with the extension in flight. Grok is N=50 in every cell; on qa3 128k it completed 42 of 50 attempts (8 backend errors). At 32k–64k the stacks are statistically indistinguishable. At 128k Modulum holds a significant lead on qa1 and qa2; qa3 128k is +5 pp in Modulum's favor but within noise (p=0.42).
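To make the bracketed half-widths and the z / p columns reproducible, here is a minimal sketch of both calculations using only the standard library. It assumes the unpooled standard error for the Wald statistic (the tabulated z values are consistent with that choice, with Grok at N=50) and a two-sided p-value; the function names are illustrative, not taken from the benchmark code.

```python
import math


def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a proportion."""
    denom = 1.0 + z * z / n
    spread = math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return z * spread / denom


def wald_two_sample(p1: float, n1: int, p2: float, n2: int):
    """Unpooled two-sample Wald z and two-sided p for p1 - p2."""
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# qa1 128k row: Modulum 71.5 % (N=200) vs Grok 4.3 30.0 % (N=50).
z, p = wald_two_sample(0.715, 200, 0.30, 50)
print(f"z = {z:+.2f}, p = {p:.3g}")
print(f"Modulum qa1 32k half-width: {100 * wilson_half_width(0.90, 200):.1f} pp")
```

With Grok at N=50 the Wilson half-widths are roughly 10 to 13 pp, which is why only the large 128k gaps on qa1 and qa2 clear the significance threshold.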
OLS linear fit of accuracy vs log₂(context tokens) across 32k–128k. Negative slope = accuracy decays as context grows. Modulum's slope is flatter than Grok's on every task — most extremely on qa1 retrieval.
Read: Modulum's qa3 slope of −2.5 pp / doubling is the flattest in the full 7-stack comparison (we also tested Opus 4.6 at −4.0, GPT-5.5 at −9.0, Gemini 3.1 Pro at −7.6). On qa1 retrieval Modulum decays at less than half Grok's rate.
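The slope fit can be reproduced from the accuracy table with a few lines of code. This is a sketch under stated assumptions: the nominal token counts 32,000 / 64,000 / 128,000 stand in for the exact context sizes (only the spacing of one doubling matters), and `ols_slope` is an illustrative helper rather than the fitting code used for the report.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Assumption: nominal token counts for the 32k / 64k / 128k cells.
CONTEXT_TOKENS = [32_000, 64_000, 128_000]
LOG2_CONTEXT = [math.log2(t) for t in CONTEXT_TOKENS]

# Accuracy (%) at 32k, 64k, 128k, copied from the results table.
ACCURACY = {
    ("qa1", "Modulum"): [90.0, 77.5, 71.5],
    ("qa1", "Grok 4.3"): [80.0, 76.0, 30.0],
    ("qa2", "Modulum"): [54.0, 41.0, 39.5],
    ("qa2", "Grok 4.3"): [58.0, 36.0, 18.0],
    ("qa3", "Modulum"): [32.0, 33.0, 27.0],
    ("qa3", "Grok 4.3"): [32.0, 20.0, 22.0],
}

SLOPES = {key: ols_slope(LOG2_CONTEXT, ys) for key, ys in ACCURACY.items()}

for (task, stack), slope in SLOPES.items():
    print(f"{task} {stack}: {slope:+.2f} pp per doubling")
```

Because the three context lengths are equally spaced in log₂, the OLS slope reduces to half the difference between the 128k and 32k accuracies, so the 64k cell does not affect the slope at all.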
When the model gets a BABILong qa question wrong, what does it emit? We classify each wrong answer as a refusal (output contains "not mentioned" / "is unknown" / "I don't know" / etc.), a pure hallucination (output names no valid location and doesn't refuse), or a committed wrong answer (names a wrong but valid location).
| Task @ 128k | Stack | Wrong N | Wrong-output median chars | Wrong-output max chars | Refusal % | Pure hallucination % |
|---|---|---|---|---|---|---|
| qa1 | Modulum | 57 | 47 | 500 | 10.5 % | 12.3 % |
| qa1 | Grok 4.3 | 35 | 45 | 67 | 31.4 % | 14.3 % |
| qa2 | Modulum | 121 | 32 | 73 | 0.0 % | 31.4 % |
| qa2 | Grok 4.3 | 41 | 27 | 48 | 0.0 % | 14.6 % |
| qa3 | Modulum | 365 | 48 | 64 | 0.0 % | 0.0 % |
| qa3 | Grok 4.3 | 27 | 48 | 51 | 0.0 % | 0.0 % |
Grok's refusal behavior on qa1 128k is the strongest in the panel: 11 of 35 wrong answers (31.4 %) are refusals rather than committed answers. Modulum, by contrast, relies on structural output enforcement to suppress fabrication (no long PG19-distractor narratives; its longest wrong output is 500 chars, against Grok's tighter maximum of 67) but adds no refusal mechanism. On qa2 Grok's pure-hallucination rate is less than half of Modulum's (14.6 % vs 31.4 %). On qa3 both stacks converge to zero pure hallucination: every wrong answer names a valid but incorrect location.
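The three-way classification of wrong answers can be sketched as a small rule-based function. Several details here are assumptions: the refusal marker list contains only the three phrases quoted in the text (the full list is not given), the valid-location set is the standard bAbI location vocabulary, and a refusal marker takes precedence when an output both refuses and names a location.

```python
# Only the phrases quoted in the text; the report's full list is not specified.
REFUSAL_MARKERS = ("not mentioned", "is unknown", "i don't know")

# Assumption: the standard bAbI location vocabulary.
VALID_LOCATIONS = ("bathroom", "bedroom", "garden", "hallway", "kitchen", "office")


def classify_wrong_answer(output: str) -> str:
    """Label an already-wrong output as refusal, committed_wrong, or pure_hallucination."""
    # Normalize case and curly apostrophes so "I don’t know" still matches.
    text = output.lower().replace("\u2019", "'")
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"  # assumption: refusal wins over any named location
    if any(location in text for location in VALID_LOCATIONS):
        return "committed_wrong"
    return "pure_hallucination"
```

The function is meant to be applied only to outputs already scored as wrong, so any valid location it finds is by construction the wrong one.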
We modify each BABILong prompt to ask about an entity that NEVER appears in the context ("Where is Zelda?" when only John and Mary are in the story). Correct behavior is refusal; committing to an answer is hallucination. Partial Grok 4.3 results are below. The Modulum probe is queued behind ongoing N=200 ablation runs, and the full comparison ships in the v5 update.
| Stack | Context | N | Refused | Hallucinated | Ambiguous | Refusal rate |
|---|---|---|---|---|---|---|
| Grok 4.3 | 32k | 50 | 50 | 0 | 0 | 100.0 % |
| Grok 4.3 | 64k | 45 | 43 | 2 | 0 | 95.6 % |
| Grok 4.3 | 128k | running | — | — | — | — |
| Modulum | 32k–128k | queued | — | — | — | — |
Grok 4.3 refuses 100 % at 32k and 95.6 % at 64k on never-seen-entity questions. This is the strongest production-safety signal in the entire 7-stack comparison so far. On this probe, Grok's refusal behavior is excellent.
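The probe construction can be illustrated with a short sketch. The prompt template, the helper names, and the default entity are hypothetical; the actual probe prompts are not reproduced in this report. The key step is verifying that the probed entity really is absent from the context before asking about it.

```python
import re


def make_absent_entity_probe(context: str, absent_entity: str = "Zelda") -> str:
    """Append a question about an entity that must not occur in the context."""
    pattern = rf"\b{re.escape(absent_entity)}\b"
    if re.search(pattern, context, flags=re.IGNORECASE):
        raise ValueError(f"{absent_entity!r} appears in the context; pick another entity")
    # Hypothetical prompt layout: context followed by the probe question.
    return f"{context}\n\nQuestion: Where is {absent_entity}?"


def refusal_rate(refused: int, n: int) -> float:
    """Refusal rate in percent."""
    return 100.0 * refused / n
```

For the 64k row above, `refusal_rate(43, 45)` rounds to 95.6.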
| Stack | Cell | Errors / N | Failure mode |
|---|---|---|---|
| Grok 4.3 | qa1 128k | 1 / 50 | API error during sustained 128k sequential run |
| Grok 4.3 | qa2 64k | 2 / 50 | same |
| Grok 4.3 | qa2 128k | 1 / 50 | same |
| Grok 4.3 | qa3 64k | 3 / 50 | same |
| Grok 4.3 | qa3 128k | 8 / 50 | completed only 42/50 due to API errors |
| Modulum | qa1 128k | 3 / 200 | HTTP 503 "backend busy"; recovered via retry+backoff |
Grok has 15 backend errors across 5 cells (out of ~450 total Grok calls). Modulum has 3 errors, all on the qa1 128k cell (the 503-storm), which a retry layer absorbed. Neither stack is fault-free; Grok has a slightly higher long-context error rate in this run.
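A retry layer of the kind mentioned for the Modulum 503s can be sketched as follows. This is an illustration of exponential backoff with jitter, not the harness's actual retry code; the `BackendBusy` exception, the attempt limit, and the delay values are all assumptions.

```python
import random
import time


class BackendBusy(Exception):
    """Hypothetical error raised by the caller on an HTTP 503 response."""


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on BackendBusy with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except BackendBusy:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0.0, base_delay))
```

The `sleep` parameter is injectable so the behavior can be checked without waiting in real time.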