docs(evidence): 5g.2 LIVE re-dispatch surfaces H1 eval-batch divergence (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) by noahgift · Pull Request #1580 · paiml/aprender

noahgift · 2026-05-09T06:19:03Z

Summary

Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR #1579's populate-coverage fix applied. The data empirically confirms H1 (eval_batch degenerate) as the dominant remaining defect — H2 (populate gap) was a real fix but was NOT the root cause of the val_loss anomaly.

The smoking gun

At epoch 0 (after 100 training steps), the model has:

Metric	Value	Plausibility
train_loss	1.20 (perplexity 3.33)	✅ PLAUSIBLE for Qwen 0.5B fine-tuning on Python
val_loss	0.00081 (perplexity 1.0008)	❌ Physically IMPOSSIBLE for non-degenerate LM

1500× train/eval discrepancy at the same model state. Same kernel (fused_cross_entropy_cuda), same scaling (1.0/seq_len), same forward path. Different batches, both Python code from the same shards.
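
The plausibility column follows from simple arithmetic on the two losses. A minimal sketch, assuming both losses are mean cross-entropy in nats (the function names are hypothetical and not part of aprender):

```rust
/// Perplexity is the exponential of the mean cross-entropy loss (in nats).
fn perplexity(loss_nats: f64) -> f64 {
    loss_nats.exp()
}

/// Geometric-mean probability the model assigns to the correct token.
fn mean_token_probability(loss_nats: f64) -> f64 {
    (-loss_nats).exp()
}

/// Ratio between two losses measured at the same model state.
fn discrepancy_ratio(train_loss: f64, val_loss: f64) -> f64 {
    train_loss / val_loss
}
```

With the table's values, `perplexity(1.20)` is about 3.32, `mean_token_probability(0.00081)` is above 0.999, and `discrepancy_ratio(1.20, 0.00081)` is about 1481, which the text rounds to 1500×.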

H2 was REAL but NOT the dominant cause

Run	train_loss	val_loss	Interpretation
2026-05-08 pre-fix (PR #1578)	0.0019	0.0008	H2 + H1 compounding
2026-05-09 1-step post-fix	2.24	0.628	H2 fixed; H1 still skews val_loss
2026-05-09 500-step post-fix	1.20	0.00081	H2 fixed; H1 dominant

The PR #1579 fix moved train_loss from 0.0019 (degenerate) to 1.20 (plausible), a shift of more than 600× confirming structural completeness. But val_loss did NOT shift correspondingly: 0.0008 → 0.00075. The eval pipeline is independent of the populate gap.

Three H1 sub-hypotheses (each its own falsifier-discharge cascade)

A) logits_buf state contamination — train_batch writes gradients in-place (KAIZEN-052); eval_batch's gpu_forward may not fully overwrite, leaving stale gradients that cross_entropy reads as "logits."
B) Stream synchronization — host reads loss_partials before kernel finishes; stream.synchronize() should prevent this but a silent kernel failure could leave the buffer at zero.
C) Held-out batch label corruption — pathological structure where get_target returns same tokens as get_input. Hard to hit by accident on real Python; least likely.
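
Sub-hypothesis C is the cheapest of the three to probe on the host side. A minimal sketch, assuming the input and target token slices for one sequence are available as `&[u32]` (the helper name is hypothetical; `get_input` and `get_target` themselves are not modeled here):

```rust
/// Fraction of positions where the target token equals the input token.
/// With a correct next-token shift on varied text this should be small;
/// a value near 1.0 means the labels carry almost no information.
fn identical_label_fraction(input: &[u32], target: &[u32]) -> f64 {
    assert_eq!(input.len(), target.len(), "input and target must align");
    if input.is_empty() {
        return 0.0;
    }
    let same = input.iter().zip(target).filter(|(a, b)| a == b).count();
    same as f64 / input.len() as f64
}
```

For tokens `[5, 9, 2, 7, 7, 3]`, the shifted pair `(&tokens[..5], &tokens[1..])` scores 0.2, while an unshifted pair scores 1.0. Note that a sequence dominated by one repeated token would also score near 1.0 even with a correct shift, so a high value flags degenerate labels in general rather than a shift bug specifically.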

Why ship the evidence + contract bump but not the fix?

PR atomicity (feedback_falsifier_first_cascade_pattern.md). Each H1 sub-hypothesis is its own falsifier-discharge cascade. Shipping the audit trail NOW preserves the discovery for the next session and unblocks the operator from re-deriving it.

Contract bump

contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 → v1.1.0:

status: DRAFT → DRAFT_PARTIAL_DISCHARGE
5/6 falsifiers DISCHARGED, 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT
Promotion to ACTIVE_RUNTIME requires H1 resolved AND re-dispatch producing val_loss in 1.5-2.5 plausible range

SHIP-TWO impact

MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
MODEL-2 ship %: unchanged at 57% (still gated on honest 5g.3 verdict; this evidence is the audit trail showing why the prior numerical pass was not honest)
§50.4 cascade: COMPLETE per #1577
5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) with structurally-complete model (PR #1579) but HONEST 5g.3 verdict remains gated on H1 resolution

Test plan

pv validate contracts/apr-pretrain-init-finetune-v1.yaml — 0 errors
Documentation-only change (no Rust code, no falsifier semantics flip)
Evidence pinned at dispatch.txt (.log gitignored)

Files

contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
evidence/section-60-5g-2-redispatch-2026-05-09/
- dispatch.txt
- epoch-{000,001,002}.metadata.json
- README.md — H1/H2 hypothesis decomposition + audit
.pv/lint-previous.json (refresh)

Next steps (out-of-scope follow-ups)

PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:

Author CudaTransformerTrainer::eval_batch sanity-bound test (assert loss > 0.5 on random-init + synthetic batch)
Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict
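
The reasoning behind the sanity-bound sub-task: a randomly initialized model is close to uniform over the vocabulary, so its cross-entropy should sit near ln(vocab), far above 0.5. A minimal CPU-only sketch of that expectation (it does not call `CudaTransformerTrainer::eval_batch`; both function names are hypothetical, and the 1.5 × ln(vocab) upper bound is the one quoted in the #1581 commit message):

```rust
/// Cross-entropy (nats) of one logit row against a target index,
/// computed with a numerically stable log-sum-exp.
fn cross_entropy(logits: &[f64], target: usize) -> f64 {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let sum_exp: f64 = logits.iter().map(|&l| (l - max).exp()).sum();
    (max + sum_exp.ln()) - logits[target]
}

/// Hypothetical sanity bound: loss must exceed 0.5 and stay at or below
/// 1.5 × ln(vocab_size).
fn loss_is_plausible(loss: f64, vocab_size: usize) -> bool {
    let upper = 1.5 * (vocab_size as f64).ln();
    loss > 0.5 && loss <= upper
}
```

Uniform logits over 1000 classes give a loss of ln(1000) ≈ 6.91, which passes the bound; the observed val_loss of 0.00081 fails it.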

🤖 Generated with Claude Code

…ce (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001)

Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR #1579's populate-coverage fix applied. The data empirically confirms H1 (eval_batch degenerate) as the dominant remaining defect; H2 (populate gap) was a real fix but was NOT the root cause of the val_loss anomaly.

The smoking gun
================

At epoch 0 (after 100 training steps), the model has:

  train_loss = 1.20 (PLAUSIBLE for Qwen 0.5B fine-tuning on Python)
  val_loss = 0.00081 (perplexity 1.0008; physically IMPOSSIBLE for a non-degenerate LM)

**1500× train/eval discrepancy at the same model state.** Same kernel (`fused_cross_entropy_cuda`), same scaling (`1.0/seq_len`), same forward path (`gpu_forward` → `gpu_training.logits_buf`). Different batches but both Python code from the same shards.

H2 was REAL but NOT the dominant cause
========================================

PR #1579 fixed `MultiHeadAttention::new` to allocate Q/K/V biases when `config.use_bias=true`. The fix moved train_loss from 0.0019 (degenerate, pre-fix) to 1.20 (plausible), a shift of more than 600× confirming structural completeness.

But val_loss did NOT shift correspondingly: 0.0008 (pre-fix) → 0.00075 (post-fix). The eval pipeline returned essentially the same ~0 number both before and after the H2 fix, indicating H1 is independent of H2.

Five-Whys
=========

1. Why is val_loss=0.00075 implausibly low? The model assigns probability ≈0.9992 to every held-out token; physically impossible for an LM that hasn't seen those exact sequences.
2. Why does the same kernel produce train_loss=1.20 but val_loss=0.00075? The two share the same kernel but differ in something upstream that the kernel reads.
3. Three sub-hypotheses for "something upstream":
   A) `logits_buf` state contamination: train_batch writes gradients in-place (KAIZEN-052); eval_batch's gpu_forward may not fully overwrite, leaving stale gradients that cross_entropy reads as "logits".
   B) Stream synchronization: host reads loss_partials before kernel finishes; stream.synchronize() should prevent this but a silent kernel failure could leave the buffer at zero.
   C) Held-out batch label corruption: pathological structure where get_target returns the same tokens as get_input. Hard to hit by accident on real Python; least likely.
4. Why didn't existing falsifiers catch this? The gap is between the kernel-level contract (proven correct in unit tests on synthetic logits) and the high-level dispatch (no falsifier asserts CudaTransformerTrainer::eval_batch produces a loss in a sensible range for known input). H1 is a between-contracts gap, same class as the H2 gap PR #1579 closed.
5. Why ship the evidence + contract bump but not the fix? PR atomicity (`feedback_falsifier_first_cascade_pattern.md`). Each H1 sub-hypothesis (A/B/C) is its own falsifier-discharge cascade. Shipping the audit trail NOW preserves the discovery for the next session and unblocks the operator from re-deriving it.

Contract bump
=============

`contracts/apr-pretrain-init-finetune-v1.yaml` v1.0.0 → v1.1.0:

  status: DRAFT → DRAFT_PARTIAL_DISCHARGE

Records the 5/6 DISCHARGED + 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT state. Promotion to ACTIVE_RUNTIME requires H1 resolved AND a re-dispatch producing val_loss in 1.5-2.5 plausible range.

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (still gated on honest 5g.3 verdict; this evidence is the audit trail showing why the prior numerical pass was not honest)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) with structurally-complete model (PR #1579) but the HONEST 5g.3 verdict remains gated on H1 resolution

Quality gates (this PR)
========================

- pv validate contracts/apr-pretrain-init-finetune-v1.yaml: 0 errors
- Documentation-only change (no Rust code, no falsifier semantics flip)
- Evidence pinned at dispatch.txt (.log gitignored; renamed)

Files
=====

- contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
- evidence/section-60-5g-2-redispatch-2026-05-09/
    dispatch.txt
    epoch-{000,001,002}.metadata.json
    README.md (H1/H2 hypothesis decomposition + audit)

Out-of-scope follow-ups (each its own falsifier-discharge cascade)
=================================================================

PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:

- Author CudaTransformerTrainer::eval_batch sanity-bound test (assert loss > 0.5 on random-init + synthetic batch)
- Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
- Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… level (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) (#1581)

Adds two CUDA-gated falsifier unit tests in pretrain_real_cuda.rs::tests that probe the H1 (eval_batch degenerate) hypothesis surfaced by PR #1580's evidence (1500× train/val discrepancy at the same model state, post H2-fix).

Both tests PASS on lambda-vector RTX 4090, EMPIRICALLY FALSIFYING H1 hypothesis A (`logits_buf` train→eval state pollution at the unit-test level). The production bug must therefore be something that does NOT manifest in:

- tiny model (2 layers, hidden=64, vocab=1000)
- random-init weights (no Qwen pretrained)
- synthetic random tokens (no real Python from Qwen tokenizer)
- seq_len=16 batches
- 1 train_batch step

The 1500× discrepancy in production likely requires one of:

- real Qwen 0.5B model size + weights
- real seq_len=512 batches
- real Python tokens (specific tokenizer-vocab patterns)
- many train steps (state accumulation effects)
- an interaction not captured by unit-level reproducer

Five-Whys for landing GREEN falsifiers (rather than waiting for fix):

1. Why ship GREEN falsifiers if they don't reproduce the bug? The tests still prove H1A is FALSIFIED at unit level; that's a real positive contribution to the hypothesis decomposition even though they don't catch the actual production bug.
2. Why isn't this just "wait until you find the bug"? Per `feedback_falsifier_first_cascade_pattern.md`: 1 PR ≈ 1 falsifier discharge. The "H1A falsified at unit level" is itself a discharge. The production-level bug needs a different reproducer (probably a smaller-but-real-Qwen integration test).
3. Why two tests instead of one?
   - 001 (sanity bound): checks fresh-init eval_batch returns loss ∈ [0.5, 1.5×ln(vocab)]; catches the simplest H1 form.
   - 002 (train→eval pollution): checks eval_batch is not contaminated by train_batch's in-place gradient writeback; directly tests hypothesis A.
4. Why CUDA-gated rather than universal? `CudaTransformerTrainer::new` requires CUDA runtime. The tests run only when the operator (or a CUDA CI lane) explicitly passes `--features cuda`. Default CI sees only the `#[cfg(test)]` mod stub, so no breakage.
5. What does this NOT cover?
   - H1B (stream sync): not directly tested; would need a deliberate kernel-failure injection.
   - H1C (held-out label corruption): not tested; would need to inspect actual production held_out tokens for pathological patterns.
   - H1 at production scale: needs an integration test with real Qwen model + real tokens.

Test details

falsify_eval_batch_h1_sanity_bound:
- tiny config (vocab=1000), random init
- synthetic batch (4 × 16 tokens, LCG-deterministic)
- eval_batch returns loss ≈ ln(1000) = 6.91
- asserts loss ∈ [0.5, 1.5×ln(vocab)] = [0.5, 10.4]
- PASSED on RTX 4090

falsify_eval_batch_h1_train_pollution:
- same tiny config + random init
- two distinct synthetic batches: train_batch_data + eval_batch_data
- sequence: eval_batch(eval_data) → train_batch(train_data) → eval_batch(eval_data)
- asserts |loss_b - loss_a| / loss_a < 0.95 (1% drop allowed, 1500× drop forbidden; the production observation would correspond to ~99.93% relative drop)
- PASSED on RTX 4090

Hypothesis status update

| Sub-hypothesis | Pre-this-PR | Post-this-PR |
|---|---|---|
| H1A (logits_buf train→eval pollution) | OPEN suspected | **FALSIFIED at unit level** |
| H1B (stream synchronization) | OPEN | OPEN (not tested) |
| H1C (held-out label corruption) | OPEN | OPEN (not tested) |
| H1 at production scale | OPEN | OPEN (needs integration test) |

The H1A falsification narrows the hypothesis space. Next-cycle falsifiers should target H1B (stream sync) or H1C (held-out content) or full-scale integration with a smaller-but-real Qwen checkpoint.
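
The pollution assertion and the deterministic synthetic batch described in the test details can be sketched without CUDA. This is an illustration under stated assumptions, not the code in pretrain_real_cuda.rs: the function names are hypothetical, and the LCG constants are the common Numerical Recipes pair, chosen here only for the example.

```rust
/// Relative change |loss_b - loss_a| / loss_a between the eval loss measured
/// before (loss_a) and after (loss_b) a train step.
fn relative_change(loss_a: f64, loss_b: f64) -> f64 {
    (loss_b - loss_a).abs() / loss_a
}

/// Mirrors the quoted assertion: relative change must stay below 0.95.
fn eval_not_polluted(loss_a: f64, loss_b: f64) -> bool {
    relative_change(loss_a, loss_b) < 0.95
}

/// Deterministic synthetic tokens from a 32-bit linear congruential generator.
/// The multiplier and increment are assumptions for this sketch.
fn lcg_tokens(seed: u32, count: usize, vocab_size: u32) -> Vec<u32> {
    let mut state = seed;
    (0..count)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            // Use the higher bits, which are better distributed than the low ones.
            (state >> 16) % vocab_size
        })
        .collect()
}
```

A 1% drop such as 6.91 → 6.84 passes, while a collapse on the scale of 1.20 → 0.00081 gives a relative change of about 0.9993 and fails. A 4 × 16 batch at vocab=1000 would be `lcg_tokens(seed, 64, 1000)`, and the same seed always yields the same tokens.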
Quality gates

- pv validate (no contract change in this PR)
- cargo test -p aprender-train --features cuda --lib falsify_eval_batch_h1: 2/2 PASS on RTX 4090
- cargo test -p aprender-train --lib (default features): tests gated out, no CI breakage
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (H1 still open at production scale)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE; HONEST 5g.3 verdict still gated on H1 resolution at production scale

Out-of-scope follow-ups (each its own falsifier-discharge cascade)

- H1 at production scale: integration test with smaller-but-real Qwen checkpoint + real Python tokens.
- H1B stream-sync probe: deliberate kernel-failure injection + loss_partials-buffer state inspection.
- H1C held-out content audit: dump first 16 batches of the 5g.1 corpus for pathological patterns (low entropy, repeated tokens).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…PMAT-CODE-TOKENIZE-BPE-FORMAT-001) (#1585)

Closes the silent-`<unk>` defect class that produced SHIP-TWO §60's val_loss=0.00081 anomaly recorded in PR #1580.

ROOT CAUSE
==========

aprender-train's `BPETokenizer::to_bytes` (line 117) emits HEX-string representations: byte 'd' (0x64) → "64", byte 'e' → "65", etc. The loaded vocab.json must have these hex strings as keys for encoding to work.

`apr tokenize import-hf` (used by SHIP-TWO §54-§56 step 5g.0 to extract Qwen2.5-Coder-0.5B-Instruct's tokenizer) emits HuggingFace GPT-2 byte-level format: tokens like "Ġdef", "Ġreturn", "def" with Ġ-prefix for spaces and raw characters. **NO hex strings.**

When `apr tokenize encode-corpus` then loaded this vocab via `from_vocab_merges`, the load succeeded silently. Subsequent encoding pipeline:

1. `to_bytes("def")` → ["64", "65", "66"] (hex)
2. `apply_merges` looks up these in Qwen vocab; never found
3. `vocab.get("64")` returns None
4. Fallback to `unk_id` (line 275)
5. ALL bytes become `<unk>`

Empirical verification (this branch, lambda-vector RTX 4090):

- Direct read of /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin
- First 32K tokens (= 16 batches × 4 sequences × 513 tokens):
    99.99% token 128244 (`<unk>`)
    0.01% token 128247 (`</s>`)
    Shannon entropy: 0.001 bits / 17.21 bits theoretical max
- All 228 shards confirmed similarly degenerate (~0.003 bits each)

Five-Whys
=========

1. Why was val_loss=0.00081 implausibly low (PR #1580)? Because the trained model just learned to predict `<unk>` always, and the held-out batches were 99.99% `<unk>`. Cross-entropy on monotonous labels ≈ 0.
2. Why is the corpus 99.99% `<unk>`? Because `apr tokenize encode-corpus` silently emitted `<unk>` for every byte it couldn't find in the loaded vocab.
3. Why couldn't it find anything? Because `to_bytes` produces hex strings ("64") but the Qwen vocab uses GPT-2 byte-level format (raw chars + Ġ-prefix). Format mismatch.
4. Why did the load succeed silently? Because `from_vocab_merges` only checked structural correctness (every merged token in vocab) but NOT format consistency. The vocab format matters because `to_bytes`'s output must match vocab keys.
5. Why didn't existing falsifiers catch this? Because they're between-contracts: `apr-cli-tokenize-import-hf-v1` guarantees import is byte-correct; `pretokenize-bin-v1` guarantees output is u32 stream; but neither pins "encoder's tokenization scheme matches imported vocab's tokenization scheme." Closing that gap with this PR's fail-fast.

FIX (smallest viable, fail-fast)
=================================

In `BPETokenizer::from_vocab_merges`, after loading vocab.json, count how many of the canonical 256 hex-byte tokens "00".."ff" exist in the vocab. A legitimate hex-byte vocab from `apr tokenize train` always has all 256 (allocated during `init_vocab`). If fewer than 200 are present, the vocab is in the wrong format and the loader returns Err with FALSIFY-BPE-FORMAT-MISMATCH-001 citation, naming the cause and pointing to the canonical fix (implement Ġ-prefix encoding in a follow-up).

This is a fail-CLOSED guard: silently corrupting a corpus is worse than refusing to run. The operator now sees a clear actionable error instead of producing a 17-hour broken corpus.

LIVE EVIDENCE
=============

    $ apr tokenize encode-corpus --tokenizer /tmp/qwen-0.5b-tokenizer-extracted ...
    error: Validation failed: Cannot load tokenizer: Serialization error: FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at /tmp/qwen-0.5b-tokenizer-extracted/vocab.json contains only 36/256 canonical hex-byte tokens ("00".."ff"), below the 200 threshold. aprender-train's BPETokenizer uses HEX-BYTE format internally...

The exact Qwen vocab that produced the broken 5g.1 corpus now fails-fast on the canonical 36/256 hex-byte signature.
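
The format mismatch and the guard described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in `BPETokenizer`: the function names are hypothetical, the vocab is modeled as a plain `HashMap<String, u32>`, and lowercase hex keys are assumed from the "00".."ff" wording.

```rust
use std::collections::HashMap;

/// Hex-string encoding of each byte, as described for `to_bytes`:
/// 'd' (0x64) becomes "64".
fn to_hex_bytes(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{:02x}", b)).collect()
}

/// Counts how many of the 256 canonical hex-byte keys "00".."ff" exist.
fn count_hex_byte_tokens(vocab: &HashMap<String, u32>) -> usize {
    (0u16..256)
        .filter(|b| vocab.contains_key(&format!("{:02x}", b)))
        .count()
}

/// Fail-closed guard: reject a vocab with fewer than 200 hex-byte keys.
fn check_vocab_format(vocab: &HashMap<String, u32>) -> Result<(), String> {
    let found = count_hex_byte_tokens(vocab);
    if found < 200 {
        return Err(format!(
            "FALSIFY-BPE-FORMAT-MISMATCH-001: only {}/256 canonical hex-byte tokens",
            found
        ));
    }
    Ok(())
}
```

`to_hex_bytes("def")` yields `["64", "65", "66"]`; a GPT-2 style vocab keyed by "Ġdef" and "def" contains none of those keys, so the check returns `Err` instead of letting every lookup fall through to `<unk>`.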
Falsifier test
==============

`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`:

- Synthesizes a tiny GPT-2-style vocab.json (raw chars + Ġ-prefix, NO hex bytes) on disk
- Calls `BPETokenizer::from_vocab_merges`
- Asserts:
  - result is Err
  - error message cites "FALSIFY-BPE-FORMAT-MISMATCH-001"
  - error message mentions "hex-byte" format
  - error message names `apr tokenize import-hf` (operator diagnostic clarity)

RED on main pre-fix; GREEN with this PR.

Updated existing test
=====================

`test_bpe_from_vocab_merges_rejects_orphan_merge` was implicitly relying on a 3-token vocab; the new fail-fast fires before its orphan-merge check. Updated the test's vocab to include the 256 hex-byte alphabet so the format check passes and the orphan-merge check still fires (existing behavior preserved).

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: apr tokenize encode-corpus on Qwen vocab fails-fast with clear error (verified on lambda-vector RTX 4090)

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57%, but the path forward is now unblocked. The 5g.1 corpus is INVALID (99.99% `<unk>`); a fix for PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (Ġ-prefix encoding) would let `apr tokenize encode-corpus` produce a real Python corpus, and re-running 5g.1 + 5g.2 would produce HONEST val_loss numbers in the plausible 1.5-2.5 range.
- §50.4 cascade: COMPLETE per #1577. The bug surfaced here is upstream in tokenization, not in any §50.4 step.
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) but the CORRECT-DATA path requires PMAT-CODE-TOKENIZE-BPE-FORMAT-001 to land first.

Out-of-scope follow-ups
========================

PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (multi-PR cascade):

- Implement Ġ-prefix byte-level encoding in `BPETokenizer` (the canonical fix; ~150 LOC + tests).
- OR add a parallel `Gpt2BpeTokenizer` that aprender-train's encode-corpus dispatches to based on vocab format detection.
- Re-tokenize the 5g.1 corpus with the working encoder; verify Shannon entropy > 10 bits.
- Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip MODEL-2 ship % 57% → ≥58%.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
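
The entropy figures quoted in this commit message (0.001 bits for the broken corpus, > 10 bits as the acceptance bar after re-tokenizing) refer to the Shannon entropy of the empirical token distribution. A minimal sketch of that measurement, assuming the shard has already been decoded into a `&[u32]` token slice (the function name is hypothetical, and shard reading is not shown):

```rust
use std::collections::HashMap;

/// Shannon entropy, in bits per token, of the empirical token distribution.
fn shannon_entropy_bits(tokens: &[u32]) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    // Count occurrences of each token id.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    // H = -sum(p * log2(p)) over observed tokens.
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

A stream that is 99.99% one token and 0.01% another scores roughly 0.0015 bits, the same order as the value reported above, whereas four equally frequent tokens score exactly 2 bits.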

noahgift enabled auto-merge (squash) May 9, 2026 06:19

noahgift mentioned this pull request May 9, 2026

test(aprender-train): H1 falsifiers FALSIFY hypothesis A at unit-test level (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) #1581

Merged

4 tasks

noahgift force-pushed the docs/h1-eval-batch-cuda-divergence-evidence branch from c4aef32 to f8d1a5d Compare May 9, 2026 12:22

noahgift mentioned this pull request May 9, 2026

fix(tokenizer): fail-fast on GPT-2 byte-level vocab format mismatch (PMAT-CODE-TOKENIZE-BPE-FORMAT-001) #1585

Merged

7 tasks

noahgift added 22 commits May 10, 2026 09:11

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

055a210

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

d72cd03

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

23a56da

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

a35e152

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

7333b66

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

bdb5b8d

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

50df55e

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

03a522c

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

68e2713

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

632a2fa

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

e32dd6c

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

b9f0cc8

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

c18e9a2

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

08554b1

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

aa40642

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

86b63b9

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

9d32cdd

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

2570867

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

5e7d042

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

dec858c

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

3f88536

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

83175f4

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

89b9f84

Merge branch 'main' into docs/h1-eval-batch-cuda-divergence-evidence

eea99cf

noahgift merged commit f7a18c0 into main May 13, 2026
10 checks passed

noahgift deleted the docs/h1-eval-batch-cuda-divergence-evidence branch May 13, 2026 13:29

noahgift merged 25 commits into main from docs/h1-eval-batch-cuda-divergence-evidence
