
docs(evidence): 5g.2 LIVE re-dispatch surfaces H1 eval-batch divergence (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) #1580

Merged
noahgift merged 25 commits into main from docs/h1-eval-batch-cuda-divergence-evidence on May 13, 2026

@noahgift noahgift commented May 9, 2026


Summary

Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR #1579's populate-coverage fix applied. The data empirically confirms H1 (eval_batch degenerate) as the dominant remaining defect — H2 (populate gap) was a real fix but was NOT the root cause of the val_loss anomaly.

The smoking gun

At epoch 0 (after 100 training steps), the model has:

| Metric | Value | Plausibility |
|---|---|---|
| train_loss | 1.20 (perplexity 3.33) | ✅ PLAUSIBLE for Qwen 0.5B fine-tuning on Python |
| val_loss | 0.00081 (perplexity 1.0008) | ❌ Physically IMPOSSIBLE for non-degenerate LM |

1500× train/eval discrepancy at the same model state. Same kernel (fused_cross_entropy_cuda), same scaling (1.0/seq_len), same forward path. Different batches, both Python code from the same shards.
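The figures above follow from two standard relations, assuming the reported loss is a mean per-token cross-entropy in nats: perplexity is `exp(loss)`, and the geometric-mean probability assigned to the correct token is `exp(-loss)`. The sketch below reproduces that arithmetic; the function names are illustrative and are not taken from the codebase.

```rust
/// Perplexity implied by a mean per-token cross-entropy loss (in nats).
fn perplexity(loss_nats: f64) -> f64 {
    loss_nats.exp()
}

/// Geometric-mean probability the model assigns to the correct token.
fn mean_token_probability(loss_nats: f64) -> f64 {
    (-loss_nats).exp()
}

/// Ratio between two losses measured at the same model state.
fn loss_ratio(train_loss: f64, val_loss: f64) -> f64 {
    train_loss / val_loss
}
```

With the reported values, `loss_ratio(1.20, 0.00081)` is about 1481, which is the "1500×" quoted above after rounding, and `mean_token_probability(0.00081)` is about 0.9992.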

H2 was REAL but NOT the dominant cause

| Run | train_loss | val_loss | Interpretation |
|---|---|---|---|
| 2026-05-08 pre-fix (PR #1578) | 0.0019 | 0.0008 | H2 + H1 compounding |
| 2026-05-09 1-step post-fix | 2.24 | 0.628 | H2 fixed; H1 still skews val_loss |
| 2026-05-09 500-step post-fix | 1.20 | 0.00081 | H2 fixed; H1 dominant |

The PR #1579 fix moved train_loss from 0.0019 (degenerate) to 1.20 (plausible), a roughly 630× shift confirming structural completeness. But val_loss did NOT shift correspondingly: 0.0008 → 0.00075. The eval pipeline is independent of the populate gap.

Three H1 sub-hypotheses (each its own falsifier-discharge cascade)

  • A) logits_buf state contamination: train_batch writes gradients in-place (KAIZEN-052); eval_batch's gpu_forward may not fully overwrite, leaving stale gradients that cross_entropy reads as "logits."
  • B) Stream synchronization — host reads loss_partials before kernel finishes; stream.synchronize() should prevent this but a silent kernel failure could leave the buffer at zero.
  • C) Held-out batch label corruption — pathological structure where get_target returns same tokens as get_input. Hard to hit by accident on real Python; least likely.
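Sub-hypothesis C can be probed without touching the GPU by comparing input and target token slices directly. The following is a minimal sketch under the assumption that inputs and targets are available as plain `u32` slices; `fraction_identical` and `shifted_pair` are hypothetical helpers, not the project's `get_input` / `get_target`.

```rust
/// Fraction of positions where the target token equals the input token
/// at the same index. A correctly shifted next-token target on varied text
/// should score low; an unshifted copy scores 1.0.
fn fraction_identical(input: &[u32], target: &[u32]) -> f64 {
    let n = input.len().min(target.len());
    if n == 0 {
        return 0.0;
    }
    let same = input
        .iter()
        .zip(target.iter())
        .filter(|(a, b)| a == b)
        .count();
    same as f64 / n as f64
}

/// Builds a next-token (input, target) pair: target[i] = tokens[i + 1].
fn shifted_pair(tokens: &[u32]) -> (Vec<u32>, Vec<u32>) {
    if tokens.len() < 2 {
        return (Vec::new(), Vec::new());
    }
    (tokens[..tokens.len() - 1].to_vec(), tokens[1..].to_vec())
}
```

Note that a sequence dominated by one repeated token also scores near 1.0 even when the shift is correct, so a high value flags either an unshifted target or low-diversity data.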

Why ship the evidence + contract bump but not the fix?

PR atomicity (feedback_falsifier_first_cascade_pattern.md). Each H1 sub-hypothesis is its own falsifier-discharge cascade. Shipping the audit trail NOW preserves the discovery for the next session and unblocks the operator from re-deriving it.

Contract bump

contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 → v1.1.0:

  • status: DRAFT → DRAFT_PARTIAL_DISCHARGE
  • 5/6 falsifiers DISCHARGED, 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT
  • Promotion to ACTIVE_RUNTIME requires H1 resolved AND re-dispatch producing val_loss in 1.5-2.5 plausible range

SHIP-TWO impact

Test plan

  • pv validate contracts/apr-pretrain-init-finetune-v1.yaml — 0 errors
  • Documentation-only change (no Rust code, no falsifier semantics flip)
  • Evidence pinned at dispatch.txt (.log gitignored)

Files

  • contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
  • evidence/section-60-5g-2-redispatch-2026-05-09/
    • dispatch.txt
    • epoch-{000,001,002}.metadata.json
    • README.md — H1/H2 hypothesis decomposition + audit
  • .pv/lint-previous.json (refresh)

Next steps (out-of-scope follow-ups)

PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:

  • Author CudaTransformerTrainer::eval_batch sanity-bound test (assert loss > 0.5 on random-init + synthetic batch)
  • Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
  • Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict
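As a reference point for the proposed sanity-bound test: a model that predicts uniformly over the vocabulary has a cross-entropy of exactly ln(vocab), so a random-init loss should sit near that value and far above 0.5. The CPU sketch below illustrates this; it is not the CUDA kernel, and `cross_entropy` and `loss_is_plausible` are hypothetical names.

```rust
/// Cross-entropy (in nats) for one position, computed from raw logits
/// with a numerically stable log-sum-exp.
fn cross_entropy(logits: &[f64], target: usize) -> f64 {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let sum_exp: f64 = logits.iter().map(|l| (l - max).exp()).sum();
    let log_sum_exp = max + sum_exp.ln();
    log_sum_exp - logits[target]
}

/// Hypothetical lower-bound check: the loss must be finite and above a floor.
fn loss_is_plausible(loss: f64, floor: f64) -> bool {
    loss.is_finite() && loss > floor
}
```

For all-zero logits over a vocabulary of 1000, `cross_entropy` returns ln(1000) ≈ 6.91, which passes a 0.5 floor, whereas the observed 0.00081 does not.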

🤖 Generated with Claude Code

…ce (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001)

Records the post-fix LIVE 500-step re-dispatch on RTX 4090 with PR #1579's
populate-coverage fix applied. The data empirically confirms
H1 (eval_batch degenerate) as the dominant remaining defect; H2
(populate gap) was a real fix but was NOT the root cause of the
val_loss anomaly.

The smoking gun
================

At epoch 0 (after 100 training steps), the model has:
  train_loss = 1.20    (PLAUSIBLE for Qwen 0.5B fine-tuning on Python)
  val_loss   = 0.00081 (perplexity 1.0008 — physically IMPOSSIBLE for
                        a non-degenerate LM)

**1500× train/eval discrepancy at the same model state.** Same
kernel (`fused_cross_entropy_cuda`), same scaling (`1.0/seq_len`),
same forward path (`gpu_forward` → `gpu_training.logits_buf`).
Different batches but both Python code from the same shards.

H2 was REAL but NOT the dominant cause
========================================

PR #1579 fixed `MultiHeadAttention::new` to allocate Q/K/V biases
when `config.use_bias=true`. The fix moved train_loss from 0.0019
(degenerate, pre-fix) to 1.20 (plausible), a roughly 630× shift
confirming structural completeness.

But val_loss did NOT shift correspondingly: 0.0008 (pre-fix) →
0.00075 (post-fix). The eval pipeline returned essentially the same
~0 number both before and after the H2 fix, indicating H1 is
independent of H2.

Five-Whys
=========

1. Why is val_loss=0.00075 implausibly low? The model assigns
   probability ≈0.9992 to every held-out token; physically
   impossible for an LM that hasn't seen those exact sequences.
2. Why does the same kernel produce train_loss=1.20 but val_loss=0.00075?
   The two share the same kernel but differ in something upstream
   that the kernel reads.
3. Three sub-hypotheses for "something upstream":
   A) `logits_buf` state contamination — train_batch writes
      gradients in-place (KAIZEN-052); eval_batch's gpu_forward
      may not fully overwrite, leaving stale gradients that
      cross_entropy reads as "logits".
   B) Stream synchronization — host reads loss_partials before
      kernel finishes; stream.synchronize() should prevent this
      but a silent kernel failure could leave the buffer at zero.
   C) Held-out batch label corruption — pathological structure
      where get_target returns same tokens as get_input. Hard
      to hit by accident on real Python; least likely.
4. Why didn't existing falsifiers catch this? The gap is between
   the kernel-level contract (proven correct in unit tests on
   synthetic logits) and the high-level dispatch (no falsifier
   asserts CudaTransformerTrainer::eval_batch produces a loss in
   a sensible range for known input). H1 is a between-contracts
   gap, same class as the H2 gap PR #1579 closed.
5. Why ship the evidence + contract bump but not the fix? PR
   atomicity (`feedback_falsifier_first_cascade_pattern.md`).
   Each H1 sub-hypothesis (A/B/C) is its own falsifier-discharge
   cascade. Shipping the audit trail NOW preserves the discovery
   for the next session and unblocks the operator from re-deriving
   it.

Contract bump
=============

`contracts/apr-pretrain-init-finetune-v1.yaml` v1.0.0 → v1.1.0:
  status: DRAFT → DRAFT_PARTIAL_DISCHARGE
  Records the 5/6 DISCHARGED + 1/6 NUMERICALLY-PASSED-METHODOLOGY-SUSPECT
  state. Promotion to ACTIVE_RUNTIME requires H1 resolved AND a
  re-dispatch producing val_loss in 1.5-2.5 plausible range.

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (still gated on honest 5g.3
  verdict; this evidence is the audit trail showing why the prior
  numerical pass was not honest)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) with
  structurally-complete model (PR #1579) but the HONEST 5g.3
  verdict remains gated on H1 resolution

Quality gates (this PR)
========================

- pv validate contracts/apr-pretrain-init-finetune-v1.yaml: 0 errors
- Documentation-only change (no Rust code, no falsifier semantics flip)
- Evidence pinned at dispatch.txt (.log gitignored; renamed)

Files
=====

- contracts/apr-pretrain-init-finetune-v1.yaml (v1.0.0 → v1.1.0)
- evidence/section-60-5g-2-redispatch-2026-05-09/
    dispatch.txt
    epoch-{000,001,002}.metadata.json
    README.md (H1/H2 hypothesis decomposition + audit)

Out-of-scope follow-ups (each its own falsifier-discharge cascade)
=================================================================

PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001 sub-tasks:
  - Author CudaTransformerTrainer::eval_batch sanity-bound test
    (assert loss > 0.5 on random-init + synthetic batch)
  - Bisect H1 sub-hypotheses A/B/C with targeted instrumentation
  - Fix root cause; re-dispatch 5g.2 for honest 5g.3 verdict

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/h1-eval-batch-cuda-divergence-evidence branch from c4aef32 to f8d1a5d Compare May 9, 2026 12:22
noahgift added a commit that referenced this pull request May 9, 2026
… level (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) (#1581)

Adds two CUDA-gated falsifier unit tests in pretrain_real_cuda.rs::tests
that probe the H1 (eval_batch degenerate) hypothesis surfaced by
PR #1580's evidence (1500× train/val discrepancy at the same model
state, post H2-fix).

Both tests PASS on lambda-vector RTX 4090, EMPIRICALLY FALSIFYING
H1 hypothesis A (`logits_buf` train→eval state pollution at the
unit-test level). The production bug must therefore be something
that does NOT manifest in:
  - tiny model (2 layers, hidden=64, vocab=1000)
  - random-init weights (no Qwen pretrained)
  - synthetic random tokens (no real Python from Qwen tokenizer)
  - seq_len=16 batches
  - 1 train_batch step

The 1500× discrepancy in production likely requires one of:
  - real Qwen 0.5B model size + weights
  - real seq_len=512 batches
  - real Python tokens (specific tokenizer-vocab patterns)
  - many train steps (state accumulation effects)
  - an interaction not captured by unit-level reproducer

Five-Whys for landing GREEN falsifiers (rather than waiting for fix):

1. Why ship GREEN falsifiers if they don't reproduce the bug?
   The tests still prove H1A is FALSIFIED at unit level — that's
   a real positive contribution to the hypothesis decomposition
   even though they don't catch the actual production bug.
2. Why isn't this just "wait until you find the bug"?
   Per `feedback_falsifier_first_cascade_pattern.md`: 1 PR ≈ 1
   falsifier discharge. The "H1A falsified at unit level" is
   itself a discharge. The production-level bug needs a different
   reproducer (probably a smaller-but-real-Qwen integration test).
3. Why two tests instead of one?
   - 001 (sanity bound) — checks fresh-init eval_batch returns
     loss ∈ [0.5, 1.5×ln(vocab)]; catches the simplest H1 form.
   - 002 (train→eval pollution) — checks eval_batch is not
     contaminated by train_batch's in-place gradient writeback;
     directly tests hypothesis A.
4. Why CUDA-gated rather than universal?
   `CudaTransformerTrainer::new` requires CUDA runtime. The tests
   run only when the operator (or a CUDA CI lane) explicitly passes
   `--features cuda`. Default CI sees only the `#[cfg(test)]` mod
   stub, so no breakage.
5. What does this NOT cover?
   - H1B (stream sync) — not directly tested; would need a
     deliberate kernel-failure injection.
   - H1C (held-out label corruption) — not tested; would need to
     inspect actual production held_out tokens for pathological
     patterns.
   - H1 at production scale — needs an integration test with real
     Qwen model + real tokens.

Test details

falsify_eval_batch_h1_sanity_bound:
  - tiny config (vocab=1000), random init
  - synthetic batch (4 × 16 tokens, LCG-deterministic)
  - eval_batch returns loss ≈ ln(1000) = 6.91
  - asserts loss ∈ [0.5, 1.5×ln(vocab)] = [0.5, 10.4]
  - PASSED on RTX 4090

falsify_eval_batch_h1_train_pollution:
  - same tiny config + random init
  - two distinct synthetic batches: train_batch_data + eval_batch_data
  - sequence: eval_batch(eval_data) → train_batch(train_data) → eval_batch(eval_data)
  - asserts |loss_b - loss_a| / loss_a < 0.95 (any relative change
    below 95% passes; a 1500× drop, which is what the production
    observation corresponds to at ~99.93% relative drop, fails)
  - PASSED on RTX 4090
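The two assertions described above reduce to simple numeric predicates. The sketch below shows them in isolation, assuming losses in nats; the function names are illustrative and the real tests call `eval_batch` / `train_batch` on a CUDA trainer rather than taking losses as arguments.

```rust
/// Sanity bound from test 001: loss must lie in [0.5, 1.5 * ln(vocab)].
fn within_sanity_bound(loss: f64, vocab_size: usize) -> bool {
    let upper = 1.5 * (vocab_size as f64).ln();
    (0.5..=upper).contains(&loss)
}

/// Relative change |after - before| / before.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before).abs() / before
}

/// Pollution check from test 002: the eval loss measured after one
/// train step must stay within 95% of the loss measured before it.
fn pollution_check_passes(before: f64, after: f64) -> bool {
    relative_change(before, after) < 0.95
}
```

For vocab=1000 the upper bound is 1.5 × 6.908 ≈ 10.36. A 1500× collapse gives a relative change of 1 − 1/1500 ≈ 0.9993, which exceeds 0.95 and would fail the pollution check.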

Hypothesis status update

| Sub-hypothesis | Pre-this-PR | Post-this-PR |
|---|---|---|
| H1A (logits_buf train→eval pollution) | OPEN suspected | **FALSIFIED at unit level** |
| H1B (stream synchronization) | OPEN | OPEN (not tested) |
| H1C (held-out label corruption) | OPEN | OPEN (not tested) |
| H1 at production scale | OPEN | OPEN (needs integration test) |

The H1A falsification narrows the hypothesis space. Next-cycle
falsifiers should target H1B (stream sync) or H1C (held-out
content) or full-scale integration with a smaller-but-real Qwen
checkpoint.

Quality gates

- pv validate (no contract change in this PR)
- cargo test -p aprender-train --features cuda --lib falsify_eval_batch_h1: 2/2 PASS on RTX 4090
- cargo test -p aprender-train --lib (default features): tests gated out, no CI breakage
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% (H1 still open at production scale)
- §50.4 cascade: COMPLETE per #1577
- 5g.2 dispatch: OPERATOR-RUNNABLE; HONEST 5g.3 verdict still
  gated on H1 resolution at production scale

Out-of-scope follow-ups (each its own falsifier-discharge cascade)

- H1 at production scale: integration test with smaller-but-real
  Qwen checkpoint + real Python tokens.
- H1B stream-sync probe: deliberate kernel-failure injection +
  loss_partials-buffer state inspection.
- H1C held-out content audit: dump first 16 batches of the 5g.1
  corpus for pathological patterns (low entropy, repeated tokens).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added 22 commits May 10, 2026 09:11
noahgift added a commit that referenced this pull request May 13, 2026
…PMAT-CODE-TOKENIZE-BPE-FORMAT-001) (#1585)

Closes the silent-`<unk>` defect class that produced SHIP-TWO §60's
val_loss=0.00081 anomaly recorded in PR #1580.

ROOT CAUSE
==========

aprender-train's `BPETokenizer::to_bytes` (line 117) emits HEX-string
representations: byte 'd' (0x64) → "64", byte 'e' → "65", etc. The
loaded vocab.json must have these hex strings as keys for encoding
to work.

`apr tokenize import-hf` (used by SHIP-TWO §54-§56 step 5g.0 to
extract Qwen2.5-Coder-0.5B-Instruct's tokenizer) emits HuggingFace
GPT-2 byte-level format: tokens like "Ġdef", "Ġreturn", "def" with
Ġ-prefix for spaces and raw characters. **NO hex strings.**

When `apr tokenize encode-corpus` then loaded this vocab via
`from_vocab_merges`, the load succeeded silently. Subsequent
encoding pipeline:
  1. `to_bytes("def")` → ["64", "65", "66"] (hex)
  2. `apply_merges` looks up these in Qwen vocab — never found
  3. `vocab.get("64")` returns None
  4. Fallback to `unk_id` (line 275)
  5. ALL bytes become `<unk>`
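The failure path above can be reproduced in a few lines. This is a deliberately simplified sketch: it skips merges entirely and does one vocab lookup per byte, and `to_hex_bytes` / `encode_bytes` are hypothetical stand-ins for `BPETokenizer::to_bytes` and the real encoder, whose actual signatures are not shown in this PR.

```rust
use std::collections::HashMap;

/// Hex-string form of each UTF-8 byte, e.g. "def" -> ["64", "65", "66"].
fn to_hex_bytes(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{:02x}", b)).collect()
}

/// Simplified encoder: no merges, one vocab lookup per hex-byte token,
/// falling back to `unk_id` when the key is missing.
fn encode_bytes(vocab: &HashMap<String, u32>, text: &str, unk_id: u32) -> Vec<u32> {
    to_hex_bytes(text)
        .iter()
        .map(|tok| vocab.get(tok).copied().unwrap_or(unk_id))
        .collect()
}
```

Given a GPT-2 style vocab whose keys are strings such as "Ġdef" and "def", every lookup for "64", "65", "66" misses and the whole input maps to `unk_id`, with no error raised.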

Empirical verification (this branch, lambda-vector RTX 4090):
  - Direct read of /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen/shard-00000.bin
  - First 32K tokens (= 16 batches × 4 sequences × 513 tokens):
      99.99% token 128244 (`<unk>`)
      0.01% token 128247 (`</s>`)
      Shannon entropy: 0.001 bits / 17.21 bits theoretical max
  - All 228 shards confirmed similarly degenerate (~0.003 bits each)
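The entropy figures above can be checked with a short empirical-entropy routine over a token-ID stream. This is a sketch assuming the shard has already been read into a `u32` slice; the function names are illustrative, and reading the `.bin` file is left out.

```rust
use std::collections::HashMap;

/// Empirical Shannon entropy (bits per token) of a token-ID stream.
fn shannon_entropy_bits(tokens: &[u32]) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tokens.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Maximum possible entropy (bits) for a vocabulary of the given size.
fn max_entropy_bits(vocab_size: usize) -> f64 {
    (vocab_size as f64).log2()
}
```

A stream that is 99.99% one token and 0.01% another comes out at roughly 0.0015 bits, the same order as the 0.001 bits reported for the shard.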

Five-Whys
=========

1. Why was val_loss=0.00081 implausibly low (PR #1580)? Because the
   trained model simply learned to always predict `<unk>`, and the
   held-out batches were 99.99% `<unk>`. Cross-entropy on
   near-constant labels is ≈ 0.
2. Why is the corpus 99.99% `<unk>`? Because `apr tokenize
   encode-corpus` silently emitted `<unk>` for every byte it
   couldn't find in the loaded vocab.
3. Why couldn't it find anything? Because `to_bytes` produces hex
   strings ("64") but the Qwen vocab uses GPT-2 byte-level format
   (raw chars + Ġ-prefix). Format mismatch.
4. Why did the load succeed silently? Because `from_vocab_merges`
   only checked structural correctness (every merged token in
   vocab) but NOT format consistency. The vocab format matters
   because `to_bytes`'s output must match vocab keys.
5. Why didn't existing falsifiers catch this? Because the defect
   sits between contracts: `apr-cli-tokenize-import-hf-v1` guarantees
   the import is byte-correct and `pretokenize-bin-v1` guarantees the
   output is a u32 stream, but neither pins "encoder's tokenization
   scheme matches imported vocab's tokenization scheme." This PR's
   fail-fast check closes that gap.

FIX (smallest viable, fail-fast)
=================================

In `BPETokenizer::from_vocab_merges`, after loading vocab.json,
count how many of the canonical 256 hex-byte tokens "00".."ff"
exist in the vocab. A legitimate hex-byte vocab from `apr tokenize
train` always has all 256 (allocated during `init_vocab`). If
fewer than 200 are present, the vocab is in the wrong format
and the loader returns Err with FALSIFY-BPE-FORMAT-MISMATCH-001
citation, naming the cause and pointing to the canonical fix
(implement Ġ-prefix encoding in a follow-up).

This is a fail-CLOSED guard: silently corrupting a corpus is
worse than refusing to run. The operator now sees a clear actionable
error instead of producing a 17-hour broken corpus.
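The guard described above amounts to counting canonical hex-byte keys and refusing to proceed below the threshold. The sketch below shows that logic on a bare set of vocab keys; it is an assumption-laden illustration, not the code in `BPETokenizer::from_vocab_merges`, and the function names and error wording are hypothetical apart from the FALSIFY-BPE-FORMAT-MISMATCH-001 identifier and the 200 threshold.

```rust
use std::collections::HashSet;

/// Counts how many of the 256 canonical hex-byte tokens "00".."ff"
/// are present as vocab keys.
fn count_hex_byte_tokens(vocab: &HashSet<String>) -> usize {
    (0u16..256)
        .filter(|b| vocab.contains(&format!("{:02x}", b)))
        .count()
}

/// Fail-closed format check: returns Err when fewer than 200 of the
/// 256 hex-byte tokens are present.
fn check_hex_byte_format(vocab: &HashSet<String>) -> Result<(), String> {
    const THRESHOLD: usize = 200;
    let found = count_hex_byte_tokens(vocab);
    if found < THRESHOLD {
        return Err(format!(
            "FALSIFY-BPE-FORMAT-MISMATCH-001: vocab contains only {}/256 \
             canonical hex-byte tokens, below the {} threshold",
            found, THRESHOLD
        ));
    }
    Ok(())
}
```

A threshold below 256 tolerates a vocab that is missing a few byte tokens, while still rejecting a GPT-2 style vocab that only happens to contain a handful of two-character hex-looking strings.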

LIVE EVIDENCE
=============

  $ apr tokenize encode-corpus --tokenizer /tmp/qwen-0.5b-tokenizer-extracted ...

  error: Validation failed: Cannot load tokenizer: Serialization error:
  FALSIFY-BPE-FORMAT-MISMATCH-001: vocab.json at
  /tmp/qwen-0.5b-tokenizer-extracted/vocab.json contains only 36/256
  canonical hex-byte tokens ("00".."ff"), below the 200 threshold.
  aprender-train's BPETokenizer uses HEX-BYTE format internally...

The exact Qwen vocab that produced the broken 5g.1 corpus now
fails-fast on the canonical 36/256 hex-byte signature.

Falsifier test
==============

`falsify_bpe_format_mismatch_gpt2_vocab_load_fails_fast`:
  - Synthesizes a tiny GPT-2-style vocab.json (raw chars + Ġ-prefix,
    NO hex bytes) on disk
  - Calls `BPETokenizer::from_vocab_merges`
  - Asserts:
    - result is Err
    - error message cites "FALSIFY-BPE-FORMAT-MISMATCH-001"
    - error message mentions "hex-byte" format
    - error message names `apr tokenize import-hf` (operator
      diagnostic clarity)

RED on main pre-fix; GREEN with this PR.

Updated existing test
=====================

`test_bpe_from_vocab_merges_rejects_orphan_merge` was implicitly
relying on a 3-token vocab; the new fail-fast fires before its
orphan-merge check. Updated the test's vocab to include the 256
hex-byte alphabet so the format check passes and the orphan-merge
check still fires (existing behavior preserved).

Quality gates (all green)
==========================

- cargo test -p aprender-train --lib: 7585/7585 PASS (was 7584; +1 falsifier)
- cargo test -p aprender-train --lib bpe_from_vocab_merges: 2/2 PASS
- cargo test -p aprender-train --lib falsify_bpe_format_mismatch: 1/1 PASS
- cargo clippy -p aprender-train --lib -- -D warnings: clean
- cargo check --workspace: clean
- rustfmt --check: clean
- LIVE: apr tokenize encode-corpus on Qwen vocab fails-fast with
  clear error (verified on lambda-vector RTX 4090)

SHIP-TWO impact
================

- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
- MODEL-2 ship %: unchanged at 57% — but the path forward is now
  unblocked. The 5g.1 corpus is INVALID (99.99% `<unk>`); a fix
  for PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (Ġ-prefix encoding) would
  let `apr tokenize encode-corpus` produce a real Python corpus,
  and re-running 5g.1 + 5g.2 would produce HONEST val_loss
  numbers in the plausible 1.5-2.5 range.
- §50.4 cascade: COMPLETE per #1577. The bug surfaced here is
  upstream in tokenization, not in any §50.4 step.
- 5g.2 dispatch: OPERATOR-RUNNABLE end-to-end (PR #1577) but the
  CORRECT-DATA path requires PMAT-CODE-TOKENIZE-BPE-FORMAT-001
  to land first.

Out-of-scope follow-ups
========================

PMAT-CODE-TOKENIZE-BPE-FORMAT-001 (multi-PR cascade):
  - Implement Ġ-prefix byte-level encoding in `BPETokenizer` (the
    canonical fix; ~150 LOC + tests).
  - OR add a parallel `Gpt2BpeTokenizer` that aprender-train's
    encode-corpus dispatches to based on vocab format detection.
  - Re-tokenize the 5g.1 corpus with the working encoder; verify
    Shannon entropy > 10 bits.
  - Re-dispatch 5g.2 LIVE; obtain honest val_loss verdict; flip
    MODEL-2 ship % 57% → ≥58%.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit f7a18c0 into main May 13, 2026
10 checks passed
@noahgift noahgift deleted the docs/h1-eval-batch-cuda-divergence-evidence branch May 13, 2026 13:29