docs(evidence): 5g.2 LIVE 500-step smoke + methodology audit (2026-05-09)#1578

Merged
noahgift merged 2 commits into main from docs/evidence-5g-2-live-smoke on May 9, 2026
noahgift commented May 9, 2026

Summary

Records the first end-to-end LIVE 5g.2 dispatch enabled by §50.4 step 5f.5 (PR #1577 in flight).

The wireup itself works (forward + backward + AdamW + checkpoint write all run on RTX 4090); the val_loss numerical result is recorded with an honest methodology audit per feedback_test_methodology_can_fake_bugs.md.

MODEL-2 ship % stays at 57%. It is NOT flipped to ≥58% because val_loss=0.0008 is methodologically suspect: FALSIFY-005 of contract apr-pretrain-init-finetune-v1.yaml v1.0.0 is recorded as NUMERICALLY-PASSED-METHODOLOGY-SUSPECT, not DISCHARGED.

What this evidence proves

  • ✅ apr pretrain --init Qwen.apr --device cuda runs end-to-end on RTX 4090 (FALSIFY-001 exit 0)
  • ✅ Wall ~40s for 300 steps batch=4 seq=512 (FALSIFY-002 wall ≤ 3600s)
  • ✅ Checkpoint serializes as valid APR v2 with passing checksum (FALSIFY-004 magic bytes)
  • ✅ No CUDA errors during run (FALSIFY-006 GPU resource health)
  • ❓ FALSIFY-005 val_loss < 9.38: numerical pass (0.0008 ≪ 9.38) but methodology suspect
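
The FALSIFY-005 outcome above is a three-way verdict rather than a plain pass/fail. A minimal sketch of that idea, assuming a hypothetical `Verdict` enum and `judge_val_loss` helper (neither name comes from the repository). The 9.38 threshold is the one stated for FALSIFY-005; the plausibility floor is an illustrative assumption that a caller would have to choose:

```rust
/// Hypothetical three-way verdict for a numeric falsifier gate.
/// These names are illustrative and do not come from the contract YAML.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Discharged,
    NumericallyPassedMethodologySuspect,
    Failed,
}

/// Judge a validation loss against a pass threshold and a plausibility floor.
/// A loss below `plausible_floor` passes numerically but is flagged as suspect.
pub fn judge_val_loss(val_loss: f64, threshold: f64, plausible_floor: f64) -> Verdict {
    if !val_loss.is_finite() || val_loss >= threshold {
        // NaN, infinity, or a loss at or above the threshold fails the gate.
        Verdict::Failed
    } else if val_loss < plausible_floor {
        // Passes the number, but is too good to trust without an audit.
        Verdict::NumericallyPassedMethodologySuspect
    } else {
        Verdict::Discharged
    }
}
```

With an assumed floor of 1.0, `judge_val_loss(0.0008, 9.38, 1.0)` yields the suspect verdict, while a loss in the 2.0-3.0 range would be discharged.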

Why FALSIFY-005 is NOT recorded as DISCHARGED

| Hypothesis | Symptom | Why suspect |
| --- | --- | --- |
| H1: CudaTransformerTrainer::eval_batch returns degenerate loss | val_loss collapses to ~zero regardless of input | CUDA eval path may compute differently than CPU path |
| H2: populate_trainer_from_init_tensors silently drops 71/290 Qwen tensors | Saved checkpoint has 219/290 init tensors | name-mismatch between Transformer struct and qwen2 APR |
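
H2 comes down to a set difference between the tensor names in the init file and those in the saved checkpoint (290 − 219 = 71 missing). A minimal sketch of such a coverage check, assuming a hypothetical `missing_tensors` helper that takes both name lists as plain string slices. How the names are actually read from an APR file is not shown, and this is not the repository's API:

```rust
use std::collections::BTreeSet;

/// Names present in the init file but absent from the saved checkpoint.
/// Hypothetical helper: both lists are assumed to be extracted elsewhere.
pub fn missing_tensors<'a>(expected: &[&'a str], saved: &[&'a str]) -> Vec<&'a str> {
    // Index the saved names for fast membership tests.
    let saved: BTreeSet<&str> = saved.iter().copied().collect();
    // Keep the expected names that never made it into the checkpoint,
    // preserving the order of `expected`.
    expected
        .iter()
        .copied()
        .filter(|name| !saved.contains(name))
        .collect()
}
```

A follow-up falsifier could then assert that the returned list is empty, or print it to show whether the gap is a naming mismatch or a real drop.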

Industry baseline: SmolLM-360M on 1T tokens trains to val_loss ~2.9; Qwen2.5-Coder-0.5B on Python is typically ~2.0-3.0. A 300-step fine-tune cannot lower val_loss to 0.0008 unless the eval is degenerate or held-out is leaked.
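
To make the implausibility concrete, the loss can be converted into perplexity and into the average probability assigned to the correct token. The sketch below assumes val_loss is a mean per-token cross-entropy in nats, which the source does not state explicitly. The function names are illustrative:

```rust
/// Perplexity implied by a mean per-token cross-entropy loss.
/// Assumes the loss is in natural-log units (nats).
pub fn perplexity(loss_nats: f64) -> f64 {
    loss_nats.exp()
}

/// Geometric-mean probability the model assigns to the correct token.
pub fn mean_correct_token_prob(loss_nats: f64) -> f64 {
    (-loss_nats).exp()
}
```

Under that assumption, a loss of 2.0-3.0 corresponds to a perplexity of roughly 7.4-20.1, while 0.0008 corresponds to a perplexity of about 1.0008. That would mean the model assigns about 99.92% probability to every held-out token on average.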

Per feedback_test_methodology_can_fake_bugs.md: single-statistic gates need shape verification before trust.
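
One possible shape check for H1 is to evaluate the same path on real targets and on deliberately scrambled ones: a sound eval should report a clearly higher loss for scrambled targets, while a degenerate one returns nearly the same value for both. This is a sketch under stated assumptions. `eval_looks_degenerate` and its closure argument are hypothetical and do not reflect the signature of `CudaTransformerTrainer::eval_batch`:

```rust
/// Returns true when the loss barely moves between real and scrambled targets,
/// the signature of an eval path that ignores its input.
/// `eval_loss` is a stand-in for whatever computes the loss on a token batch.
pub fn eval_looks_degenerate<F>(
    eval_loss: F,
    real: &[u32],
    scrambled: &[u32],
    min_gap: f64,
) -> bool
where
    F: Fn(&[u32]) -> f64,
{
    let gap = (eval_loss(scrambled) - eval_loss(real)).abs();
    // A non-finite gap (NaN or infinity) is also treated as degenerate.
    !gap.is_finite() || gap < min_gap
}
```

The `min_gap` value is a judgment call. The point is only that a single val_loss number is checked against a second measurement whose expected shape is known in advance.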

Five-Whys (methodology decision)

  1. Why not record FALSIFY-005 as DISCHARGED? Industry-baseline val_loss for 0.5B on Python is ~2.0-3.0; reaching 0.0008 in 300 steps is empirically implausible.
  2. Why two hypotheses? The saved checkpoint has 219 tensors; canonical Qwen 0.5B APR has 290. Independently, the loss collapse-to-zero shape suggests a degenerate eval_batch path.
  3. Why not investigate H1 + H2 in this PR? PR atomicity: PR #1577 (feat: §50.4 step 5f.5 CUDA --init wireup, PMAT-CODE-PRETRAIN-INIT-CUDA-WIREUP-001) ships the wireup. Investigating H1/H2 needs new falsifiers, new tests, and a re-run, which is multi-PR scope.
  4. Why ship the wireup before resolving the val_loss anomaly? The wireup is correct (CUDA + --init no longer fail-fasts; 1-step + 500-step smokes both complete; checkpoint writes correctly). Numerical-correctness is downstream — they're separate ship gates.
  5. Why publish the methodology-suspect evidence instead of waiting? Per spec discipline (audit-trail amendments preserve cadence): recording the suspect verdict honestly NOW, with the H1/H2 investigation queued (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001), is more useful than silence.

SHIP-TWO impact

Test plan

  • Documentation-only change (no Rust code, no contract YAML)
  • Evidence pinned at dispatch.txt (.log is gitignored)
  • README includes raw run config, falsifier table, methodology audit, follow-up plan

Files

  • evidence/section-59-5g-2-dispatch-2026-05-09/README.md (NEW, methodology audit + verdicts)
  • evidence/section-59-5g-2-dispatch-2026-05-09/dispatch.txt (NEW, raw apr pretrain stdout/stderr)

🤖 Generated with Claude Code

Records the first end-to-end LIVE 5g.2 dispatch enabled by §50.4 step
5f.5 (PR #1577). The wireup itself works; the val_loss numerical
result is recorded with an honest methodology audit per
`feedback_test_methodology_can_fake_bugs.md`.

What this evidence proves:
  - apr pretrain --init Qwen.apr --device cuda runs end-to-end on
    RTX 4090 (forward + backward + AdamW + checkpoint write).
  - Wall budget ~40s for 300 steps batch=4 seq=512 (FALSIFY-002).
  - Checkpoint serializes as valid APR v2 with passing checksum
    (FALSIFY-004).
  - No CUDA errors during run (FALSIFY-006).

What this evidence does NOT prove (and the README is explicit):
  - val_loss=0.0008 is implausibly low; FALSIFY-005 is recorded as
    NUMERICALLY-PASSED-METHODOLOGY-SUSPECT, not DISCHARGED.
  - MODEL-2 ship % stays at 57% until two follow-up falsifiers
    bind: H1 (eval_batch correctness) + H2 (populate-tensor coverage).
  - Inference verification is blocked (saved checkpoint lacks
    embedded tokenizer; PMAT-172 rejects `apr run`).

Five-Whys for the methodology gate:

1. Why not record FALSIFY-005 as DISCHARGED? Industry-baseline
   val_loss for 0.5B on Python is ~2.0-3.0; reaching 0.0008 in
   300 steps is empirically implausible. Per
   `feedback_test_methodology_can_fake_bugs.md`, single-statistic
   gates need shape verification before trust.
2. Why two hypotheses (H1 eval bug + H2 populate gap)? The saved
   checkpoint has 219 tensors; canonical Qwen 0.5B APR has 290.
   71 tensors didn't transfer — either the populate helper drops
   them silently, or the polymorphic Transformer struct doesn't
   expose them in named_parameters(). Independently, the loss
   collapse-to-zero shape suggests a degenerate eval_batch path.
3. Why not investigate H1 + H2 in this PR? PR #1577 ships the
   wireup. That's a clean, atomic, falsifiable change. Investigating
   H1/H2 needs new falsifiers, new tests, and a re-run — multi-PR
   scope per `feedback_falsifier_first_cascade_pattern.md`.
4. Why ship the wireup before resolving the val_loss anomaly? The
   wireup is correct (CUDA + --init no longer fail-fasts; 1-step
   smoke and 500-step smoke both complete; checkpoint writes
   correctly). The numerical-correctness question is downstream.
   Blocking 5f.5 on H1/H2 would conflate "the wireup exists" with
   "the wireup produces honest verdicts" — they're separate ship
   gates.
5. Why publish the methodology-suspect evidence instead of waiting?
   Per spec discipline ("audit-trail amendments preserve cadence"):
   recording the suspect verdict honestly NOW, with the H1/H2
   investigation queued, is more useful than silence. A future
   agent or operator inspecting `evidence/section-59-...` learns
   the exact gap and can pick up the investigation without
   re-deriving it.

Quality gates (this PR):
- Documentation-only change (no Rust code, no contract YAML).
- `pv validate` not exercised (no contract changed).
- Evidence pinned at `dispatch.txt` (.log is gitignored; renamed
  to .txt to track the raw stdout/stderr).

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work).
- MODEL-2 ship %: unchanged at 57% (val_loss anomaly blocks honest
  flip; tracked as PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001).
- §50.4 cascade: COMPLETE per #1577 (5a-5f.5 all shipped); only
  5g.3 verdict (post-anomaly-resolution) remains.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit e77dbcb into main on May 9, 2026
10 checks passed
noahgift deleted the docs/evidence-5g-2-live-smoke branch on May 9, 2026 06:35