docs(evidence): 5g.2 LIVE 500-step smoke + methodology audit (2026-05-09) #1578
Merged
…-09)

Records the first end-to-end LIVE 5g.2 dispatch enabled by §50.4 step 5f.5 (PR #1577). The wireup itself works; the val_loss numerical result is recorded with an honest methodology audit per `feedback_test_methodology_can_fake_bugs.md`.

What this evidence proves:
- apr pretrain --init Qwen.apr --device cuda runs end-to-end on RTX 4090 (forward + backward + AdamW + checkpoint write).
- Wall budget ~40s for 300 steps batch=4 seq=512 (FALSIFY-002).
- Checkpoint serializes as valid APR v2 with passing checksum (FALSIFY-004).
- No CUDA errors during run (FALSIFY-006).

What this evidence does NOT prove (and the README is explicit):
- val_loss=0.0008 is implausibly low; FALSIFY-005 is recorded as NUMERICALLY-PASSED-METHODOLOGY-SUSPECT, not DISCHARGED.
- MODEL-2 ship % stays at 57% until two follow-up falsifiers bind: H1 (eval_batch correctness) + H2 (populate-tensor coverage).
- Inference verification is blocked (saved checkpoint lacks embedded tokenizer; PMAT-172 rejects `apr run`).

Five-Whys for the methodology gate:
1. Why not record FALSIFY-005 as DISCHARGED?
   Industry-baseline val_loss for 0.5B on Python is ~2.0-3.0; reaching 0.0008 in 300 steps is empirically implausible. Per `feedback_test_methodology_can_fake_bugs.md`, single-statistic gates need shape verification before trust.
2. Why two hypotheses (H1 eval bug + H2 populate gap)?
   The saved checkpoint has 219 tensors; canonical Qwen 0.5B APR has 290. 71 tensors didn't transfer: either the populate helper drops them silently, or the polymorphic Transformer struct doesn't expose them in named_parameters(). Independently, the loss collapse-to-zero shape suggests a degenerate eval_batch path.
3. Why not investigate H1 + H2 in this PR?
   PR #1577 ships the wireup. That's a clean, atomic, falsifiable change. Investigating H1/H2 needs new falsifiers, new tests, and a re-run, which is multi-PR scope per `feedback_falsifier_first_cascade_pattern.md`.
4. Why ship the wireup before resolving the val_loss anomaly?
   The wireup is correct (CUDA + --init no longer fail-fasts; 1-step smoke and 500-step smoke both complete; checkpoint writes correctly). The numerical-correctness question is downstream. Blocking 5f.5 on H1/H2 would conflate "the wireup exists" with "the wireup produces honest verdicts"; they're separate ship gates.
5. Why publish the methodology-suspect evidence instead of waiting?
   Per spec discipline ("audit-trail amendments preserve cadence"): recording the suspect verdict honestly NOW, with the H1/H2 investigation queued, is more useful than silence. A future agent or operator inspecting `evidence/section-59-...` learns the exact gap and can pick up the investigation without re-deriving it.

Quality gates (this PR):
- Documentation-only change (no Rust code, no contract YAML).
- `pv validate` not exercised (no contract changed).
- Evidence pinned at `dispatch.txt` (.log is gitignored; renamed to .txt to track the raw stdout/stderr).

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work).
- MODEL-2 ship %: unchanged at 57% (val_loss anomaly blocks honest flip; tracked as PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001).
- §50.4 cascade: COMPLETE per #1577 (5a-5f.5 all shipped); only 5g.3 verdict (post-anomaly-resolution) remains.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
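The H2 coverage gap described in the commit message (219 saved tensors against 290 in the canonical checkpoint) could be narrowed down by diffing tensor names. The following is a minimal sketch, assuming the two name lists have already been extracted by some other means; `missing_tensors` is a hypothetical helper, not part of `apr`, and the tensor names in any call are illustrative only.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the names present in `expected` but absent
/// from `actual`, sorted and de-duplicated. It does not read APR files;
/// the caller supplies both name lists.
fn missing_tensors<'a>(expected: &[&'a str], actual: &[&str]) -> Vec<&'a str> {
    // Index the names that actually made it into the saved checkpoint.
    let have: BTreeSet<&str> = actual.iter().copied().collect();

    // Keep every expected name that the saved checkpoint lacks.
    let mut missing: Vec<&'a str> = expected
        .iter()
        .copied()
        .filter(|name| !have.contains(*name))
        .collect();

    missing.sort_unstable();
    missing.dedup();
    missing
}
```

If the returned list had 71 entries for the real checkpoints, grouping those names by prefix would likely show whether whole layers or only specific parameter kinds fail to transfer.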
9 tasks
Summary
Records the first end-to-end LIVE 5g.2 dispatch enabled by §50.4 step 5f.5 (PR #1577 in flight).

The wireup itself works (forward + backward + AdamW + checkpoint write all run on RTX 4090); the val_loss numerical result is recorded with an honest methodology audit per `feedback_test_methodology_can_fake_bugs.md`.

MODEL-2 ship % stays at 57%. It is NOT flipped to ≥58% because val_loss=0.0008 is methodologically suspect, and FALSIFY-005 of the contract `apr-pretrain-init-finetune-v1.yaml` v1.0.0 is recorded as NUMERICALLY-PASSED-METHODOLOGY-SUSPECT, not DISCHARGED.

What this evidence proves
- `apr pretrain --init Qwen.apr --device cuda` runs end-to-end on RTX 4090 (FALSIFY-001 exit 0)

Why FALSIFY-005 is NOT recorded as DISCHARGED
- H1: `CudaTransformerTrainer::eval_batch` returns degenerate loss
- H2: `populate_trainer_from_init_tensors` silently drops 71/290 Qwen tensors

Industry baseline: SmolLM-360M on 1T tokens trains to val_loss ~2.9; Qwen2.5-Coder-0.5B on Python is typically ~2.0-3.0. A 300-step fine-tune cannot lower val_loss to 0.0008 unless the eval is degenerate or held-out is leaked.
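To make the implausibility concrete, the loss can be converted to perplexity. This is a minimal sketch, assuming val_loss is a mean per-token cross-entropy in nats (the source does not state the unit); `perplexity` and `below_plausibility_floor` are hypothetical helpers, and the 0.1 ratio used in the usage note is an arbitrary illustrative threshold, not a value from the contract.

```rust
/// Converts a mean per-token cross-entropy (assumed to be in nats)
/// into perplexity.
fn perplexity(val_loss: f64) -> f64 {
    val_loss.exp()
}

/// Hypothetical plausibility gate: flags a validation loss that falls below
/// `ratio` times the low end of the expected baseline range.
fn below_plausibility_floor(val_loss: f64, baseline_low: f64, ratio: f64) -> bool {
    val_loss < baseline_low * ratio
}
```

Under that assumption, `perplexity(0.0008)` is about 1.0008, meaning the model would be almost certain of every held-out token, while the 2.0-3.0 baseline range corresponds to perplexities of roughly 7.4 to 20.1. A call such as `below_plausibility_floor(0.0008, 2.0, 0.1)` returns true, which is the kind of signal that would justify holding the verdict at METHODOLOGY-SUSPECT.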
Per `feedback_test_methodology_can_fake_bugs.md`: single-statistic gates need shape verification before trust.

Five-Whys (methodology decision)
SHIP-TWO impact
Test plan
- `dispatch.txt` (`.log` is gitignored)

Files
- `evidence/section-59-5g-2-dispatch-2026-05-09/README.md` (NEW, methodology audit + verdicts)
- `evidence/section-59-5g-2-dispatch-2026-05-09/dispatch.txt` (NEW, raw apr pretrain stdout/stderr)

🤖 Generated with Claude Code