docs(evidence): 5g.2 LIVE 500-step smoke + methodology audit (2026-05-09) by noahgift · Pull Request #1578 · paiml/aprender

noahgift · 2026-05-09T05:30:58Z

Summary

Records the first end-to-end LIVE 5g.2 dispatch enabled by §50.4 step 5f.5 (PR #1577 in flight).

The wireup itself works (forward + backward + AdamW + checkpoint write all run on RTX 4090); the val_loss numerical result is recorded with an honest methodology audit per feedback_test_methodology_can_fake_bugs.md.

MODEL-2 ship % stays at 57% — NOT flipped to ≥58% because val_loss=0.0008 is methodologically suspect and the contract apr-pretrain-init-finetune-v1.yaml v1.0.0 FALSIFY-005 is recorded as NUMERICALLY-PASSED-METHODOLOGY-SUSPECT, not DISCHARGED.

What this evidence proves

✅ apr pretrain --init Qwen.apr --device cuda runs end-to-end on RTX 4090 (FALSIFY-001 exit 0)
✅ Wall ~40s for 300 steps batch=4 seq=512 (FALSIFY-002 wall ≤ 3600s)
✅ Checkpoint serializes as valid APR v2 with passing checksum (FALSIFY-004 magic bytes)
✅ No CUDA errors during run (FALSIFY-006 GPU resource health)
❓ FALSIFY-005 val_loss < 9.38: numerical pass (0.0008 ≪ 9.38) but methodology suspect

Why FALSIFY-005 is NOT recorded as DISCHARGED

| Hypothesis | Symptom | Why suspect |
| --- | --- | --- |
| H1: `CudaTransformerTrainer::eval_batch` returns degenerate loss | val_loss collapses to ~zero regardless of input | CUDA eval path may compute differently than CPU path |
| H2: `populate_trainer_from_init_tensors` silently drops 71/290 Qwen tensors | Saved checkpoint has 219/290 init tensors | Name mismatch between Transformer struct and qwen2 APR |
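
The H2 coverage gap in the table above can be made concrete with a name-level diff. The following is a minimal sketch, not the project's actual falsifier: `missing_tensors` is a hypothetical helper, and it assumes both tensor inventories are available as plain lists of names.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns every expected tensor name that is absent
/// from the saved checkpoint, in the order the names appear in `expected`.
fn missing_tensors<'a>(expected: &[&'a str], saved: &[&str]) -> Vec<&'a str> {
    // Index the saved names once so each lookup is O(log n).
    let saved: BTreeSet<&str> = saved.iter().copied().collect();
    expected
        .iter()
        .copied()
        .filter(|name| !saved.contains(*name))
        .collect()
}

/// Fraction of expected tensors that made it into the checkpoint.
fn coverage(expected: &[&str], saved: &[&str]) -> f64 {
    if expected.is_empty() {
        return 1.0;
    }
    let missing = missing_tensors(expected, saved).len();
    (expected.len() - missing) as f64 / expected.len() as f64
}
```

With 290 expected names and 219 saved, the helper would report 71 missing names, which is the list a follow-up falsifier would need to explain one by one.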

Industry baseline: SmolLM-360M on 1T tokens trains to val_loss ~2.9; Qwen2.5-Coder-0.5B on Python is typically ~2.0-3.0. A 300-step fine-tune cannot lower val_loss to 0.0008 unless the eval is degenerate or held-out is leaked.

Per feedback_test_methodology_can_fake_bugs.md: single-statistic gates need shape verification before trust.
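
One way to encode "shape verification before trust" is a gate with three outcomes instead of two. The sketch below is illustrative only. The verdict names `NUMERICALLY-PASSED-METHODOLOGY-SUSPECT` and `DISCHARGED` come from this PR, while the `Failed` variant, the `plausible_floor` parameter, and the function names are assumptions. It also assumes val_loss is a mean cross-entropy in nats, so `exp(val_loss)` is the implied perplexity.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Assumed name for a plain numerical failure.
    Failed,
    /// Below the threshold, but also below what is plausible for the setup.
    NumericallyPassedMethodologySuspect,
    Discharged,
}

/// Perplexity implied by a mean cross-entropy loss measured in nats.
fn perplexity(val_loss: f64) -> f64 {
    val_loss.exp()
}

/// Hypothetical gate: `threshold` is the falsifier bound (9.38 for FALSIFY-005);
/// `plausible_floor` is an assumed lower bound taken from published baselines.
fn gate_val_loss(val_loss: f64, threshold: f64, plausible_floor: f64) -> Verdict {
    if !val_loss.is_finite() || val_loss >= threshold {
        Verdict::Failed
    } else if val_loss < plausible_floor {
        Verdict::NumericallyPassedMethodologySuspect
    } else {
        Verdict::Discharged
    }
}
```

Calling `gate_val_loss(0.0008, 9.38, 2.0)` with the lower end of the baseline range as the floor yields the suspect verdict. Under the nats assumption, a loss of 0.0008 implies a perplexity of about 1.0008, meaning the model would be assigning nearly all probability mass to the correct next token on held-out data.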

Five-Whys (methodology decision)

Why not record FALSIFY-005 as DISCHARGED? Industry-baseline val_loss for 0.5B on Python is ~2.0-3.0; reaching 0.0008 in 300 steps is empirically implausible.
Why two hypotheses? The saved checkpoint has 219 tensors; canonical Qwen 0.5B APR has 290. Independently, the loss collapse-to-zero shape suggests a degenerate eval_batch path.
Why not investigate H1 + H2 in this PR? PR atomicity. PR #1577 (feat: §50.4 step 5f.5 CUDA --init wireup, PMAT-CODE-PRETRAIN-INIT-CUDA-WIREUP-001) ships the wireup. Investigating H1/H2 needs new falsifiers, new tests, and a re-run, which is multi-PR scope.
Why ship the wireup before resolving the val_loss anomaly? The wireup is correct (CUDA + --init no longer fail-fasts; 1-step + 500-step smokes both complete; checkpoint writes correctly). Numerical correctness is a downstream question: "the wireup exists" and "the wireup produces honest verdicts" are separate ship gates.
Why publish the methodology-suspect evidence instead of waiting? Per spec discipline (audit-trail amendments preserve cadence): recording the suspect verdict honestly NOW, with the H1/H2 investigation queued (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001), is more useful than silence.
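
A follow-up falsifier for H1 could compare the CUDA loss against an independent CPU reference on the same batch. The sketch below is one possible shape for that check, not the project's `eval_batch`: `mean_cross_entropy` and `paths_agree` are hypothetical names, and the layout (row-major logits of shape `[n, vocab]`, loss in nats) is an assumption.

```rust
/// Reference mean cross-entropy in nats. `logits` is assumed row-major with
/// shape [targets.len(), vocab]; `targets` holds the correct class per row.
fn mean_cross_entropy(logits: &[f32], vocab: usize, targets: &[usize]) -> f64 {
    assert!(vocab > 0 && !targets.is_empty());
    assert_eq!(logits.len(), vocab * targets.len());
    let mut total = 0.0f64;
    for (row, &t) in logits.chunks_exact(vocab).zip(targets) {
        // Log-sum-exp with the row maximum subtracted for numerical stability.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max) as f64;
        let sum_exp: f64 = row.iter().map(|&x| (x as f64 - max).exp()).sum();
        total += max + sum_exp.ln() - row[t] as f64;
    }
    total / targets.len() as f64
}

/// True when two loss values agree within a relative tolerance.
fn paths_agree(cpu: f64, gpu: f64, rel_tol: f64) -> bool {
    (cpu - gpu).abs() <= rel_tol * cpu.abs().max(1e-12)
}
```

A useful shape check is that uniform logits must give a loss of `ln(vocab)` regardless of the targets. An eval path that reports a near-zero loss on such an input would be degenerate in the sense H1 describes.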

SHIP-TWO impact

MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work)
MODEL-2 ship %: unchanged at 57% (val_loss anomaly blocks honest flip)
§50.4 cascade: COMPLETE per PR #1577 (feat: §50.4 step 5f.5 CUDA --init wireup, PMAT-CODE-PRETRAIN-INIT-CUDA-WIREUP-001); 5a-5f.5 all shipped, and only the 5g.3 verdict (post-anomaly-resolution) remains

Test plan

Documentation-only change (no Rust code, no contract YAML)
Evidence pinned at dispatch.txt (.log is gitignored)
README includes raw run config, falsifier table, methodology audit, follow-up plan

Files

evidence/section-59-5g-2-dispatch-2026-05-09/README.md (NEW, methodology audit + verdicts)
evidence/section-59-5g-2-dispatch-2026-05-09/dispatch.txt (NEW, raw apr pretrain stdout/stderr)

🤖 Generated with Claude Code

…-09)

Records the first end-to-end LIVE 5g.2 dispatch enabled by §50.4 step 5f.5 (PR #1577). The wireup itself works; the val_loss numerical result is recorded with an honest methodology audit per `feedback_test_methodology_can_fake_bugs.md`.

What this evidence proves:
- apr pretrain --init Qwen.apr --device cuda runs end-to-end on RTX 4090 (forward + backward + AdamW + checkpoint write).
- Wall budget ~40s for 300 steps batch=4 seq=512 (FALSIFY-002).
- Checkpoint serializes as valid APR v2 with passing checksum (FALSIFY-004).
- No CUDA errors during run (FALSIFY-006).

What this evidence does NOT prove (and the README is explicit):
- val_loss=0.0008 is implausibly low; FALSIFY-005 is recorded as NUMERICALLY-PASSED-METHODOLOGY-SUSPECT, not DISCHARGED.
- MODEL-2 ship % stays at 57% until two follow-up falsifiers bind: H1 (eval_batch correctness) + H2 (populate-tensor coverage).
- Inference verification is blocked (saved checkpoint lacks embedded tokenizer; PMAT-172 rejects `apr run`).

Five-Whys for the methodology gate:
1. Why not record FALSIFY-005 as DISCHARGED? Industry-baseline val_loss for 0.5B on Python is ~2.0-3.0; reaching 0.0008 in 300 steps is empirically implausible. Per `feedback_test_methodology_can_fake_bugs.md`, single-statistic gates need shape verification before trust.
2. Why two hypotheses (H1 eval bug + H2 populate gap)? The saved checkpoint has 219 tensors; canonical Qwen 0.5B APR has 290. 71 tensors didn't transfer: either the populate helper drops them silently, or the polymorphic Transformer struct doesn't expose them in named_parameters(). Independently, the loss collapse-to-zero shape suggests a degenerate eval_batch path.
3. Why not investigate H1 + H2 in this PR? PR #1577 ships the wireup. That's a clean, atomic, falsifiable change. Investigating H1/H2 needs new falsifiers, new tests, and a re-run, which is multi-PR scope per `feedback_falsifier_first_cascade_pattern.md`.
4. Why ship the wireup before resolving the val_loss anomaly? The wireup is correct (CUDA + --init no longer fail-fasts; 1-step smoke and 500-step smoke both complete; checkpoint writes correctly). The numerical-correctness question is downstream. Blocking 5f.5 on H1/H2 would conflate "the wireup exists" with "the wireup produces honest verdicts"; they're separate ship gates.
5. Why publish the methodology-suspect evidence instead of waiting? Per spec discipline ("audit-trail amendments preserve cadence"): recording the suspect verdict honestly NOW, with the H1/H2 investigation queued, is more useful than silence. A future agent or operator inspecting `evidence/section-59-...` learns the exact gap and can pick up the investigation without re-deriving it.

Quality gates (this PR):
- Documentation-only change (no Rust code, no contract YAML).
- `pv validate` not exercised (no contract changed).
- Evidence pinned at `dispatch.txt` (.log is gitignored; renamed to .txt to track the raw stdout/stderr).

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 work).
- MODEL-2 ship %: unchanged at 57% (val_loss anomaly blocks honest flip; tracked as PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001).
- §50.4 cascade: COMPLETE per #1577 (5a-5f.5 all shipped); only 5g.3 verdict (post-anomaly-resolution) remains.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 9, 2026 05:31

noahgift mentioned this pull request May 9, 2026

feat(aprender-train): respect config.use_bias in attention constructor (PMAT-CODE-PRETRAIN-INIT-POPULATE-COVERAGE-001) #1579

Merged

9 tasks

Merge branch 'main' into docs/evidence-5g-2-live-smoke

3a8b389

noahgift mentioned this pull request May 9, 2026

docs(evidence): 5g.2 LIVE re-dispatch surfaces H1 eval-batch divergence (PMAT-CODE-PRETRAIN-EVAL-METHODOLOGY-001) #1580

Merged

3 tasks

noahgift merged commit e77dbcb into main May 9, 2026
10 checks passed

noahgift deleted the docs/evidence-5g-2-live-smoke branch May 9, 2026 06:35

noahgift merged 2 commits into main from docs/evidence-5g-2-live-smoke
