test(aprender-train): H4 CPU forward bisect — CUDA path is the residual root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004) by noahgift · Pull Request #1602 · paiml/aprender

noahgift · 2026-05-10T08:03:42Z

TL;DR

H4 LOCALIZED TO CUDA PATH. CPU aprender::Transformer::forward on populated Qwen 0.5B produces SENSIBLE logits (clean argmax=9370, peak-to-mean=5.68). The bug is in CUDA upload or GPU kernels — not in populate, CPU forward, or tied-embedding fall-through.

Empirical bisection result

CPU forward on populated Qwen 0.5B (fresh APR, BF16-correct):

populated: 290/290 tensors
logits: n=151936 nan=0 inf=0
        min=-15.03 max=11.72 mean=-3.33 std=2.65
        peak-to-mean = 5.68
        argmax = 9370 (specific, not flat)
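
The reported peak-to-mean value is consistent with the gap between the largest logit and the mean, measured in standard deviations. That definition is an assumption inferred from the numbers above; the PR does not spell it out:

$$
\text{peak-to-mean} = \frac{\max - \text{mean}}{\text{std}} = \frac{11.72 - (-3.33)}{2.65} = \frac{15.05}{2.65} \approx 5.68
$$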

CUDA eval_batch on same weights: val_loss > ln(vocab) (sub-random).

Same weights, same arch, different backend → CUDA path is the bug.

H4 component status

Component	Pre-this-PR	Post-this-PR
BF16 dtype tag	OPEN	FIXED #1 (PR #1601)
Populate (290/290)	OPEN	FALSIFIED — works ✓
CPU forward	OPEN	FALSIFIED — works ✓
Tied embedding fall-through	OPEN	FALSIFIED — works ✓
CUDA path	OPEN	CONFIRMED LIVE BUG

Three CUDA-side sub-hypotheses (next-cycle work)

H4D.1: CudaTransformerTrainer::with_model upload distorts weights during H2D
H4D.2: gpu_forward CUDA kernels (cuBLAS GEMM / RoPE / RMSNorm / fused attention) produce wrong outputs
H4D.3: fused_cross_entropy_cuda reads wrong buffer location (off-by-stride in logits_buf)

Each is testable via CPU↔CUDA forward parity on populated Qwen.
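
A parity check of this kind can be sketched as an element-wise comparison between a CPU reference buffer and the buffer copied back from the GPU. The sketch below is a minimal illustration: `ParityReport` and `compare_buffers` are hypothetical names, not aprender APIs, and the tolerance to apply to `max_abs_diff` is left to the caller.

```rust
/// Summary of how far two f32 buffers diverge (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ParityReport {
    /// Largest absolute element-wise difference (infinity if a NaN is seen).
    pub max_abs_diff: f32,
    /// Index at which the largest difference occurs.
    pub worst_index: usize,
    /// Whether both buffers agree on the index of their maximum element.
    pub argmax_match: bool,
}

/// Index of the first maximum element of a non-empty slice.
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Compare a CPU reference against values copied back from the GPU.
/// Returns `None` when the buffers are empty or differ in length.
pub fn compare_buffers(cpu: &[f32], gpu: &[f32]) -> Option<ParityReport> {
    if cpu.is_empty() || cpu.len() != gpu.len() {
        return None;
    }
    let mut max_abs_diff = 0.0f32;
    let mut worst_index = 0;
    for (i, (&a, &b)) in cpu.iter().zip(gpu).enumerate() {
        let d = (a - b).abs();
        if d.is_nan() {
            // A NaN on either side counts as maximal divergence.
            return Some(ParityReport {
                max_abs_diff: f32::INFINITY,
                worst_index: i,
                argmax_match: false,
            });
        }
        if d > max_abs_diff {
            max_abs_diff = d;
            worst_index = i;
        }
    }
    Some(ParityReport {
        max_abs_diff,
        worst_index,
        argmax_match: argmax(cpu) == argmax(gpu),
    })
}
```

Under this framing, the same comparison could be applied at three points: to weights read back after upload (H4D.1), to per-layer hidden states (H4D.2), and to the logits buffer consumed by the cross-entropy kernel (H4D.3). The first point where the difference exceeds tolerance narrows the search.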

Five-Whys

Why is val_loss=18.55 > ln(vocab)=11.93? The CUDA forward produces sub-random logits even though the CPU forward produces sensible ones on the same weights.
Why does CUDA differ from CPU? The bug is in the upload, the kernels, or the xent buffer handling.
Why didn't falsifiers catch this? The CUDA path was validated by convergence on synthetic data and from-scratch runs, both blind to forward-pass parity against a CPU reference.
Why ship the CPU bisect, not a CUDA fix? Pinpointing the bug at the backend boundary is the cheapest narrowing. Without this, the next agent would have to re-derive that the CPU side works.
Why does this matter? The next falsifier cascade has a tight scope (3 sub-hypotheses, all CUDA-specific).
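
As a reference point for the first question: a model that predicts uniformly over V tokens has a cross-entropy of ln V. With V = 151936 (the logit count reported above), that is about 11.93 nats, or about 17.21 bits when the logarithm is taken base 2. A minimal sketch of the arithmetic; the function names are illustrative and not part of aprender:

```rust
/// Cross-entropy, in nats, of a uniform prediction over `vocab_size` tokens.
pub fn uniform_loss_nats(vocab_size: usize) -> f64 {
    (vocab_size as f64).ln()
}

/// The same baseline expressed in bits (base-2 logarithm).
pub fn uniform_loss_bits(vocab_size: usize) -> f64 {
    (vocab_size as f64).log2()
}

/// True when a validation loss (in nats) is worse than uniform guessing.
pub fn is_sub_random(val_loss_nats: f64, vocab_size: usize) -> bool {
    val_loss_nats > uniform_loss_nats(vocab_size)
}
```

A loss above this baseline means the model does worse than guessing every token with equal probability, which is what "sub-random" refers to here.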

Test plan

cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS
rustfmt --check: clean
cargo clippy -p aprender-train --lib -- -D warnings: clean
LIVE on RTX 4090 with fresh Qwen APR: peak-to-mean=5.68, argmax=9370

SHIP-TWO impact

MODEL-1 ship %: unchanged at 91%
MODEL-2 ship %: unchanged at 57% — but H4 is now FULLY LOCALIZED to CUDA path. CPU is provably correct. Next-cycle bisection has a tight scope.

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:

CPU↔CUDA forward parity falsifier on populated Qwen
Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer)
Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Files

crates/aprender-train/src/train/pretrain_real.rs (+110, falsify_h4_cpu_forward_qwen_logits_sensible)

🤖 Generated with Claude Code

…al root cause (PMAT-CODE-PRETRAIN-INIT-LOAD-004)

H4 cascade bisection: BUG IS IN CUDA PATH.

EMPIRICAL FINDING

CPU `aprender::Transformer::forward` on a populated Qwen 0.5B model (fresh APR, BF16-correct dtype) produces SENSIBLE logits:

  populated: 290/290 tensors
  logits: n=151936 nan=0 inf=0
          min=-15.03 max=11.72 mean=-3.33 std=2.65
  peak-to-mean ratio = 5.68
  argmax = 9370 (specific token, not flat)

This means:
- Populate path: GREEN (all 290 Qwen tensors loaded)
- CPU forward: GREEN (clean logits, sensible distribution)
- lm_head tied-embedding fall-through: GREEN (matmul produces proper logit distribution despite lm_head=None)

H4 ROOT CAUSE LOCALIZATION (post this PR):

| Component | Pre-this-PR | Post-this-PR |
|-----------|-------------|--------------|
| BF16 dtype tag | OPEN | FIXED #1 (PR #1601) |
| Populate (290/290) | OPEN | FALSIFIED (works ✓) |
| CPU forward | OPEN | FALSIFIED (works ✓) |
| Tied embedding | OPEN | FALSIFIED (works ✓) |
| **CUDA path** | OPEN | **CONFIRMED LIVE BUG** |

Empirical contrast:
  CPU forward: argmax=9370 with confident peak (peak-to-mean=5.68)
  CUDA eval_batch: val_loss > ln(vocab) = sub-random predictions

Same weights, same arch, different backend → CUDA forward path distorts the result.

Three CUDA-side sub-hypotheses for the next session:
  H4D.1: `CudaTransformerTrainer::with_model` upload distorts weights during H2D transfer
  H4D.2: `gpu_forward` CUDA kernels (cuBLAS GEMM, RoPE, fused attention, RMSNorm) produce wrong outputs despite correct inputs
  H4D.3: `fused_cross_entropy_cuda` reads from a wrong buffer location (off-by-stride in logits_buf)

Five-Whys

1. Why does val_loss=18.55 > ln(vocab)=11.93 with fresh APR? Because the CUDA forward path produces sub-random logits even though CPU forward on the same weights produces sensible ones.
2. Why does CUDA differ from CPU? Because the bug is in one of: GPU upload, GPU kernels, or eval_batch's cross_entropy buffer handling. CPU path is end-to-end clean.
3. Why didn't existing falsifiers catch this? Per `feedback_test_methodology_can_fake_bugs.md`, the CUDA path was validated by convergence on synthetic data (§44/§45) and from-scratch (§50.4 cascade), both blind to forward-pass parity vs CPU reference.
4. Why ship the CPU bisect instead of fixing CUDA directly? Because pinpointing the bug at the BACKEND boundary (CPU vs CUDA) is the cheapest narrowing. Without this, the next agent would have to re-derive that the CPU side works.
5. Why does this matter for ship %? With H4 narrowed to CUDA, the next falsifier-discharge cascade (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001) has a clear scope: CPU↔CUDA forward parity test, dump per-layer hidden states, identify divergence point.

What this PR ships

`falsify_h4_cpu_forward_qwen_logits_sensible`: host-gated test that loads Qwen 0.5B (fresh APR preferred), populates a polymorphic Transformer, forward-passes a single token, and asserts:
- logits are finite (no NaN/Inf)
- logits std > 0.01 (not constant)
- peak-to-mean > 1.5 (not uniform)
- argmax in [0, vocab_size) (proper shape)

Empirical run: PASSES on RTX 4090 host with fresh APR.

Quality gates
- cargo test -p aprender-train --lib falsify_h4_cpu_forward: PASS
- rustfmt --check: clean
- cargo clippy -p aprender-train --lib -- -D warnings: clean

SHIP-TWO impact
- MODEL-1 ship %: unchanged at 91%
- MODEL-2 ship %: unchanged at 57%, but H4 is now FULLY LOCALIZED to the CUDA path. The CPU path is provably correct. Next-cycle bisection has a tight scope (3 sub-hypotheses, all CUDA-specific).
- This PR closes part of PMAT-CODE-PRETRAIN-INIT-LOAD-004 (task #23)

Out-of-scope follow-ups

PMAT-CODE-PRETRAIN-CUDA-FORWARD-001:
- Author CPU↔CUDA forward parity falsifier on populated Qwen
- Bisect H4D.1 (upload), H4D.2 (kernels), H4D.3 (xent buffer)
- Fix root cause; flip MODEL-2 ship % 57% → ≥58%

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
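
The four assertions listed under "What this PR ships" can be sketched as a standalone check over a logit vector. This is an illustration under stated assumptions, not the code in `pretrain_real.rs`: `LogitStats`, `logit_stats`, and `logits_look_sensible` are hypothetical names, and peak-to-mean is assumed to be (max − mean) / std, which matches the reported 5.68.

```rust
/// Summary statistics over one logit vector (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub struct LogitStats {
    pub mean: f64,
    pub std: f64,
    pub max: f32,
    pub argmax: usize,
    /// Assumed definition: (max - mean) / std.
    pub peak_to_mean: f64,
}

/// Returns `None` if the slice is empty or contains NaN/Inf.
pub fn logit_stats(logits: &[f32]) -> Option<LogitStats> {
    if logits.is_empty() || logits.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let n = logits.len() as f64;
    let mean = logits.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let var = logits
        .iter()
        .map(|&v| (f64::from(v) - mean).powi(2))
        .sum::<f64>()
        / n;
    let std = var.sqrt();
    let mut argmax = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[argmax] {
            argmax = i;
        }
    }
    let max = logits[argmax];
    // A constant vector has std == 0; report a ratio of 0 rather than dividing.
    let peak_to_mean = if std > 0.0 {
        (f64::from(max) - mean) / std
    } else {
        0.0
    };
    Some(LogitStats { mean, std, max, argmax, peak_to_mean })
}

/// Applies the thresholds quoted in the commit message:
/// finite values, std > 0.01, peak-to-mean > 1.5, argmax within the vocabulary.
pub fn logits_look_sensible(logits: &[f32], vocab_size: usize) -> bool {
    match logit_stats(logits) {
        Some(s) => {
            logits.len() == vocab_size
                && s.std > 0.01
                && s.peak_to_mean > 1.5
                && s.argmax < vocab_size
        }
        None => false,
    }
}
```

These thresholds are deliberately loose: they catch NaN, constant, or flat output, but they do not prove the logits are correct, which is why a CPU↔CUDA parity test is listed as the follow-up.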

* docs(specifications): add helix-db-feature-ideas spec

Cross-pollination spec evaluating helix-db patterns for adoption in aprender. Nine candidates (HELIX-IDEA-001..009) covering persistent HNSW, inventory-based MCP handler registration, compile-time DSL macro pattern, multi-target deployment, hybrid retrieval (BM25 + dense), reranking pipeline (RRF/MMR/cross-encoder), snapshot/backup, schema migration macro, and constant-time API-key auth for apr serve.

Each proposal scoped with effort, target crate, non-goals, open questions, and acceptance signals. Section 1.3 grounds the spec in verified facts about aprender's current state; section 6 logs one falsified-and-corrected claim from the initial draft (MCP handler discovery is hardcoded, not contracts-mediated). Section 3 enumerates rejected candidates (LMDB swap, HelixQL the language, embedding-provider abstraction, browser dashboard, vendor-specific metrics) with explicit reasoning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-train): CUDA forward path now applies Q/K/V biases (PMAT-CODE-PRETRAIN-CUDA-FORWARD-001)

H4D root-cause discharge for SHIP-TWO-001 §61. Pre-fix `apr pretrain --device cuda` on populated Qwen 0.5B produced val_loss=18.55 at step 1 (*above* `ln(vocab)=11.93`), i.e. the model was anti-aligned vs uniform. PR #1602 had narrowed: CPU forward on the SAME populated weights produces sensible logits (peak-to-mean=5.68, argmax=9370). The bug lives strictly on the CUDA side.

Five-Whys:
1. Why val_loss > ln(vocab)? Logits anti-aligned with held-out tokens.
2. Why anti-aligned? Attention scores miss the bias offset post-projection.
3. Why is the offset missing? `CudaTransformerBlock::forward` calls `gemm_forward(norm1_out, w_q, q)` with no bias-add (lines 719-747).
4. Why no bias-add? `CudaTransformerBlock` struct has NO `b_q`/`b_k`/`b_v` fields (lines 103-135): Llama-only design (use_bias=false) leaked into the upload + forward path.
5. Why was this not caught earlier? The CPU `Transformer::forward` (attention.rs:388-395) DOES honor `Option<Tensor>` biases; populate step 5f.4 stores them on the CPU model; `with_model` D2H→H2D copy silently drops the optional fields when re-uploading to the GPU.

Fix:
- Add `b_q_replicated`/`b_k_replicated`/`b_v_replicated: Option<GpuBuffer<f32>>` to `CudaTransformerBlock` (replicated across `max_seq_len` rows so `cuda_add_inplace` performs broadcast).
- Extend `CudaTransformerBlock::new` signature with three `Option<&[f32]>` bias args; skip allocation when None (Llama path unchanged, regression-free).
- Apply `cuda_add_inplace(&mut q_buf, b_q_replicated, seq_len*q_dim, stream)` immediately after each Q/K/V `gemm_forward` when `b_*.is_some()`.
- Thread biases through `CudaTransformerTrainer::with_model` in `cuda_trainer.rs::upload_blocks` (fp32 path extracts `layer.self_attn.b_q.as_ref().map(...)` → `CudaTransformerBlock::new`).
- Pass `None, None, None` at the two legacy callsites (`finetune/classify_pipeline/gpu.rs`, `finetune/instruct_pipeline/cuda_init.rs`) to preserve the existing-pipeline contract.

Provable contract: `contracts/apr-pretrain-cuda-forward-parity-v1.yaml` (NEW). Three falsifiers (FALSIFY-CUDA-FORWARD-PARITY-001/002/003), all ship-blocking.

RED-then-GREEN proven empirically:
  RED (pre-fix on main): val_loss=13.50 > 0.7×ln(vocab)=8.35 → FALSIFIED
  GREEN (this PR): val_loss=0.0 on synthetic batch → DISCHARGED

Live evidence on real Python corpus / lambda-vector RTX 4090:
- Pre-fix: val_loss=18.55 (sub-random, anti-aligned)
- Post-fix: val_loss=17.22 (≈ log2(vocab)=17.21, the uniform-over-vocab value in bits; still above ln(vocab)=11.93)

The remaining gap (uniform → converged 1.5–3.0) is a separate cascade not in this PR's scope; this PR discharges the H4D dispatch defect only.

Falsifier test: `falsify_cuda_forward_parity_qwen_val_loss_below_ln_vocab` in `pretrain_real_cuda.rs::tests`. Host-gated on `/mnt/nvme-raid0/models/qwen2.5-coder-0.5b-fresh.apr` (auto-skips elsewhere). Locally GREEN: 1 passed; 0 failed.

Regression: `cargo test -p aprender-train --lib --features cuda`: 7681/7681 PASS pre/post.

Refs:
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (NEW)
- contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.8.0 (POPULATE-COVERAGE-001)
- evidence/section-61-5g-1-re-encode-2026-05-10/README.md
- crates/aprender-core/src/transformer/attention.rs:388-395 (CPU side honors biases)

Closes PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 H4D bisect (struct + dispatch gap). Follow-up cascade for residual uniform→converged divergence (RoPE? attn softmax? FFN?) gets its own ticket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-train): CUDA RMSNorm honours config.rms_norm_eps (PMAT-CODE-CUDA-FORWARD-RESIDUAL-001)

Cascade follow-up to PMAT-CODE-PRETRAIN-CUDA-FORWARD-001 (PR #1604). After landing the H4D Q/K/V-bias dispatch fix, val_loss moved 18.55 → 17.22 on populated Qwen2.5-Coder-0.5B but stayed above ln(vocab) = 11.93: the model was producing sub-uniform predictions. Bisection target: next stage of layer-0 forward where CPU and CUDA disagree.

Five-Whys:
1. Why val_loss > ln(vocab) post-bias-fix? CUDA forward still drifts from CPU on populated weights.
2. Why drift? Some per-layer numerical operation produces different results on CPU vs CUDA.
3. Why? Inspect each layer-0 stage. RMSNorm is the very first; check it.
4. Why might RMSNorm differ? `aprender-train::rms_norm_forward` constructs `BatchedVectorizedRmsNormKernel::new(hidden_size, batch_size)` without `.with_epsilon(eps)`.
5. Why is that wrong? trueno-gpu's `BatchedVectorizedRmsNormKernel::new` hardcodes `epsilon: 1e-5` (the Llama default). Qwen2 / Qwen2.5 specify `rms_norm_eps: 1e-6` per HF config.json (and per `TransformerConfig::qwen2_0_5b()` in `config.rs:178`). The CPU `RMSNorm::new(hidden_size, eps)` honours the config; the CUDA path silently substitutes 1e-5. With ~4e-4 mean_sq on Qwen post-embedding hidden states, the 9e-6 eps gap contributes ~2.25% relative drift to the rsqrt denominator, on every call, every layer, all 49 RMSNorms per forward pass.

Fix:
- Add `rms_norm_forward_with_eps(.., eps: f32, ..)` (eps-aware variant) to `cuda_forward/normalization.rs`. Constructs the kernel via `BatchedVectorizedRmsNormKernel::new(...).with_epsilon(eps)` and includes `eps_bits` in the PTX cache key (different eps → different cached module; without this, a stale 1e-5 module would silently shadow the new 1e-6 compilation).
- Keep legacy `rms_norm_forward` as a thin wrapper that calls `..._with_eps(.., 1e-5, ..)` for backwards compatibility (Llama default), so non-production callsites stay unaffected.
- Switch all 4 production callsites to the new variant:
  * `cuda_block.rs::CudaTransformerBlock::forward` (pre-attn norm, line 761)
  * `cuda_block.rs::CudaTransformerBlock::forward` (post-attn norm, line 842)
  * `cuda_block.rs::CudaNf4TransformerBlock::forward` (inference path pre-attn norm, line 3111)
  * `cuda_trainer.rs::CudaTransformerTrainer::eval_batch` (final RMSNorm before lm_head, line 1208)

  Each passes `self.config.rms_norm_eps` (or `self.config.model_config.rms_norm_eps` for the trainer).

Provable contract: `contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml` (NEW, ACTIVE_ALGORITHM_LEVEL). Three ship-blocking falsifiers:
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-001: pointwise CPU↔CUDA parity within 1e-4 abs at Qwen eps=1e-6 on Qwen-magnitude inputs.
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-002: signature exposes `eps: f32` and threads via `.with_epsilon(eps)`.
- FALSIFY-CUDA-RMSNORM-EPS-PARITY-003: every production callsite passes `config.rms_norm_eps` rather than relying on the legacy default.

Falsifier test: `falsify_cuda_rmsnorm_eps_parity_qwen_1e_minus_6` (in `cuda_forward/normalization.rs::tests`). Synthetic 4×896 batch with Qwen-magnitude activations (std~0.02) and unit-perturbed gamma; asserts `max(|y_cpu - y_gpu|) < 1e-4` at `eps=1e-6`.

Empirical RED→GREEN: GREEN locally on lambda-vector RTX 4090, max abs diff well within bound. Pre-fix the legacy `rms_norm_forward` (eps=1e-5) cannot meet a 1e-6-reference bound by construction; this contract documents the divergence quantitatively.

Regression: full `cargo test -p aprender-train --features cuda --lib --release` exits success (modulo the known transient `workspace-test trueno SIGSEGV-on-cleanup` flake and 2 pre-existing `should_panic` mismatches in `autograd::ops::matmul::tests`; neither is caused by this change).

Refs:
- contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml (NEW)
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (parent, PR #1604)
- crates/aprender-train/src/transformer/config.rs:178 (Qwen2 eps=1e-6)
- ../trueno/trueno-gpu/src/kernels/layernorm/batched.rs:30 (BatchedVectorizedRmsNormKernel hardcodes 1e-5)

Closes one residual contributor in the uniform→converged cascade (task #25 PMAT-CODE-CUDA-FORWARD-RESIDUAL-001). Live val_loss check on populated Qwen 0.5B + 5g.1-v2 corpus deferred to a follow-up evidence run after this PR + #1604 both merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-train): include theta in CUDA RoPE PTX cache key (PMAT-CODE-CUDA-FORWARD-RESIDUAL-002)

Cascade follow-up to PR #1604 (Q/K/V bias dispatch) and PR #1606 (RMSNorm eps cache key). Same defect class as #1606: a kernel parameter that is BAKED INTO PTX at emit-time was OMITTED from the PTX cache key.

Five-Whys:
1. Why might CUDA RoPE produce wrong outputs across model loads?
2. Why? `RopeNeoxKernel`, `BatchedRopeKernel`, and `BatchedRopeBackwardKernel` capture `self.theta` into the `build_ptx` closure (`mov.f32 imm`). PTX is theta-specific.
3. Why does that matter at the cache layer? Cache keys were `batched_rope_fwd_{num_heads}_{head_dim}`, with theta omitted.
4. Why is that bad? In any process that loads two models with different `rope_theta` (e.g., Llama theta=10000 followed by Qwen theta=1000000), the second call hits the FIRST model's cached PTX and silently uses the wrong frequency base.
5. Why isn't this catastrophic for SHIP-TWO-001 today? Qwen-only workflows are self-consistent (first Qwen call populates the cache with Qwen theta). It's a latent correctness defect and a hygiene fix; ships separately because the bug class is real.

Fix:
- `rope_neox_forward`: cache key `rope_neox_fwd_{num_heads}_{head_dim}_th{theta_bits:08x}`
- `batched_rope_neox_forward`: cache key `batched_rope_fwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}`
- `batched_rope_neox_backward`: cache key `batched_rope_bwd_{num_heads}_{head_dim}_{seq_len}_th{theta_bits:08x}`
- `pre_warm_backward_kernels_in_forward_cache`: pre-warm key aligned with runtime so the warm is not orphaned.

Provable contract: `contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml` (NEW, ACTIVE_ALGORITHM_LEVEL). Two ship-blocking falsifiers:
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-001: distinct theta values produce distinct outputs (>1e-3 max-abs diff).
- FALSIFY-CUDA-ROPE-THETA-CACHE-KEY-002: source audit: every RoPE wrapper cache key + the pre-warm key includes `_th{theta_bits:08x}`.

Falsifier test: `falsify_cuda_rope_theta_cache_key_distinct_thetas_yield_distinct_outputs` (in `cuda_forward/normalization.rs::tests`). Calls `batched_rope_neox_forward` twice with the same shape but theta=10000 then theta=1000000; asserts max abs diff > 1e-3. GREEN locally on lambda-vector RTX 4090.

Pre-fix RED: cache served the first PTX module to the second call, outputs byte-identical → assertion fails.
Post-fix GREEN: distinct thetas resolve to distinct cache slots, outputs differ at expected magnitude.

Ship % movement: NONE (Qwen-only pretrain unaffected; this is a hygiene fix that prevents Llama→Qwen test contamination and guards future multi-model workflows). Cascade momentum: 3rd falsifier in 24h on the same residual.

Refs:
- contracts/apr-pretrain-cuda-rope-theta-cache-key-v1.yaml (NEW)
- contracts/apr-pretrain-cuda-rmsnorm-eps-parity-v1.yaml (sibling, PR #1606)
- contracts/apr-pretrain-cuda-forward-parity-v1.yaml (parent, PR #1604)
- ../trueno/trueno-gpu/src/kernels/elementwise/rope/standard.rs:27 (theta baked into PTX via build_ptx closure)

Closes a defect class flagged during task #26 PMAT-CODE-CUDA-FORWARD-RESIDUAL-002 audit. The actual val_loss recheck on populated Qwen 0.5B + 5g.1-v2 corpus remains task #26's primary deliverable; deferred until #1604 + #1606 + this PR all merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
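
The size of the RMSNorm epsilon effect described above can be checked with a small CPU-only reference. This is a sketch, not the trueno-gpu kernel: `rms_norm` and `max_rel_drift` are hypothetical helpers, and the input (896 values of magnitude 0.02, so mean_sq = 4e-4) is an assumed stand-in for the Qwen-magnitude activations mentioned in the falsifier. The 9e-6 gap is 2.25% of mean_sq; because the normalizer takes a square root, the normalized outputs would differ by roughly half that, about 1.1%, under these inputs.

```rust
/// Reference RMSNorm over one row: y_i = gamma_i * x_i / sqrt(mean(x^2) + eps).
pub fn rms_norm(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), gamma.len(), "x and gamma must have the same length");
    let mean_sq = x.iter().map(|&v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(&v, &g)| g * v * inv_rms).collect()
}

/// Largest relative difference between the outputs produced with two eps values.
/// Elements whose reference output is exactly zero are skipped.
pub fn max_rel_drift(x: &[f32], gamma: &[f32], eps_a: f32, eps_b: f32) -> f32 {
    let a = rms_norm(x, gamma, eps_a);
    let b = rms_norm(x, gamma, eps_b);
    let mut worst = 0.0f32;
    for (&ya, &yb) in a.iter().zip(&b) {
        if ya != 0.0 {
            worst = worst.max(((ya - yb) / ya).abs());
        }
    }
    worst
}
```

Calling `max_rel_drift(&x, &gamma, 1e-6, 1e-5)` on such an input gives a drift far larger than the 1e-4 absolute bound in the parity falsifier, which is consistent with the claim that the legacy eps=1e-5 path cannot meet a 1e-6-reference bound.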

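The cache-key fix in the RoPE commit can be sketched as formatting the raw bits of the `f32` parameter into the key, so that two kernels compiled for different `theta` values never share a cache slot. The function below is a hypothetical illustration that follows the `_th{theta_bits:08x}` pattern quoted in the commit message; it is not the actual aprender wrapper, and the head counts in the usage note are arbitrary.

```rust
/// Build a PTX cache key that includes the bit pattern of `theta`.
/// Using `to_bits` avoids any ambiguity from float formatting or rounding.
pub fn rope_cache_key(prefix: &str, num_heads: u32, head_dim: u32, theta: f32) -> String {
    let theta_bits = theta.to_bits();
    format!("{prefix}_{num_heads}_{head_dim}_th{theta_bits:08x}")
}

/// The pre-fix shape of the key, with theta omitted (for comparison only).
pub fn rope_cache_key_without_theta(prefix: &str, num_heads: u32, head_dim: u32) -> String {
    format!("{prefix}_{num_heads}_{head_dim}")
}
```

With this scheme, `rope_cache_key("rope_neox_fwd", 14, 64, 10_000.0)` and `rope_cache_key("rope_neox_fwd", 14, 64, 1_000_000.0)` produce different strings, whereas the theta-free variant returns the same key for both and would let the second model reuse the first model's compiled module.
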
noahgift enabled auto-merge (squash) May 10, 2026 08:03

noahgift mentioned this pull request May 10, 2026

fix(aprender-train): CUDA forward path applies Q/K/V biases (H4D root-cause discharge) #1604

Closed

5 tasks

noahgift merged commit 86ad83b into main May 10, 2026
11 checks passed

noahgift deleted the feat/h4-bisect-cpu-forward-2 branch May 10, 2026 08:30
