docs(ship-007): determinism test FALSIFIES parallel-reduction-nondeterminism hypothesis by noahgift · Pull Request #1101 · paiml/aprender

noahgift · 2026-04-28T08:47:07Z

Summary

Hypothesis from #1099 per-layer analysis: APR's matmul reduction may be parallel (rayon) producing non-deterministic f32 accumulation order. This PR tests it directly.
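The hypothesis is plausible because f32 addition is not associative: summing the same values in a different order can round differently. A minimal sketch, using made-up values and hypothetical helper names that are not taken from the APR code base:

```rust
/// Sum left to right: ((0 + x0) + x1) + x2 ...
fn sum_forward(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0_f32, |acc, &x| acc + x)
}

/// Sum right to left: same values, different accumulation order.
fn sum_reverse(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0_f32, |acc, &x| acc + x)
}

fn main() {
    // Illustrative values only. Near 1e8 the spacing between adjacent
    // f32 values is 8, so adding 1.0 to -1e8 rounds back to -1e8.
    let xs = [1.0e8_f32, -1.0e8, 1.0];
    println!("forward = {}", sum_forward(&xs));
    println!("reverse = {}", sum_reverse(&xs));
}
```

Each order is deterministic on its own. Run-to-run differences would appear only if the order itself changed between runs, which is what the determinism test checks.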

Result

Ran APR forward() twice with identical token input on the canonical 7B teacher and compared the logits element-wise:

Metric	Value
Total logits	152,064
Non-zero diffs	0 (0.000%)
Max abs diff	0.0000000000
RMS diff	0.0000000000

APR forward is byte-identical across runs. Hypothesis FALSIFIED.
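The metrics in the table can be computed along these lines. This is a sketch under assumptions, not the contents of diag_apr_determinism.rs: the `DiffStats` struct and `diff_stats` function are hypothetical names, and the inputs are tiny stand-ins for the 152,064 logits.

```rust
/// Hypothetical summary of an element-wise comparison of two logit vectors.
#[derive(Debug)]
struct DiffStats {
    total: usize,
    nonzero: usize,
    max_abs: f32,
    rms: f64,
    bit_identical: bool,
}

fn diff_stats(a: &[f32], b: &[f32]) -> DiffStats {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    let mut nonzero = 0_usize;
    let mut max_abs = 0.0_f32;
    let mut sum_sq = 0.0_f64;
    let mut bit_identical = true;
    for (&x, &y) in a.iter().zip(b) {
        let d = (x - y).abs();
        if d != 0.0 {
            nonzero += 1;
        }
        if d > max_abs {
            max_abs = d;
        }
        // Accumulate in f64 so the RMS itself does not lose precision.
        sum_sq += f64::from(d) * f64::from(d);
        // Compare raw bit patterns: stricter than a zero numeric diff
        // (distinguishes +0.0 from -0.0, and different NaN payloads).
        if x.to_bits() != y.to_bits() {
            bit_identical = false;
        }
    }
    let rms = if a.is_empty() { 0.0 } else { (sum_sq / a.len() as f64).sqrt() };
    DiffStats { total: a.len(), nonzero, max_abs, rms, bit_identical }
}

fn main() {
    // Stand-in data; a real run would pass the two logit vectors.
    let run1 = vec![0.5_f32, -1.25, 3.0];
    let run2 = run1.clone();
    println!("{:?}", diff_stats(&run1, &run2));
}
```

Checking `to_bits` equality alongside the numeric diff is a design choice in this sketch: a zero numeric diff alone does not establish byte-identity when signed zeros or NaNs are present.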

What this means for shipping MODEL-1

The APR vs GGUF gap is structural, not stochastic. Both forward paths are deterministic; they produce different per-element results due to different code paths that compound over layers.

Next investigation step

Compare APR vs GGUF kernel outputs on the SAME synthetic input at layer 0 stage-by-stage. The first stage where APR and GGUF outputs differ at the per-element level (>Q4K tolerance) is the actual bug surface. Likely candidates: RoPE precision, attention softmax order, residual accumulation precision.
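The stage-by-stage comparison can be sketched as a scan for the first stage whose maximum absolute difference exceeds a tolerance. Everything here is an assumption for illustration: the function names are hypothetical, the tensors are tiny made-up values, and `tol` is a placeholder because the PR does not state a numeric Q4K tolerance. The stage names are borrowed from the stage list in the commit messages below.

```rust
/// Largest absolute element-wise difference between two tensors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "stage tensors must have equal length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0_f32, f32::max)
}

/// Returns the name of the first stage (in forward-pass order) whose
/// APR and GGUF outputs differ by more than `tol`, if any.
fn first_divergent_stage(
    stages: &[(&'static str, Vec<f32>, Vec<f32>)],
    tol: f32,
) -> Option<&'static str> {
    stages
        .iter()
        .find(|(_, apr, gguf)| max_abs_diff(apr, gguf) > tol)
        .map(|(name, _, _)| *name)
}

/// Made-up (stage, APR output, GGUF output) triples for the demo.
fn demo_stages() -> Vec<(&'static str, Vec<f32>, Vec<f32>)> {
    vec![
        ("embedding", vec![0.1, 0.2], vec![0.1, 0.2]),
        ("attn_norm", vec![1.0, -1.0], vec![1.0, -1.0]),
        ("qkv_matmul", vec![0.5, 0.25], vec![0.5, 0.250001]),
        ("q_post_rope", vec![1.0, 2.0], vec![1.0, 2.05]),
    ]
}

fn main() {
    // Placeholder tolerance, not the project's Q4K tolerance.
    let tol = 1.0e-3_f32;
    match first_divergent_stage(&demo_stages(), tol) {
        Some(stage) => println!("first divergent stage: {stage}"),
        None => println!("no stage exceeds tolerance"),
    }
}
```

Scanning in forward-pass order matters: every stage after the first divergent one inherits the error, so later mismatches say little about where the bug is.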

Once located, fix at root → SHIP-002/005/006/007/008 (5 PARTIALs) flip to DISCHARGED → MODEL-1 ships cleanly through both APR and GGUF backends → paiml/qwen2.5-coder-7b-apache-q4k-v1 complete.

Files

crates/aprender-serve/examples/diag_apr_determinism.rs — re-runnable test
evidence/ship-007-layer3-bisection-2026-04-28/diag_apr_determinism.txt — verified live on RTX 4090

Test plan

Built with --features cuda from main HEAD
Ran twice on canonical 7B teacher
Element-wise diff exactly 0.0

🤖 Generated with Claude Code

…rminism hypothesis

Per evidence/ship-007-layer3-bisection-2026-04-28/per-layer-accumulation.md, hypothesis: APR's matmul reduction may be parallel (rayon) producing non-deterministic f32 accumulation order vs GGUF's deterministic order.

Test: load canonical 7B teacher, run forward() twice with identical token input ([3838, 374, 220, 17, 10, 17, 30] for "What is 2+2?"), compare logits element-wise.

RESULT (152,064 elements):
- Non-zero diffs: 0 (0.000%)
- Max abs diff: 0.0000000000
- RMS diff: 0.0000000000

APR forward is BYTE-IDENTICAL across runs. Hypothesis FALSIFIED.

## What this means for shipping MODEL-1

The APR vs GGUF gap is STRUCTURAL, not stochastic. Both forward paths are deterministic; they just produce different per-element results due to different code paths that compound over layers.

## Next investigation step

Compare APR vs GGUF kernel outputs on the SAME synthetic input at layer 0 stage-by-stage:
- Embedding lookup
- RMSNorm output
- QKV matmul + bias
- Per-head RoPE
- Attention (Q×K, softmax, ×V)
- O-projection + residual
- Pre-FFN-norm
- Gate / Up matmul
- silu × multiply
- Down matmul + residual

The first stage where APR and GGUF outputs differ at the per-element level (>Q4K tolerance) is the actual bug surface. Likely candidates based on prior evidence: RoPE precision, attention softmax order, or residual accumulation precision.

Once located, fix at root → 5 SHIP-007 PARTIALs flip to DISCHARGED → MODEL-1 ships cleanly through both APR and GGUF backends.

Files:
- crates/aprender-serve/examples/diag_apr_determinism.rs
- evidence/ship-007-layer3-bisection-2026-04-28/diag_apr_determinism.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…r capture (unblocks SHIP-007 layer-0 bisection)

Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats (already emitted by `apr trace --payload`) are insufficient since they can hide per-element drift behind similar std values.

This contract defines the missing infrastructure: `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly.

## Stages enumerated (19 total)

embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head

## Falsification tests (8)

- 001: --save-tensor flag recognized
- 002: determinism (byte-identical across runs)
- 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099)
- 004: APRT header format self-describing
- 005: multi-stage comma-list works
- 006: NaN preservation
- 007: --layer subset compatible
- 008: pv validates

`pv validate` exits 0 (verified).

## Implementation cost

400-600 LOC + 8 tests, multi-day Rust task.

## Linkage to shipping MODEL-1

Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface.

SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
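The claim that aggregate stats can hide per-element drift is easy to see with a contrived case: two tensors holding the same values in a different order share the same mean and standard deviation yet differ at every position. A minimal sketch with made-up data and hypothetical helper names (it does not reproduce `apr trace --payload` output):

```rust
/// Population mean and standard deviation, accumulated in f64.
fn mean_std(xs: &[f32]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
    let var = xs
        .iter()
        .map(|&x| {
            let d = f64::from(x) - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    (mean, var.sqrt())
}

/// Largest absolute element-wise difference between two tensors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0_f32, f32::max)
}

fn main() {
    // Same multiset of values, reversed order: aggregates cannot tell them apart.
    let a = [1.0_f32, 2.0, 3.0, 4.0];
    let b = [4.0_f32, 3.0, 2.0, 1.0];
    println!("a: mean/std = {:?}", mean_std(&a));
    println!("b: mean/std = {:?}", mean_std(&b));
    println!("max abs diff = {}", max_abs_diff(&a, &b));
}
```

This is why the contract asks for raw per-element values rather than more summary statistics.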

…007 layer-0 bisection (ships MODEL-1) (#1102)

* feat(apr-code): add --emit-trace flag (M28: ccpa-trace.jsonl emission)

Adds `apr code --emit-trace <path>` flag. When set, after the agent loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file to `<path>` describing the run. Format mirrors the schema at https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo `ccpa measure` subcommand (M26) consumes this file to score apr-code against canonical Claude Code reference fixtures.

Records emitted:
1. session_start: synthetic UUIDv7-shaped session_id derived from the start ts; ts is a timestamp string; cwd_sha256 is a 64-char placeholder (the companion-repo differ normalizes these at compare time).
2. user_prompt: turn 0, verbatim text.
3. assistant_turn: turn 1, single Block::Text carrying the agent's final response text. Tool dispatch / hook / skill records are M29+ enrichment follow-ups.
4. session_end: real elapsed_ms + token counts from AgentLoopResult.usage (input_tokens / output_tokens). Real metadata, not stubbed.

Plumbing:
- commands_enum.rs: new `emit_trace: Option<PathBuf>` field on the Code variant.
- dispatch.rs: threads it into batuta::agent::code::cmd_code.
- code.rs cmd_code: accepts the new param + plumbs to run_single_prompt.
- code.rs run_single_prompt: captures `Instant::now()` at start; after the agent loop returns Ok(r), if the caller passed --emit-trace, calls the new emit_ccpa_trace helper. On write-failure eprintln! a warning but DO NOT fail the agent run.
- code.rs emit_ccpa_trace: new helper (~85 LOC) that hand-rolls JSONL via serde_json::json! macros (no new dependency on ccpa_trace types).
Tests (4 new in code_tests.rs::emit_trace_tests):
- emit_writes_4_jsonl_records_with_correct_kinds
- emit_carries_prompt_and_response_text
- emit_carries_token_counts_and_elapsed
- emit_each_record_has_v1_envelope (per-record back-compat invariant from the ccpa-trace v2 schema)

Total in agent::code: 50 → 54 tests passing.

Live dogfood:

    $ apr code --emit-trace /tmp/measured.jsonl \
        -p "Show me which CLAUDE.md takes precedence right now"
    $ cat /tmp/measured.jsonl | jq -r '.kind'
    session_start
    user_prompt
    assistant_turn
    session_end
    $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
    {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
     "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}

Real elapsed_ms / token counts populated correctly.

Note: the response text from Qwen3-1.7B in the dogfood was gibberish (<think>-loop pre-existing aprender concern, see PMAT-190). The trace format is correct; the model behavior is a separate workstream. The emit-trace flag works regardless of model quality.

Refs:
- paiml/claude-code-parity-apr#31 (M26: ccpa measure subcommand that consumes this file)
- paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml § trace_schema (the canonical schema)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-code): default to Qwen3-Coder-30B-A3B-Instruct on 24 GB GPUs

Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the default model for `apr code` when present. Aligned with the research write-up at paiml/claude-code-parity-apr / 2026-04-28.

What ships:

configs/aliases.yaml
+ new short name `qwen3-coder` → hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF. Now `apr pull qwen3-coder` works.

crates/aprender-registry/src/aliases.rs
+ matching entry in the in-memory AliasRegistry (kept in sync with configs/aliases.yaml).
crates/aprender-orchestrate/src/agent/manifest.rs
+ `~/.cache/pacha/models/` added to model_search_dirs so `apr pull`-cached files (content-hashed names) are visible to discovery; pair with a friendly symlink in `~/.apr/models/` for the preferred-name filter to recognize.
+ new module-level helper `is_preferred_default_model(path)`: case-insensitive substring match against a short list of recommended-default model names. Order:
  1. qwen3-coder-30b-a3b
  2. qwen3-coder-next
  3. qwen2.5-coder-32b
  4. qwen2.5-coder-14b
+ discover_model + sort_candidates updated to insert preferred-name as a sort key BETWEEN validity (still wins overall) and newest-mtime. So when a small recently-pulled model exists alongside the recommended default, the recommended default is selected, fixing the failure mode where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits gibberish) was being auto-picked over a known-good 30B model.

Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
- preferred_default_recognises_qwen3_coder_30b_a3b (any-case, any-quant matching)
- preferred_default_rejects_small_fallbacks (1.7B / 1.5B / 1.1B / 7B all rejected; the 7B Qwen2.5-Coder is still useful but we don't anchor it as the recommended-default family for 24 GB GPUs)
- sort_candidates_promotes_preferred_over_newer (preferred-name beats newer-but-smaller mtime)
- sort_candidates_newer_preferred_beats_older_preferred (within preferred-names, mtime still tiebreaks)
- sort_candidates_validity_outranks_preference (Jidoka: invalid preferred loses to valid non-preferred)

Live verification (this PR):

    $ apr pull qwen3-coder
    ✓ Downloaded successfully
      Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
      Size: 17.3 GB
    $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
        /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
    $ apr code -p "ping" --max-turns 1
    Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
      ↑ default-model preference picked correctly.
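The sort-key ordering described above (validity first, then preferred-name, then newest mtime) can be sketched as a tuple key. This is an illustration under assumptions, not the code in manifest.rs: the `Candidate` struct and the `_sketch` function names are hypothetical, mtime is a plain integer, and preference is treated as a yes/no flag rather than a rank within the list.

```rust
use std::cmp::Reverse;

/// Preferred-default name fragments, as listed in the commit message.
const PREFERRED: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

/// Case-insensitive substring match against the preferred list.
fn is_preferred_sketch(file_name: &str) -> bool {
    let lower = file_name.to_lowercase();
    PREFERRED.iter().any(|p| lower.contains(p))
}

/// Hypothetical candidate record; the real type is not shown in the PR.
#[derive(Debug, Clone)]
struct Candidate {
    name: String,
    valid: bool,
    mtime: u64, // seconds since epoch, larger = newer
}

/// Sort best-first: valid before invalid, preferred before non-preferred,
/// newer before older. `Reverse` flips each key so `true` and larger
/// mtimes sort to the front.
fn sort_candidates_sketch(cands: &mut [Candidate]) {
    cands.sort_by_key(|c| {
        (Reverse(c.valid), Reverse(is_preferred_sketch(&c.name)), Reverse(c.mtime))
    });
}

fn demo_candidates() -> Vec<Candidate> {
    vec![
        Candidate { name: "Qwen3-1.7B-Q4_K_M.gguf".into(), valid: true, mtime: 300 },
        Candidate { name: "Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf".into(), valid: true, mtime: 100 },
        Candidate { name: "Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf".into(), valid: false, mtime: 400 },
    ]
}

fn main() {
    let mut cands = demo_candidates();
    sort_candidates_sketch(&mut cands);
    for c in &cands {
        println!("{} (valid={}, mtime={})", c.name, c.valid, c.mtime);
    }
}
```

With the demo data, the older but preferred 30B model sorts ahead of the newer 1.7B model, and the invalid 32B entry sorts last despite matching the preferred list.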
Known gap (NOT addressed by this PR):

After auto-discovery picks the model, both apr-serve subprocess and embedded inference fail with:

    Error: driver error: inference failed: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found

Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.), not the dense `ffn_up.weight` the current realizar GGUF loader expects. qwen3moe architecture support is upstream realizar work, separate from this PR. The discovery / alias / preferred-name selection mechanism is fully ready for when that lands.

In the interim users hitting the inference error should fall back to a dense model: either Qwen2.5-Coder-32B-Instruct (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.

Refs:
- Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
- Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
- aprender CLAUDE.md § Claude Messages-API proxy spec (same model is already declared as the default for `apr serve anthropic`)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(p3): apr-cli-trace-save-tensor-v1 - contract for per-stage tensor capture (unblocks SHIP-007 layer-0 bisection)

Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats (already emitted by `apr trace --payload`) are insufficient since they can hide per-element drift behind similar std values.

This contract defines the missing infrastructure: `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly.
## Stages enumerated (19 total)

embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head

## Falsification tests (8)

- 001: --save-tensor flag recognized
- 002: determinism (byte-identical across runs)
- 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099)
- 004: APRT header format self-describing
- 005: multi-stage comma-list works
- 006: NaN preservation
- 007: --layer subset compatible
- 008: pv validates

`pv validate` exits 0 (verified).

## Implementation cost

400-600 LOC + 8 tests, multi-day Rust task.

## Linkage to shipping MODEL-1

Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface.

SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 28, 2026 08:47

noahgift merged commit c0b40f6 into main Apr 28, 2026
11 checks passed

noahgift deleted the docs/ship-007-determinism-falsified branch April 28, 2026 09:12

noahgift mentioned this pull request Apr 28, 2026

docs(p3): apr-cli-trace-save-tensor-v1 — contract that unblocks SHIP-007 layer-0 bisection (ships MODEL-1) #1102

Merged

1 task

noahgift merged 1 commit into main from docs/ship-007-determinism-falsified
