
docs(ship-007): determinism test FALSIFIES parallel-reduction-nondeterminism hypothesis #1101

Merged: noahgift merged 1 commit into main from docs/ship-007-determinism-falsified on Apr 28, 2026
@noahgift

Summary

Hypothesis from #1099 per-layer analysis: APR's matmul reduction may be parallel (rayon) producing non-deterministic f32 accumulation order. This PR tests it directly.
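
For context on why accumulation order was a plausible suspect: f32 addition is not associative, so a reduction that groups terms differently can return a different sum. The sketch below is illustrative only. The values and the `sum_sequential` / `sum_chunked` helpers are hypothetical and are not taken from APR's matmul.

```rust
/// Sum left to right, the order a single-threaded loop would use.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Sum fixed-size chunks first, then add the partial sums.
/// This mimics the grouping of a chunked reduction but runs on one
/// thread, so the result itself is reproducible.
fn sum_chunked(xs: &[f32], chunk: usize) -> f32 {
    xs.chunks(chunk)
        .map(|c| c.iter().fold(0.0f32, |acc, &x| acc + x))
        .fold(0.0f32, |acc, partial| acc + partial)
}

fn main() {
    // Illustrative values: 1.0 is smaller than the f32 spacing (8.0) near 1e8,
    // so it is lost whenever it is added to +/-1e8 directly.
    let xs = [1.0e8f32, 1.0, -1.0e8, 1.0];
    println!("sequential = {}", sum_sequential(&xs));
    println!("chunked(2) = {}", sum_chunked(&xs, 2));
}
```

With these inputs the sequential sum is 1.0 and the two-element chunked sum is 0.0. A reduction only becomes non-deterministic if the grouping itself changes from run to run, which is what this PR measures.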

Result

Run APR's forward() twice with identical token input on the canonical 7B teacher, then compare the logits element-wise:

| Metric | Value |
|---|---|
| Total logits | 152,064 |
| Non-zero diffs | 0 (0.000%) |
| Max abs diff | 0.0000000000 |
| RMS diff | 0.0000000000 |

APR forward is byte-identical across runs. Hypothesis FALSIFIED.
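
The three metrics in the table can be computed along these lines. This is a minimal sketch, not the code in diag_apr_determinism.rs; the `DiffStats` type and `diff_stats` function are hypothetical names introduced for illustration.

```rust
/// Element-wise comparison summary for two logit vectors.
#[derive(Debug, PartialEq)]
struct DiffStats {
    total: usize,
    nonzero: usize,
    max_abs: f32,
    rms: f64,
}

/// Compare two equally sized slices element by element.
fn diff_stats(a: &[f32], b: &[f32]) -> DiffStats {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    let mut nonzero = 0usize;
    let mut max_abs = 0.0f32;
    let mut sum_sq = 0.0f64;
    for (&x, &y) in a.iter().zip(b) {
        let d = x - y;
        if d != 0.0 {
            nonzero += 1; // a NaN difference also counts as non-zero
        }
        max_abs = max_abs.max(d.abs()); // f32::max skips NaN operands
        sum_sq += f64::from(d) * f64::from(d); // accumulate in f64
    }
    let rms = if a.is_empty() {
        0.0
    } else {
        (sum_sq / a.len() as f64).sqrt()
    };
    DiffStats { total: a.len(), nonzero, max_abs, rms }
}

fn main() {
    // Toy vectors standing in for two runs' logits.
    let run1 = [1.0f32, 2.0, 3.0];
    let run2 = [1.0f32, 2.5, 3.0];
    println!("{:?}", diff_stats(&run1, &run2));
}
```

A result with `nonzero == 0` corresponds to the all-zero row values reported above.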

What this means for shipping MODEL-1

The APR vs GGUF gap is structural, not stochastic. Both forward paths are deterministic; they produce different per-element results due to different code paths that compound over layers.

Next investigation step

Compare APR vs GGUF kernel outputs on the SAME synthetic input at layer 0 stage-by-stage. The first stage where APR and GGUF outputs differ at the per-element level (>Q4K tolerance) is the actual bug surface. Likely candidates: RoPE precision, attention softmax order, residual accumulation precision.

Once located, fix at root → SHIP-002/005/006/007/008 (5 PARTIALs) flip to DISCHARGED → MODEL-1 ships cleanly through both APR and GGUF backends → paiml/qwen2.5-coder-7b-apache-q4k-v1 complete.

Files

  • crates/aprender-serve/examples/diag_apr_determinism.rs — re-runnable test
  • evidence/ship-007-layer3-bisection-2026-04-28/diag_apr_determinism.txt — verified live on RTX 4090

Test plan

  • Built with --features cuda from main HEAD
  • Ran twice on canonical 7B teacher
  • Element-wise diff exactly 0.0

🤖 Generated with Claude Code

…rminism hypothesis

Per evidence/ship-007-layer3-bisection-2026-04-28/per-layer-accumulation.md,
hypothesis: APR's matmul reduction may be parallel (rayon) producing
non-deterministic f32 accumulation order vs GGUF's deterministic order.

Test: load canonical 7B teacher, run forward() twice with identical token
input ([3838, 374, 220, 17, 10, 17, 30] for "What is 2+2?"), compare logits
element-wise.

RESULT (152,064 elements):
- Non-zero diffs: 0 (0.000%)
- Max abs diff:   0.0000000000
- RMS diff:       0.0000000000

APR forward is BYTE-IDENTICAL across runs. Hypothesis FALSIFIED.
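
A byte-identical check is stricter than a zero numeric difference, since `0.0 == -0.0` and `NaN != NaN` under float comparison. A minimal sketch of a run-twice check that compares bit patterns follows; the `forward` closure is a hypothetical stand-in for the real model call, and only the token IDs come from this commit.

```rust
/// Index of the first element whose bit pattern differs, if any.
/// A length mismatch is reported at the end of the shorter slice.
fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter()
        .zip(b)
        .position(|(x, y)| x.to_bits() != y.to_bits())
}

/// Run `forward` twice on the same tokens and require identical bits.
fn is_deterministic<F>(mut forward: F, tokens: &[u32]) -> bool
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let first = forward(tokens);
    let second = forward(tokens);
    first_bit_mismatch(&first, &second).is_none()
}

fn main() {
    // Token IDs from the commit message ("What is 2+2?").
    let tokens = [3838u32, 374, 220, 17, 10, 17, 30];
    // Toy stand-in for the model: NOT the APR forward pass.
    let toy_forward =
        |t: &[u32]| t.iter().map(|&id| (id as f32).sqrt()).collect::<Vec<f32>>();
    println!("deterministic: {}", is_deterministic(toy_forward, &tokens));
}
```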

## What this means for shipping MODEL-1

The APR vs GGUF gap is STRUCTURAL, not stochastic. Both forward paths
are deterministic; they just produce different per-element results due
to different code paths that compound over layers.

## Next investigation step

Compare APR vs GGUF kernel outputs on the SAME synthetic input at layer
0 stage-by-stage:
- Embedding lookup
- RMSNorm output
- QKV matmul + bias
- Per-head RoPE
- Attention (Q×K, softmax, ×V)
- O-projection + residual
- Pre-FFN-norm
- Gate / Up matmul
- silu × multiply
- Down matmul + residual

The first stage where APR and GGUF outputs differ at the per-element
level (>Q4K tolerance) is the actual bug surface. Likely candidates
based on prior evidence: RoPE precision, attention softmax order, or
residual accumulation precision.
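
The bisection described above can be sketched as a scan over paired stage outputs. This is an illustration under assumptions: `StageOutput`, `first_divergent_stage`, the stage names, and the `tol` value are hypothetical, and the commit does not state a numeric Q4K tolerance.

```rust
/// Output of one forward-pass stage, captured from one backend.
struct StageOutput {
    name: &'static str,
    values: Vec<f32>,
}

/// Largest absolute element-wise difference; a NaN difference is
/// reported as infinity so it can never pass a tolerance check.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0f32, |m, (x, y)| {
        let d = (x - y).abs();
        if d.is_nan() { f32::INFINITY } else { m.max(d) }
    })
}

/// Name of the first stage (in pipeline order) where the two backends
/// disagree by more than `tol`, or where names or lengths do not match.
fn first_divergent_stage(
    apr: &[StageOutput],
    gguf: &[StageOutput],
    tol: f32,
) -> Option<&'static str> {
    apr.iter().zip(gguf).find_map(|(a, g)| {
        let mismatch = a.name != g.name || a.values.len() != g.values.len();
        if mismatch || max_abs_diff(&a.values, &g.values) > tol {
            Some(a.name)
        } else {
            None
        }
    })
}

fn main() {
    // Toy data: the two backends agree until the third stage.
    let apr = vec![
        StageOutput { name: "embedding", values: vec![0.1, 0.2] },
        StageOutput { name: "rmsnorm", values: vec![1.0, -1.0] },
        StageOutput { name: "qkv_matmul", values: vec![2.0, 0.0] },
    ];
    let gguf = vec![
        StageOutput { name: "embedding", values: vec![0.1, 0.2] },
        StageOutput { name: "rmsnorm", values: vec![1.0, -1.0] },
        StageOutput { name: "qkv_matmul", values: vec![2.5, 0.0] },
    ];
    // 0.1 is an arbitrary tolerance chosen for the example.
    println!("{:?}", first_divergent_stage(&apr, &gguf, 0.1));
}
```

Scanning in pipeline order matters: every stage after the first divergent one inherits the error, so only the earliest mismatch points at the bug surface.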

Once located, fix at root → SHIP-002/005/006/007/008 (5 PARTIALs) flip to
DISCHARGED → MODEL-1 ships cleanly through both APR and GGUF backends.

Files:
- crates/aprender-serve/examples/diag_apr_determinism.rs
- evidence/ship-007-layer3-bisection-2026-04-28/diag_apr_determinism.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 28, 2026 08:47
@noahgift noahgift merged commit c0b40f6 into main Apr 28, 2026
11 checks passed
@noahgift noahgift deleted the docs/ship-007-determinism-falsified branch April 28, 2026 09:12
noahgift added a commit that referenced this pull request Apr 28, 2026
…r capture (unblocks SHIP-007 layer-0 bisection)

Triggering observation 2026-04-28: SHIP-007's hypothesis space has been
narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101).
The remaining bug surface is per-element divergence at some specific
stage of layer-0 forward. Aggregate stats — already emitted by
`apr trace --payload` — are insufficient since they can hide per-element
drift behind similar std values.
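
A small illustration of why aggregate stats are insufficient: two tensors holding the same values in different positions have identical mean and std, yet differ at every element. The helpers below are hypothetical and the data is a toy example, not `apr trace --payload` output.

```rust
/// Population mean and standard deviation, accumulated in f64.
fn mean_std(xs: &[f32]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
    let var = xs
        .iter()
        .map(|&x| (f64::from(x) - mean).powi(2))
        .sum::<f64>()
        / n;
    (mean, var.sqrt())
}

/// Largest absolute element-wise difference.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [4.0f32, 3.0, 2.0, 1.0]; // same values, reversed positions
    println!("stats a = {:?}", mean_std(&a));
    println!("stats b = {:?}", mean_std(&b));
    println!("max abs diff = {}", max_abs_diff(&a, &b));
}
```

Both tensors report the same (mean, std) pair while the per-element difference reaches 3.0, which is the kind of drift only a value-level comparison exposes.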

This contract defines the missing infrastructure: `--save-tensor <stage>`
flag that captures raw F32 tensor values at chosen forward-pass stages,
written as APRT-magic-prefixed binaries that `apr diff --values` can
load directly.
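
To make the idea concrete, here is a sketch of one possible magic-prefixed layout. Only the `APRT` magic and the raw F32 payload come from this contract summary; the rest of the header (a little-endian u32 element count) is an assumption made for illustration and may not match the real format.

```rust
const MAGIC: &[u8; 4] = b"APRT";

/// Serialize as: magic, element count (u32 LE), then raw f32 bits (LE).
/// ASSUMED layout, for illustration only.
fn encode_tensor(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + values.len() * 4);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        // Writing the bit pattern keeps NaN payloads and -0.0 intact.
        out.extend_from_slice(&v.to_bits().to_le_bytes());
    }
    out
}

/// Parse the layout written by `encode_tensor`, validating the header.
fn decode_tensor(bytes: &[u8]) -> Result<Vec<f32>, String> {
    if bytes.len() < 8 || bytes[..4] != MAGIC[..] {
        return Err("missing APRT magic".to_string());
    }
    let n = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]) as usize;
    let body = &bytes[8..];
    if body.len() != n * 4 {
        return Err(format!(
            "expected {} payload bytes, found {}",
            n * 4,
            body.len()
        ));
    }
    Ok(body
        .chunks_exact(4)
        .map(|c| f32::from_bits(u32::from_le_bytes([c[0], c[1], c[2], c[3]])))
        .collect())
}

fn main() {
    let bytes = encode_tensor(&[1.5, -0.0, f32::NAN]);
    println!("{} bytes, decoded = {:?}", bytes.len(), decode_tensor(&bytes));
}
```

A self-describing header lets a diff tool reject truncated or foreign files before comparing values.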

## Stages enumerated (19 total)

embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope,
attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up,
ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output,
final_norm, lm_head
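
A sketch of how a comma-separated stage list might be validated against the 19 names above. The `parse_stage_list` function is hypothetical, and trimming whitespace around names is an assumption rather than documented behavior of the proposed flag.

```rust
/// The 19 stage names enumerated in the contract, in pipeline order.
const STAGES: [&str; 19] = [
    "embedding", "attn_norm", "qkv_matmul", "qkv_bias", "q_post_rope",
    "k_post_rope", "attention", "attn_out", "post_attn_residual",
    "ffn_norm", "ffn_gate", "ffn_up", "ffn_silu", "ffn_swigl", "ffn_out",
    "post_ffn_residual", "layer_output", "final_norm", "lm_head",
];

/// Split a comma list and reject any name that is not a known stage.
fn parse_stage_list(arg: &str) -> Result<Vec<&'static str>, String> {
    arg.split(',')
        .map(str::trim)
        .map(|name| {
            STAGES
                .iter()
                .copied()
                .find(|s| *s == name)
                .ok_or_else(|| format!("unknown stage: {name:?}"))
        })
        .collect()
}

fn main() {
    println!("{:?}", parse_stage_list("ffn_gate,ffn_up"));
    println!("{:?}", parse_stage_list("ffn_gate,bogus"));
}
```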

## Falsification tests (8)

- 001: --save-tensor flag recognized
- 002: determinism (byte-identical across runs)
- 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099)
- 004: APRT header format self-describing
- 005: multi-stage comma-list works
- 006: NaN preservation
- 007: --layer subset compatible
- 008: pv validates

`pv validate` exits 0 (verified).

## Implementation cost

400-600 LOC + 8 tests, multi-day Rust task.

## Linkage to shipping MODEL-1

Once shipped, the SHIP-007 layer-0 bisection completes in one debug
session: run save-tensor in both APR and GGUF formats, apr diff at each
stage, pinpoint the first divergent stage as the actual bug surface.

SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix.
With this tooling, the fix is unblocked.
paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR
and GGUF backends → MODEL-1 completes.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
…007 layer-0 bisection (ships MODEL-1) (#1102)

* feat(apr-code): add --emit-trace flag (M28 — ccpa-trace.jsonl emission)

Adds `apr code --emit-trace <path>` flag — when set, after the agent
loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file
to `<path>` describing the run.

Format mirrors the schema at
https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml
§ trace_schema. The companion-repo `ccpa measure` subcommand (M26)
consumes this file to score apr-code against canonical Claude Code
reference fixtures.

Records emitted:

  1. session_start  — synthetic UUIDv7-shaped session_id derived from
                      the start ts; ts is a timestamp string;
                      cwd_sha256 is a 64-char placeholder (the
                      companion-repo differ normalizes these at compare
                      time).
  2. user_prompt    — turn 0, verbatim text.
  3. assistant_turn — turn 1, single Block::Text carrying the agent's
                      final response text. Tool dispatch / hook /
                      skill records are M29+ enrichment follow-ups.
  4. session_end    — real elapsed_ms + token counts from
                      AgentLoopResult.usage (input_tokens /
                      output_tokens). Real metadata, not stubbed.
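
A rough sketch of emitting the four records as JSONL. The real helper uses serde_json::json!; this version uses only the standard library so it runs without dependencies. The `text` field name, the fixed `stop_reason`, and the omission of session_id / ts / cwd_sha256 from session_start are simplifying assumptions, not the actual schema.

```rust
/// Minimal JSON string escaping for the characters handled here.
fn json_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Build one JSON object per line, in the order the records are listed.
fn trace_lines(
    prompt: &str,
    response: &str,
    elapsed_ms: u64,
    tokens_in: u64,
    tokens_out: u64,
) -> Vec<String> {
    vec![
        // session_start: identifying fields omitted in this sketch.
        r#"{"v":1,"kind":"session_start"}"#.to_string(),
        format!(
            r#"{{"v":1,"kind":"user_prompt","turn":0,"text":{}}}"#,
            json_str(prompt)
        ),
        format!(
            r#"{{"v":1,"kind":"assistant_turn","turn":1,"text":{}}}"#,
            json_str(response)
        ),
        format!(
            r#"{{"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn","elapsed_ms":{},"tokens_in":{},"tokens_out":{}}}"#,
            elapsed_ms, tokens_in, tokens_out
        ),
    ]
}

fn main() {
    for line in trace_lines("ping", "pong", 3295, 44, 1024) {
        println!("{line}");
    }
}
```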

Plumbing:
  - commands_enum.rs   — new `emit_trace: Option<PathBuf>` field on
                         the Code variant.
  - dispatch.rs        — threads it into batuta::agent::code::cmd_code.
  - code.rs cmd_code   — accepts the new param + plumbs to
                         run_single_prompt.
  - code.rs run_single_prompt — captures `Instant::now()` at start;
                         after the agent loop returns Ok(r), if the
                         caller passed --emit-trace, calls the new
                         emit_ccpa_trace helper. On write-failure
                         eprintln! a warning but DO NOT fail the
                         agent run.
  - code.rs emit_ccpa_trace — new helper (~85 LOC) that hand-rolls
                         JSONL via serde_json::json! macros (no new
                         dependency on ccpa_trace types).

Tests (4 new in code_tests.rs::emit_trace_tests):
  - emit_writes_4_jsonl_records_with_correct_kinds
  - emit_carries_prompt_and_response_text
  - emit_carries_token_counts_and_elapsed
  - emit_each_record_has_v1_envelope (per-record back-compat
    invariant from the ccpa-trace v2 schema)

Total in agent::code: 50 → 54 tests passing.

Live dogfood:
  $ apr code --emit-trace /tmp/measured.jsonl \
      -p "Show me which CLAUDE.md takes precedence right now"
  $ cat /tmp/measured.jsonl | jq -r '.kind'
    session_start
    user_prompt
    assistant_turn
    session_end
  $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
    {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
     "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}

Real elapsed_ms / token counts populated correctly.

Note: the response text from Qwen3-1.7B in the dogfood was gibberish
(a pre-existing aprender <think>-loop issue, see PMAT-190). The trace
format is correct; the model behavior is a separate workstream. The
emit-trace flag works regardless of model quality.

Refs:
  - paiml/claude-code-parity-apr#31 (M26 — ccpa measure subcommand
    that consumes this file)
  - paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml
    § trace_schema (the canonical schema)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-code): default to Qwen3-Coder-30B-A3B-Instruct on 24 GB GPUs

Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the
default model for `apr code` when present. Aligned with the
research write-up at paiml/claude-code-parity-apr / 2026-04-28.

What ships:

  configs/aliases.yaml
    + new short name `qwen3-coder` →
      hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
    Now `apr pull qwen3-coder` works.

  crates/aprender-registry/src/aliases.rs
    + matching entry in the in-memory AliasRegistry
      (kept in sync with configs/aliases.yaml).

  crates/aprender-orchestrate/src/agent/manifest.rs
    + `~/.cache/pacha/models/` added to model_search_dirs so
      `apr pull`-cached files (content-hashed names) are visible
      to discovery; pair with a friendly symlink in
      `~/.apr/models/` for the preferred-name filter to recognize.
    + new module-level helper `is_preferred_default_model(path)`:
      case-insensitive substring match against a short list of
      recommended-default model names. Order:
        1. qwen3-coder-30b-a3b
        2. qwen3-coder-next
        3. qwen2.5-coder-32b
        4. qwen2.5-coder-14b
    + discover_model + sort_candidates updated to insert
      preferred-name as a sort key BETWEEN validity (still wins
      overall) and newest-mtime. So when a small recently-pulled
      model exists alongside the recommended default, the
      recommended default is selected — fixing the failure mode
      where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits
      gibberish) was being auto-picked over a known-good 30B model.
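
The three-level ordering described above (validity, then preferred name, then newest mtime) can be sketched as a tuple sort key. This is an illustration, not the code in manifest.rs: the `Candidate` struct is hypothetical, the helper takes a file name rather than a path, and preference is treated as a yes/no flag.

```rust
use std::cmp::Reverse;

/// Hypothetical discovery candidate; the real type may differ.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: String,
    valid: bool,
    mtime_secs: u64,
}

/// Recommended-default name fragments, as listed in the commit message.
const PREFERRED: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

/// Case-insensitive substring match against the preferred list.
fn is_preferred_default_model(name: &str) -> bool {
    let lower = name.to_lowercase();
    PREFERRED.iter().any(|p| lower.contains(p))
}

/// Best candidate first: valid beats invalid, then preferred beats
/// non-preferred, then newer mtime beats older.
fn sort_candidates(cands: &mut [Candidate]) {
    cands.sort_by_key(|c| {
        (
            Reverse(c.valid),
            Reverse(is_preferred_default_model(&c.name)),
            Reverse(c.mtime_secs),
        )
    });
}

fn main() {
    let mut cands = vec![
        Candidate { name: "Qwen3-1.7B.gguf".into(), valid: true, mtime_secs: 300 },
        Candidate {
            name: "Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf".into(),
            valid: true,
            mtime_secs: 100,
        },
    ];
    sort_candidates(&mut cands);
    println!("selected: {}", cands[0].name);
}
```

Placing validity first in the tuple keeps an invalid preferred file from ever outranking a valid fallback.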

Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
  - preferred_default_recognises_qwen3_coder_30b_a3b
    (any-case, any-quant matching)
  - preferred_default_rejects_small_fallbacks
    (1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is
    still useful but we don't anchor it as the recommended-default
    family for 24 GB GPUs)
  - sort_candidates_promotes_preferred_over_newer
    (preferred-name beats newer-but-smaller mtime)
  - sort_candidates_newer_preferred_beats_older_preferred
    (within preferred-names, mtime still tiebreaks)
  - sort_candidates_validity_outranks_preference
    (Jidoka — invalid preferred loses to valid non-preferred)

Live verification (this PR):

  $ apr pull qwen3-coder
    ✓ Downloaded successfully
      Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
      Size: 17.3 GB

  $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
         /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf

  $ apr code -p "ping" --max-turns 1
    Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
    ↑ default-model preference picked correctly.

Known gap (NOT addressed by this PR):

  After auto-discovery picks the model, both apr-serve subprocess
  and embedded inference fail with:

    Error: driver error: inference failed:
           Invalid shape: Tensor 'blk.0.ffn_up.weight' not found

  Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses
  per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.),
  not the dense `ffn_up.weight` the current realizar GGUF loader
  expects. qwen3moe architecture support is upstream realizar
  work — separate from this PR. The discovery / alias / preferred-
  name selection mechanism is fully ready for when that lands.

  In the interim users hitting the inference error should fall
  back to a dense model — either Qwen2.5-Coder-32B-Instruct
  (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.

Refs:
  - Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
  - Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  - aprender CLAUDE.md § Claude Messages-API proxy spec — same model
    is already declared as the default for `apr serve anthropic`

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(p3): apr-cli-trace-save-tensor-v1 — contract for per-stage tensor capture (unblocks SHIP-007 layer-0 bisection)


---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>