
docs(p3): apr-cli-trace-save-tensor-v1 — contract that unblocks SHIP-007 layer-0 bisection (ships MODEL-1)#1102

Merged: noahgift merged 3 commits into main from docs/apr-trace-save-tensor-contract on Apr 28, 2026

Conversation

@noahgift

Summary

SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses across this session (§28 matmul kernel, §28.4(a) q4k_layers populated, §31 qkv_bias values, §32 layer-3 weights byte-identical, #1101 parallel-reduction-nondeterminism). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward.

Aggregate stats, already emitted by apr trace --payload, are insufficient: two tensors can differ element by element while reporting similar std values. This contract defines the missing per-stage tensor capture infrastructure.
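
To illustrate why aggregate stats can miss per-element drift, here is a minimal sketch. The helper names `mean_std` and `max_abs_diff` are hypothetical and are not part of apr; the sketch assumes finite input values.

```rust
/// Mean and population standard deviation of a slice (hypothetical helper).
fn mean_std(xs: &[f32]) -> (f32, f32) {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    (mean, var.sqrt())
}

/// Index and size of the largest absolute per-element difference
/// (hypothetical helper). Returns None for empty input.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        let d = (x - y).abs();
        match best {
            // Keep the earlier index when the difference is not larger.
            Some((_, bd)) if bd >= d => {}
            _ => best = Some((i, d)),
        }
    }
    best
}
```

For example, `[1.0, 2.0, 3.0, 4.0]` and its reverse have identical mean and std, yet `max_abs_diff` reports a difference of 3.0 at index 0. A per-stage tensor dump exposes exactly this kind of divergence.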

Linkage to shipping MODEL-1

paiml/qwen2.5-coder-7b-apache-q4k-v1 is published but blocked on SHIP-002/005/006/007/008 (5 PARTIALs), all of which depend on the SHIP-007 fix. Once this contract's implementation lands, the layer-0 bisection can complete in one debug session: run save-tensor in both APR and GGUF formats, run apr diff at each of the 19 stages, and take the first divergent stage as the actual bug surface. Fixing that at the root flips the 5 PARTIALs to DISCHARGED, and MODEL-1 ships cleanly through both backends.

Contract structure

  • 4 equations: cli_signature, byte_format (APRT magic), determinism, apr_diff_values_compat
  • 8 falsification tests covering CLI surface, determinism, expected APR-vs-GGUF diff at ffn_gate, header format, multi-stage, NaN preservation, --layer subset, pv validation
  • 19 named stages enumerated (embedding → lm_head)

Status

PROPOSED. Implementation cost: ~400-600 LOC + 8 tests, multi-day Rust task.

Test plan

  • pv validate contracts/apr-cli-trace-save-tensor-v1.yaml exits 0 (verified live)

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 28, 2026 09:14
noahgift and others added 3 commits April 28, 2026 11:40
Adds `apr code --emit-trace <path>` flag — when set, after the agent
loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file
to `<path>` describing the run.

Format mirrors the schema at
https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml
§ trace_schema. The companion-repo `ccpa measure` subcommand (M26)
consumes this file to score apr-code against canonical Claude Code
reference fixtures.

Records emitted:

  1. session_start  — synthetic UUIDv7-shaped session_id derived from
                      the start ts; ts is a timestamp string;
                      cwd_sha256 is a 64-char placeholder (the
                      companion-repo differ normalizes these at compare
                      time).
  2. user_prompt    — turn 0, verbatim text.
  3. assistant_turn — turn 1, single Block::Text carrying the agent's
                      final response text. Tool dispatch / hook /
                      skill records are M29+ enrichment follow-ups.
  4. session_end    — real elapsed_ms + token counts from
                      AgentLoopResult.usage (input_tokens /
                      output_tokens). Real metadata, not stubbed.

Plumbing:
  - commands_enum.rs   — new `emit_trace: Option<PathBuf>` field on
                         the Code variant.
  - dispatch.rs        — threads it into batuta::agent::code::cmd_code.
  - code.rs cmd_code   — accepts the new param + plumbs to
                         run_single_prompt.
  - code.rs run_single_prompt — captures `Instant::now()` at start;
                         after the agent loop returns Ok(r), if the
                         caller passed --emit-trace, calls the new
                         emit_ccpa_trace helper. On write-failure
                         eprintln! a warning but DO NOT fail the
                         agent run.
  - code.rs emit_ccpa_trace — new helper (~85 LOC) that hand-rolls
                         JSONL via serde_json::json! macros (no new
                         dependency on ccpa_trace types).
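
The non-fatal write behavior described above can be sketched roughly as follows. This is an illustration only: `write_trace` and `finish_run` are hypothetical names, the records are passed in as pre-serialized strings, and the real code in run_single_prompt / emit_ccpa_trace may be structured differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write one JSON record per line (JSONL). Hypothetical stand-in for the
/// real helper; records are assumed to be already-serialized JSON strings.
fn write_trace(path: &Path, records: &[String]) -> io::Result<()> {
    let mut body = records.join("\n");
    body.push('\n');
    fs::write(path, body)
}

/// Return the agent's response even if the optional trace write fails.
/// Hypothetical wrapper showing the "warn, do not fail" policy.
fn finish_run(
    response: String,
    emit_trace: Option<&Path>,
    records: &[String],
) -> Result<String, String> {
    if let Some(path) = emit_trace {
        if let Err(e) = write_trace(path, records) {
            // A failed trace write is reported but never propagated.
            eprintln!("warning: failed to write trace to {}: {}", path.display(), e);
        }
    }
    Ok(response)
}
```

The point of the design is that tracing is an observability side channel: a bad `--emit-trace` path should not discard an otherwise successful agent run.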

Tests (4 new in code_tests.rs::emit_trace_tests):
  - emit_writes_4_jsonl_records_with_correct_kinds
  - emit_carries_prompt_and_response_text
  - emit_carries_token_counts_and_elapsed
  - emit_each_record_has_v1_envelope (per-record back-compat
    invariant from the ccpa-trace v2 schema)

Total in agent::code: 50 → 54 tests passing.

Live dogfood:
  $ apr code --emit-trace /tmp/measured.jsonl \
      -p "Show me which CLAUDE.md takes precedence right now"
  $ cat /tmp/measured.jsonl | jq -r '.kind'
    session_start
    user_prompt
    assistant_turn
    session_end
  $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
    {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
     "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}

Real elapsed_ms / token counts populated correctly.

Note: the response text from Qwen3-1.7B in the dogfood was gibberish
(a pre-existing <think>-loop issue in aprender, see PMAT-190). The trace
format is correct; the model behavior is a separate workstream. The
emit-trace flag works regardless of model quality.

Refs:
  - paiml/claude-code-parity-apr#31 (M26 — ccpa measure subcommand
    that consumes this file)
  - paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml
    § trace_schema (the canonical schema)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the
default model for `apr code` when present. Aligned with the
research write-up at paiml/claude-code-parity-apr / 2026-04-28.

What ships:

  configs/aliases.yaml
    + new short name `qwen3-coder` →
      hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
    Now `apr pull qwen3-coder` works.

  crates/aprender-registry/src/aliases.rs
    + matching entry in the in-memory AliasRegistry
      (kept in sync with configs/aliases.yaml).

  crates/aprender-orchestrate/src/agent/manifest.rs
    + `~/.cache/pacha/models/` added to model_search_dirs so
      `apr pull`-cached files are visible to discovery. Those files
      have content-hashed names, so the preferred-name filter only
      recognizes them through a friendly-named symlink in
      `~/.apr/models/`.
    + new module-level helper `is_preferred_default_model(path)`:
      case-insensitive substring match against a short list of
      recommended-default model names. Order:
        1. qwen3-coder-30b-a3b
        2. qwen3-coder-next
        3. qwen2.5-coder-32b
        4. qwen2.5-coder-14b
    + discover_model + sort_candidates updated to insert
      preferred-name as a sort key BETWEEN validity (still wins
      overall) and newest-mtime. So when a small recently-pulled
      model exists alongside the recommended default, the
      recommended default is selected — fixing the failure mode
      where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits
      gibberish) was being auto-picked over a known-good 30B model.
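
A rough sketch of the three-level sort key described above (validity, then preferred name, then newest mtime). The `Candidate` struct and its fields are hypothetical, and this version of `is_preferred_default_model` takes a file name string rather than a path; the real signatures in manifest.rs may differ.

```rust
use std::cmp::Reverse;

/// Recommended-default name fragments, as listed in this commit.
const PREFERRED: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

/// Case-insensitive substring match against the preferred list.
fn is_preferred_default_model(file_name: &str) -> bool {
    let lower = file_name.to_lowercase();
    PREFERRED.iter().any(|p| lower.contains(*p))
}

/// Hypothetical discovery candidate; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: String,
    valid: bool,
    mtime_secs: u64,
}

/// Sort best-first: valid before invalid, then preferred before
/// non-preferred, then newest mtime first.
fn sort_candidates(cands: &mut [Candidate]) {
    cands.sort_by_key(|c| {
        (
            Reverse(c.valid),
            Reverse(is_preferred_default_model(&c.name)),
            Reverse(c.mtime_secs),
        )
    });
}
```

With this ordering, a valid Qwen3-Coder-30B-A3B file sorts ahead of a newer Qwen3-1.7B file, while an invalid preferred file still sorts behind any valid one.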

Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
  - preferred_default_recognises_qwen3_coder_30b_a3b
    (any-case, any-quant matching)
  - preferred_default_rejects_small_fallbacks
    (1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is
    still useful but we don't anchor it as the recommended-default
    family for 24 GB GPUs)
  - sort_candidates_promotes_preferred_over_newer
    (preferred-name beats newer-but-smaller mtime)
  - sort_candidates_newer_preferred_beats_older_preferred
    (within preferred-names, mtime still tiebreaks)
  - sort_candidates_validity_outranks_preference
    (Jidoka — invalid preferred loses to valid non-preferred)

Live verification (this PR):

  $ apr pull qwen3-coder
    ✓ Downloaded successfully
      Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
      Size: 17.3 GB

  $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
         /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf

  $ apr code -p "ping" --max-turns 1
    Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
    ↑ default-model preference picked correctly.

Known gap (NOT addressed by this PR):

  After auto-discovery picks the model, both apr-serve subprocess
  and embedded inference fail with:

    Error: driver error: inference failed:
           Invalid shape: Tensor 'blk.0.ffn_up.weight' not found

  Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses
  per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.),
  not the dense `ffn_up.weight` the current realizar GGUF loader
  expects. qwen3moe architecture support is upstream realizar
  work — separate from this PR. The discovery / alias / preferred-
  name selection mechanism is fully ready for when that lands.

  In the interim users hitting the inference error should fall
  back to a dense model — either Qwen2.5-Coder-32B-Instruct
  (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.

Refs:
  - Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
  - Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  - aprender CLAUDE.md § Claude Messages-API proxy spec — same model
    is already declared as the default for `apr serve anthropic`

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r capture (unblocks SHIP-007 layer-0 bisection)

Triggering observation 2026-04-28: SHIP-007's hypothesis space has been
narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101).
The remaining bug surface is per-element divergence at some specific
stage of layer-0 forward. Aggregate stats — already emitted by
`apr trace --payload` — are insufficient since they can hide per-element
drift behind similar std values.

This contract defines the missing infrastructure: `--save-tensor <stage>`
flag that captures raw F32 tensor values at chosen forward-pass stages,
written as APRT-magic-prefixed binaries that `apr diff --values` can
load directly.
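
As a rough sketch of what an APRT-prefixed raw F32 dump could look like: the layout below (4-byte `APRT` magic, little-endian u32 element count, then little-endian f32 values) is an assumption made for illustration. The contract's byte_format equation is the authority and may specify a different header; `encode_tensor` and `decode_tensor` are hypothetical names.

```rust
/// Magic prefix named in the contract; everything after it is assumed.
const MAGIC: &[u8; 4] = b"APRT";

/// Encode values as: magic, u32 LE element count, f32 LE payload.
/// (Assumed layout, not the contract's actual header.)
fn encode_tensor(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + values.len() * 4);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        // to_le_bytes copies the raw bit pattern, so NaN payloads survive.
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Decode the assumed layout; returns None on a bad magic or length.
fn decode_tensor(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() < 8 || !bytes.starts_with(MAGIC) {
        return None;
    }
    let n = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]) as usize;
    let body = &bytes[8..];
    if body.len() != n.checked_mul(4)? {
        return None;
    }
    Some(
        body.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

Encoding is a pure function of the input values, so repeated runs over the same tensor produce byte-identical output, and comparing `to_bits()` after a round trip confirms that NaN values are preserved.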

## Stages enumerated (19 total)

embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope,
attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up,
ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output,
final_norm, lm_head
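
A minimal sketch of validating a comma-separated stage list against the names above. The function name `parse_stages`, the whitespace trimming, and the error message are assumptions; the actual CLI parsing is defined by the contract's cli_signature equation.

```rust
/// The 19 stage names enumerated in this contract, in forward-pass order.
const STAGES: [&str; 19] = [
    "embedding", "attn_norm", "qkv_matmul", "qkv_bias", "q_post_rope",
    "k_post_rope", "attention", "attn_out", "post_attn_residual",
    "ffn_norm", "ffn_gate", "ffn_up", "ffn_silu", "ffn_swigl", "ffn_out",
    "post_ffn_residual", "layer_output", "final_norm", "lm_head",
];

/// Parse a comma-separated stage list, rejecting unknown names.
/// Hypothetical helper; trimming whitespace is an assumption.
fn parse_stages(arg: &str) -> Result<Vec<&'static str>, String> {
    let mut out = Vec::new();
    for raw in arg.split(',') {
        let name = raw.trim();
        match STAGES.iter().find(|s| **s == name) {
            Some(s) => out.push(*s),
            None => return Err(format!("unknown stage: {}", name)),
        }
    }
    Ok(out)
}
```

Rejecting unknown names up front keeps a typo in a stage name from silently producing no capture.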

## Falsification tests (8)

- 001: --save-tensor flag recognized
- 002: determinism (byte-identical across runs)
- 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099)
- 004: APRT header format self-describing
- 005: multi-stage comma-list works
- 006: NaN preservation
- 007: --layer subset compatible
- 008: pv validates

`pv validate` exits 0 (verified).

## Implementation cost

400-600 LOC + 8 tests, multi-day Rust task.

## Linkage to shipping MODEL-1

Once shipped, the SHIP-007 layer-0 bisection completes in one debug
session: run save-tensor in both APR and GGUF formats, apr diff at each
stage, pinpoint the first divergent stage as the actual bug surface.
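
The bisection step can be sketched as a walk over the stages in forward order that stops at the first mismatch. The function names and the absolute tolerance parameter are assumptions for illustration; the real comparison is whatever `apr diff --values` implements.

```rust
/// True if two tensors differ in length, in NaN placement, or by more
/// than `tol` at any element. (Hypothetical comparison rule.)
fn differs(a: &[f32], b: &[f32], tol: f32) -> bool {
    if a.len() != b.len() {
        return true;
    }
    a.iter().zip(b).any(|(x, y)| {
        if x.is_nan() || y.is_nan() {
            x.is_nan() != y.is_nan()
        } else {
            (x - y).abs() > tol
        }
    })
}

/// Return the name of the first stage whose captures diverge, given
/// per-stage dumps from both backends in forward-pass order.
fn first_divergent_stage<'a>(
    apr: &[(&'a str, Vec<f32>)],
    gguf: &[(&'a str, Vec<f32>)],
    tol: f32,
) -> Option<&'a str> {
    for ((stage_a, a), (stage_b, b)) in apr.iter().zip(gguf) {
        // A stage-name mismatch means the two captures are not aligned.
        if stage_a != stage_b || differs(a, b, tol) {
            return Some(*stage_a);
        }
    }
    None
}
```

Because every later stage consumes the output of the earlier ones, only the first reported stage is informative; divergence downstream of it is expected.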

SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix.
With this tooling, the fix is unblocked. Once the fix lands,
paiml/qwen2.5-coder-7b-apache-q4k-v1 can ship cleanly through both APR
and GGUF backends → MODEL-1 completes.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/apr-trace-save-tensor-contract branch from 9174fbf to 84fe408 April 28, 2026 09:40
@noahgift noahgift merged commit 2e003ac into main Apr 28, 2026
10 checks passed
@noahgift noahgift deleted the docs/apr-trace-save-tensor-contract branch April 28, 2026 09:56