docs(p3): apr-cli-trace-save-tensor-v1 - contract that unblocks SHIP-007 layer-0 bisection (ships MODEL-1) #1102
Merged
Adds `apr code --emit-trace <path>` flag. When set, after the agent loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file to `<path>` describing the run. Format mirrors the schema at https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo `ccpa measure` subcommand (M26) consumes this file to score apr-code against canonical Claude Code reference fixtures.
Records emitted:
1. session_start: synthetic UUIDv7-shaped session_id derived from the start ts; ts is a timestamp string; cwd_sha256 is a 64-char placeholder (the companion-repo differ normalizes these at compare time).
2. user_prompt: turn 0, verbatim text.
3. assistant_turn: turn 1, single Block::Text carrying the agent's final response text. Tool dispatch / hook / skill records are M29+ enrichment follow-ups.
4. session_end: real elapsed_ms + token counts from AgentLoopResult.usage (input_tokens / output_tokens). Real metadata, not stubbed.
Plumbing:
- commands_enum.rs: new `emit_trace: Option<PathBuf>` field on the Code variant.
- dispatch.rs: threads it into batuta::agent::code::cmd_code.
- code.rs cmd_code: accepts the new param + plumbs to run_single_prompt.
- code.rs run_single_prompt: captures `Instant::now()` at start; after the agent loop returns Ok(r), if the caller passed --emit-trace, calls the new emit_ccpa_trace helper. On write-failure eprintln! a warning but DO NOT fail the agent run.
- code.rs emit_ccpa_trace: new helper (~85 LOC) that hand-rolls JSONL via serde_json::json! macros (no new dependency on ccpa_trace types).
Tests (4 new in code_tests.rs::emit_trace_tests):
- emit_writes_4_jsonl_records_with_correct_kinds
- emit_carries_prompt_and_response_text
- emit_carries_token_counts_and_elapsed
- emit_each_record_has_v1_envelope
(per-record back-compat invariant from the ccpa-trace v2 schema)
Total in agent::code: 50 → 54 tests passing.
Live dogfood:
$ apr code --emit-trace /tmp/measured.jsonl \
-p "Show me which CLAUDE.md takes precedence right now"
$ cat /tmp/measured.jsonl | jq -r '.kind'
session_start
user_prompt
assistant_turn
session_end
$ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
{"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
"elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}
Real elapsed_ms / token counts populated correctly.
Note: the response text from Qwen3-1.7B in the dogfood was gibberish (<think>-loop pre-existing aprender concern, see PMAT-190). The trace format is correct; the model behavior is a separate workstream. The emit-trace flag works regardless of model quality.
Refs:
- paiml/claude-code-parity-apr#31 (M26: ccpa measure subcommand that consumes this file)
- paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml § trace_schema (the canonical schema)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
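For orientation, the four-record layout described in this commit could be sketched with the standard library alone. This is not the emit_ccpa_trace helper from the PR (which uses serde_json::json!). The `TraceInput` struct, the `blocks` / `type` / `text` field names on assistant_turn, and the all-zero cwd_sha256 placeholder are assumptions made for illustration. Only the session_end field names are taken from the dogfood output above.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical input bundle; the real helper reads these from the agent run.
struct TraceInput<'a> {
    session_id: &'a str,
    ts: &'a str,
    prompt: &'a str,
    response: &'a str,
    elapsed_ms: u128,
    tokens_in: u64,
    tokens_out: u64,
}

/// Build the four JSONL records, each carrying the `"v":1` envelope.
fn trace_records(t: &TraceInput) -> Vec<String> {
    // Assumed placeholder: 64 zero characters standing in for the cwd hash.
    let placeholder_sha = "0".repeat(64);
    vec![
        format!(
            r#"{{"v":1,"kind":"session_start","session_id":"{}","ts":"{}","cwd_sha256":"{}"}}"#,
            json_escape(t.session_id),
            json_escape(t.ts),
            placeholder_sha
        ),
        format!(
            r#"{{"v":1,"kind":"user_prompt","turn":0,"text":"{}"}}"#,
            json_escape(t.prompt)
        ),
        format!(
            r#"{{"v":1,"kind":"assistant_turn","turn":1,"blocks":[{{"type":"text","text":"{}"}}]}}"#,
            json_escape(t.response)
        ),
        format!(
            r#"{{"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn","elapsed_ms":{},"tokens_in":{},"tokens_out":{}}}"#,
            t.elapsed_ms, t.tokens_in, t.tokens_out
        ),
    ]
}

/// Write one record per line.
fn write_trace(path: &Path, t: &TraceInput) -> io::Result<()> {
    let mut w = BufWriter::new(File::create(path)?);
    for rec in trace_records(t) {
        writeln!(w, "{rec}")?;
    }
    w.flush()
}

/// Mirror the "warn, but do not fail the agent run" rule from the commit.
fn emit_or_warn(path: &Path, t: &TraceInput) {
    if let Err(e) = write_trace(path, t) {
        eprintln!("warning: could not write trace to {}: {e}", path.display());
    }
}
```

A caller would capture `Instant::now()` before the agent loop and pass `start.elapsed().as_millis()` as `elapsed_ms`. Routing the write through `emit_or_warn` keeps a failed trace write from changing the exit status of the run.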
Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the
default model for `apr code` when present. Aligned with the
research write-up at paiml/claude-code-parity-apr / 2026-04-28.
What ships:
configs/aliases.yaml
+ new short name `qwen3-coder` →
hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Now `apr pull qwen3-coder` works.
crates/aprender-registry/src/aliases.rs
+ matching entry in the in-memory AliasRegistry
(kept in sync with configs/aliases.yaml).
crates/aprender-orchestrate/src/agent/manifest.rs
+ `~/.cache/pacha/models/` added to model_search_dirs so
`apr pull`-cached files (content-hashed names) are visible
to discovery; pair with a friendly symlink in
`~/.apr/models/` for the preferred-name filter to recognize.
+ new module-level helper `is_preferred_default_model(path)`:
case-insensitive substring match against a short list of
recommended-default model names. Order:
1. qwen3-coder-30b-a3b
2. qwen3-coder-next
3. qwen2.5-coder-32b
4. qwen2.5-coder-14b
+ discover_model + sort_candidates updated to insert
preferred-name as a sort key BETWEEN validity (still wins
overall) and newest-mtime. So when a small recently-pulled
model exists alongside the recommended default, the
recommended default is selected — fixing the failure mode
where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits
gibberish) was being auto-picked over a known-good 30B model.
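The ordering described above (validity first, then preferred-name, then newest mtime) can be sketched as a three-part sort key. This is an illustration under assumptions, not the code in manifest.rs. The `Candidate` struct is hypothetical, and the helper takes a file name as `&str` here rather than a path.

```rust
use std::cmp::Reverse;
use std::time::SystemTime;

/// Recommended-default name fragments, as listed in the commit message.
const PREFERRED: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

/// Case-insensitive substring match against the preferred list.
fn is_preferred_default_model(file_name: &str) -> bool {
    let lower = file_name.to_lowercase();
    PREFERRED.iter().any(|p| lower.contains(p))
}

/// Hypothetical discovery candidate (the real type is not shown in this PR page).
#[derive(Debug, Clone)]
struct Candidate {
    name: String,
    valid: bool,
    mtime: SystemTime,
}

/// Sort best-first: valid before invalid, preferred before non-preferred,
/// then newest mtime. `Reverse` flips each key so larger values sort first.
fn sort_candidates(candidates: &mut [Candidate]) {
    candidates.sort_by_key(|c| {
        (
            Reverse(c.valid),
            Reverse(is_preferred_default_model(&c.name)),
            Reverse(c.mtime),
        )
    });
}
```

Because the preferred check is a boolean key rather than a rank, two preferred models are still ordered by mtime, which matches the sort_candidates_newer_preferred_beats_older_preferred test listed below.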
Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
- preferred_default_recognises_qwen3_coder_30b_a3b
(any-case, any-quant matching)
- preferred_default_rejects_small_fallbacks
(1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is
still useful but we don't anchor it as the recommended-default
family for 24 GB GPUs)
- sort_candidates_promotes_preferred_over_newer
(preferred-name beats newer-but-smaller mtime)
- sort_candidates_newer_preferred_beats_older_preferred
(within preferred-names, mtime still tiebreaks)
- sort_candidates_validity_outranks_preference
(Jidoka — invalid preferred loses to valid non-preferred)
Live verification (this PR):
$ apr pull qwen3-coder
✓ Downloaded successfully
Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Size: 17.3 GB
$ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
/home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
$ apr code -p "ping" --max-turns 1
Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
↑ default-model preference picked correctly.
Known gap (NOT addressed by this PR):
After auto-discovery picks the model, both apr-serve subprocess
and embedded inference fail with:
Error: driver error: inference failed:
Invalid shape: Tensor 'blk.0.ffn_up.weight' not found
Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses
per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.),
not the dense `ffn_up.weight` the current realizar GGUF loader
expects. qwen3moe architecture support is upstream realizar
work — separate from this PR. The discovery / alias / preferred-
name selection mechanism is fully ready for when that lands.
In the interim users hitting the inference error should fall
back to a dense model — either Qwen2.5-Coder-32B-Instruct
(also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.
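One way to surface this gap earlier than the loader error would be to classify the FFN layout from tensor names before attempting inference. The sketch below is hypothetical and is not realizar's API. It assumes tensor names follow the `blk.<layer>.<name>` pattern seen in the error message, and that the expert tensors share the `ffn_up_exps` prefix quoted above.

```rust
/// Hypothetical classification of a layer's feed-forward layout.
#[derive(Debug, PartialEq)]
enum FfnLayout {
    Dense,
    MixtureOfExperts,
    Unknown,
}

/// Inspect tensor names for one layer and guess the FFN layout.
/// Assumption: dense models expose `blk.N.ffn_up.weight`, MoE models expose
/// names starting with `blk.N.ffn_up_exps`.
fn classify_ffn_layout(tensor_names: &[&str], layer: usize) -> FfnLayout {
    let dense = format!("blk.{layer}.ffn_up.weight");
    let moe_prefix = format!("blk.{layer}.ffn_up_exps");
    if tensor_names.iter().any(|n| *n == dense.as_str()) {
        FfnLayout::Dense
    } else if tensor_names.iter().any(|n| n.starts_with(moe_prefix.as_str())) {
        FfnLayout::MixtureOfExperts
    } else {
        FfnLayout::Unknown
    }
}
```

With a check like this, discovery could report "MoE architecture not yet supported" and point at a dense fallback instead of failing with a missing-tensor error.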
Refs:
- Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
- Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
- aprender CLAUDE.md § Claude Messages-API proxy spec — same model
is already declared as the default for `apr serve anthropic`
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r capture (unblocks SHIP-007 layer-0 bisection)
Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats, already emitted by `apr trace --payload`, are insufficient since they can hide per-element drift behind similar std values.
This contract defines the missing infrastructure: a `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly.
## Stages enumerated (19 total)
embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head
## Falsification tests (8)
- 001: --save-tensor flag recognized
- 002: determinism (byte-identical across runs)
- 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099)
- 004: APRT header format self-describing
- 005: multi-stage comma-list works
- 006: NaN preservation
- 007: --layer subset compatible
- 008: pv validates
`pv validate` exits 0 (verified).
## Implementation cost
400-600 LOC + 8 tests, multi-day Rust task.
## Linkage to shipping MODEL-1
Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes.
Status: PROPOSED. Implementation deferred to multi-day Rust task.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
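To make falsification tests 002 (determinism), 004 (header) and 006 (NaN preservation) concrete, here is a minimal round-trip sketch of a magic-prefixed F32 dump. The header layout used here (4-byte `APRT` magic, little-endian u32 element count, then little-endian f32 values) is an assumption made for illustration. The contract's actual APRT header is not reproduced on this page and may carry more fields, such as shape or stage name.

```rust
use std::io::{self, Read, Write};

/// Magic prefix named in the contract; everything after it is an assumed layout.
const MAGIC: &[u8; 4] = b"APRT";

/// Write magic, element count, then raw little-endian f32 values.
fn write_tensor<W: Write>(mut w: W, values: &[f32]) -> io::Result<()> {
    w.write_all(MAGIC)?;
    w.write_all(&(values.len() as u32).to_le_bytes())?;
    for v in values {
        // to_le_bytes keeps the exact bit pattern, so NaN payloads survive.
        w.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}

/// Read a tensor back, rejecting input that lacks the magic prefix.
fn read_tensor<R: Read>(mut r: R) -> io::Result<Vec<f32>> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let n = u32::from_le_bytes(len) as usize;
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    for _ in 0..n {
        r.read_exact(&mut buf)?;
        out.push(f32::from_le_bytes(buf));
    }
    Ok(out)
}
```

Comparing values with `to_bits()` rather than `==` is what makes a NaN-preservation check meaningful, since `NaN == NaN` is false for floats.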
9174fbf to 84fe408
Summary
SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses across this session (§28 matmul kernel, §28.4(a) q4k_layers populated, §31 qkv_bias values, §32 layer-3 weights byte-identical, #1101 parallel-reduction-nondeterminism). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward.
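As a small illustration of per-element divergence, the sketch below compares two tensors whose mean and standard deviation are identical while their elements differ, then walks an ordered stage list to find the first stage that diverges. The function names, the tolerance parameter, and the in-memory stage tuples are assumptions for illustration, not the behavior of `apr diff`.

```rust
/// Population mean and standard deviation of a tensor.
fn mean_std(v: &[f32]) -> (f32, f32) {
    let n = v.len() as f32;
    let mean = v.iter().sum::<f32>() / n;
    let var = v.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    (mean, var.sqrt())
}

/// Largest absolute element-wise difference.
/// Note: f32::max skips NaN, so a real differ would need to handle NaN explicitly.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// Return the first stage (in forward-pass order) whose two captures differ
/// in length or by more than `tol` in any element.
fn first_divergent_stage<'a>(
    stages: &[(&'a str, Vec<f32>, Vec<f32>)],
    tol: f32,
) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, a, b)| a.len() != b.len() || max_abs_diff(a, b) > tol)
        .map(|(name, _, _)| *name)
}
```

A reversed copy of a tensor is the simplest case: `mean_std` reports the same pair for both, while `max_abs_diff` exposes the difference, and `first_divergent_stage` names the earliest stage at which that happens.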
Aggregate stats, already emitted by `apr trace --payload`, are insufficient: they hide per-element drift behind similar std values. This contract defines the missing per-stage tensor capture infrastructure.
Linkage to shipping MODEL-1
`paiml/qwen2.5-coder-7b-apache-q4k-v1` is published but blocked on SHIP-002/005/006/007/008 (5 PARTIALs), which all depend on the SHIP-007 fix. Once this contract's implementation lands, the layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, `apr diff` at each of 19 stages, pinpoint the first divergent stage as the actual bug surface, fix at root → 5 PARTIALs flip to DISCHARGED → MODEL-1 ships cleanly through both backends.
Contract structure
Status
PROPOSED. Implementation cost: ~400-600 LOC + 8 tests, multi-day Rust task.
Test plan
`pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` exits 0 (verified live)
🤖 Generated with Claude Code