feat(apr-code): add --emit-trace flag (M28 — ccpa-trace.jsonl emission) #1100
Closed
noahgift wants to merge 2 commits into
noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request on Apr 28, 2026
Companion-side bookkeeping for the M28 upstream feature. The apr-cli feature itself lives on a separate aprender branch (feat/apr-code-emit-trace-m28, paiml/aprender#1100). This PR records the launch in the companion contract's status_history audit trail.
What landed upstream (paiml/aprender#1100):
- new `--emit-trace <path>` flag on `apr code`
- 4-record ccpa-trace.jsonl emission after every -p run (session_start + user_prompt + assistant_turn + session_end)
- real elapsed_ms + token counts from AgentLoopResult.usage
- 4 new unit tests; 50 → 54 passing in agent::code
Live dogfood (verified before this PR):
  $ apr code --emit-trace /tmp/measured.jsonl -p "..."
  → 4-line valid ccpa-trace.jsonl
  → elapsed_ms=3295, tokens_in=44, tokens_out=1024 populated correctly
Tool dispatch / hook event / skill invocation records remain M29+ enrichment follow-ups (text-only path is what M28 ships).
Contract bump v1.15.0 → v1.16.0:
- status field annotated with the M28 launch
- status_history M28 entry detailing what shipped, dogfood result, and what remains for M29+
- aprender contract-mirror at byte-identical commit 8549cdc69
- pin.lock refreshed (sha256 e979ddfd...)
Gates (all green locally):
  pv validate / pv lint               PASS
  pmat comply check (is_compliant)    true, 0 Fail, 12 advisory Warn
  cargo test --workspace              all pass (0 new tests companion-side)
  scripts/pin-check.sh                sha256 matches
  scripts/pin-check-roundtrip.sh      byte-identical to aprender@8549cdc69
Refs:
  paiml/aprender#1100 (M28 upstream PR)
  contracts/claude-code-parity-apr-v1.yaml § status_history (M28)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Adds `apr code --emit-trace <path>` flag. When set, the runtime writes a 4-record `ccpa-trace.jsonl` file to `<path>` describing the run after the agent loop completes. The format mirrors the schema at https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo `ccpa measure` subcommand (M26) consumes this file to score apr-code against canonical Claude Code reference fixtures.
Records emitted:
1. session_start: synthetic UUIDv7-shaped session_id derived from the start ts; ts is a timestamp string; cwd_sha256 is a 64-char placeholder (the companion-repo differ normalizes these at compare time).
2. user_prompt: turn 0, verbatim text.
3. assistant_turn: turn 1, single Block::Text carrying the agent's final response text. Tool dispatch / hook / skill records are M29+ enrichment follow-ups.
4. session_end: real elapsed_ms + token counts from AgentLoopResult.usage (input_tokens / output_tokens). Real metadata, not stubbed.
Plumbing:
- commands_enum.rs: new `emit_trace: Option<PathBuf>` field on the Code variant.
- dispatch.rs: threads it into batuta::agent::code::cmd_code.
- code.rs cmd_code: accepts the new param + plumbs to run_single_prompt.
- code.rs run_single_prompt: captures `Instant::now()` at start; after the agent loop returns Ok(r), if the caller passed --emit-trace, calls the new emit_ccpa_trace helper. On write failure, eprintln! a warning but DO NOT fail the agent run.
- code.rs emit_ccpa_trace: new helper (~85 LOC) that hand-rolls JSONL via serde_json::json! macros (no new dependency on ccpa_trace types).
Tests (4 new in code_tests.rs::emit_trace_tests):
- emit_writes_4_jsonl_records_with_correct_kinds
- emit_carries_prompt_and_response_text
- emit_carries_token_counts_and_elapsed
- emit_each_record_has_v1_envelope (per-record back-compat invariant from the ccpa-trace v2 schema)
Total in agent::code: 50 → 54 tests passing.
Live dogfood:
  $ apr code --emit-trace /tmp/measured.jsonl \
      -p "Show me which CLAUDE.md takes precedence right now"
  $ cat /tmp/measured.jsonl | jq -r '.kind'
  session_start
  user_prompt
  assistant_turn
  session_end
  $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
  {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
   "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}
Real elapsed_ms / token counts populated correctly.
Note: the response text from Qwen3-1.7B in the dogfood was gibberish (<think>-loop, a pre-existing aprender concern; see PMAT-190). The trace format is correct; the model behavior is a separate workstream. The emit-trace flag works regardless of model quality.
Refs:
- paiml/claude-code-parity-apr#31 (M26: ccpa measure subcommand that consumes this file)
- paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml § trace_schema (the canonical schema)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
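The four-record layout described in this commit message can be sketched roughly as follows. This is a std-only illustration, not the PR's `emit_ccpa_trace` helper (which, per the message, uses `serde_json::json!`). The `session_end` fields match the dogfood output; the field names `text` and `blocks`, the `TraceInput` struct, and both function names are assumptions made for the sketch.

```rust
use std::fmt::Write as _;

/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical input bundle; the real helper's signature is not shown in the PR text.
pub struct TraceInput<'a> {
    pub session_id: &'a str,
    pub ts: &'a str,
    pub prompt: &'a str,
    pub response: &'a str,
    pub elapsed_ms: u64,
    pub tokens_in: u64,
    pub tokens_out: u64,
}

/// Build the four JSONL lines: session_start, user_prompt, assistant_turn, session_end.
pub fn build_trace_lines(t: &TraceInput) -> Vec<String> {
    // 64-char placeholder, as the commit message describes for cwd_sha256.
    let cwd_placeholder = "0".repeat(64);
    vec![
        format!(
            r#"{{"v":1,"kind":"session_start","session_id":"{}","ts":"{}","cwd_sha256":"{}"}}"#,
            json_escape(t.session_id),
            json_escape(t.ts),
            cwd_placeholder
        ),
        // "text" is an assumed field name.
        format!(
            r#"{{"v":1,"kind":"user_prompt","turn":0,"text":"{}"}}"#,
            json_escape(t.prompt)
        ),
        // "blocks" and the block shape are assumed; the PR only says "single Block::Text".
        format!(
            r#"{{"v":1,"kind":"assistant_turn","turn":1,"blocks":[{{"type":"text","text":"{}"}}]}}"#,
            json_escape(t.response)
        ),
        format!(
            r#"{{"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn","elapsed_ms":{},"tokens_in":{},"tokens_out":{}}}"#,
            t.elapsed_ms, t.tokens_in, t.tokens_out
        ),
    ]
}
```

Joining the returned lines with `\n` and writing them to the `--emit-trace` path would give a file that `jq -r '.kind'` can walk record by record, as in the dogfood transcript.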
Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the
default model for `apr code` when present. Aligned with the
research write-up at paiml/claude-code-parity-apr / 2026-04-28.
What ships:
configs/aliases.yaml
+ new short name `qwen3-coder` →
hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Now `apr pull qwen3-coder` works.
crates/aprender-registry/src/aliases.rs
+ matching entry in the in-memory AliasRegistry
(kept in sync with configs/aliases.yaml).
crates/aprender-orchestrate/src/agent/manifest.rs
+ `~/.cache/pacha/models/` added to model_search_dirs so
`apr pull`-cached files (content-hashed names) are visible
to discovery; pair with a friendly symlink in
`~/.apr/models/` for the preferred-name filter to recognize.
+ new module-level helper `is_preferred_default_model(path)`:
case-insensitive substring match against a short list of
recommended-default model names. Order:
1. qwen3-coder-30b-a3b
2. qwen3-coder-next
3. qwen2.5-coder-32b
4. qwen2.5-coder-14b
+ discover_model + sort_candidates updated to insert
preferred-name as a sort key BETWEEN validity (still wins
overall) and newest-mtime. So when a small recently-pulled
model exists alongside the recommended default, the
recommended default is selected — fixing the failure mode
where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits
gibberish) was being auto-picked over a known-good 30B model.
Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
- preferred_default_recognises_qwen3_coder_30b_a3b
(any-case, any-quant matching)
- preferred_default_rejects_small_fallbacks
(1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is
still useful but we don't anchor it as the recommended-default
family for 24 GB GPUs)
- sort_candidates_promotes_preferred_over_newer
(preferred-name beats newer-but-smaller mtime)
- sort_candidates_newer_preferred_beats_older_preferred
(within preferred-names, mtime still tiebreaks)
- sort_candidates_validity_outranks_preference
(Jidoka — invalid preferred loses to valid non-preferred)
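A minimal sketch of the ordering described above (validity first, then preferred-name, then newest mtime), assuming a simplified `Candidate` struct. The struct, its fields, and the constant name are hypothetical; only the function names and the four substrings come from the commit message.

```rust
use std::cmp::Reverse;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Recommended-default name fragments, as listed in the commit message.
const PREFERRED_DEFAULTS: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

/// Case-insensitive substring match on the file name.
pub fn is_preferred_default_model(path: &Path) -> bool {
    let name = path
        .file_name()
        .map(|n| n.to_string_lossy().to_lowercase())
        .unwrap_or_default();
    PREFERRED_DEFAULTS.iter().any(|p| name.contains(*p))
}

/// Hypothetical candidate record; the real type in manifest.rs is not shown.
pub struct Candidate {
    pub path: PathBuf,
    pub valid: bool,
    pub mtime: SystemTime,
}

/// Sort best-first: valid before invalid, preferred before non-preferred,
/// newer before older. `Reverse` flips each key so `true` and later times lead.
pub fn sort_candidates(candidates: &mut [Candidate]) {
    candidates.sort_by_key(|c| {
        (
            Reverse(c.valid),
            Reverse(is_preferred_default_model(&c.path)),
            Reverse(c.mtime),
        )
    });
}
```

Treating preference as a boolean key (rather than ranking the four names against each other) matches the test description that mtime still breaks ties within the preferred set.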
Live verification (this PR):
$ apr pull qwen3-coder
✓ Downloaded successfully
Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Size: 17.3 GB
$ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
/home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
$ apr code -p "ping" --max-turns 1
Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
↑ default-model preference picked correctly.
Known gap (NOT addressed by this PR):
After auto-discovery picks the model, both apr-serve subprocess
and embedded inference fail with:
Error: driver error: inference failed:
Invalid shape: Tensor 'blk.0.ffn_up.weight' not found
Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses
per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.),
not the dense `ffn_up.weight` the current realizar GGUF loader
expects. qwen3moe architecture support is upstream realizar
work — separate from this PR. The discovery / alias / preferred-
name selection mechanism is fully ready for when that lands.
In the interim users hitting the inference error should fall
back to a dense model — either Qwen2.5-Coder-32B-Instruct
(also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.
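To illustrate the naming mismatch behind this error, here is a small sketch that classifies a model's FFN layout from its tensor names. It is not realizar's loader API; the enum, the function name, and the exact matching rules are assumptions, and only the name fragments (`ffn_up.weight`, `ffn_up_exps`, `ffn_gate_exps`) come from the text above.

```rust
/// Hypothetical classification of a checkpoint's feed-forward layout.
#[derive(Debug, PartialEq, Eq)]
pub enum FfnLayout {
    /// Dense FFN: tensors such as `blk.0.ffn_up.weight`.
    Dense,
    /// Mixture-of-Experts: per-expert tensors such as `ffn_up_exps`.
    MoeExperts,
    /// Neither pattern was seen.
    Unknown,
}

/// Scan tensor names and report which FFN naming scheme they follow.
/// Any per-expert tensor name marks the model as MoE.
pub fn classify_ffn_layout<'a, I>(tensor_names: I) -> FfnLayout
where
    I: IntoIterator<Item = &'a str>,
{
    let mut saw_dense = false;
    for name in tensor_names {
        if name.contains("ffn_up_exps") || name.contains("ffn_gate_exps") {
            return FfnLayout::MoeExperts;
        }
        if name.ends_with("ffn_up.weight") {
            saw_dense = true;
        }
    }
    if saw_dense {
        FfnLayout::Dense
    } else {
        FfnLayout::Unknown
    }
}
```

A loader could run a check like this up front and refuse an MoE checkpoint with a clear message instead of failing later on a missing dense tensor.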
Refs:
- Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
- Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
- aprender CLAUDE.md § Claude Messages-API proxy spec — same model
is already declared as the default for `apr serve anthropic`
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
da4ada6 to b8a6495
Comment from the PR author (Contributor):
Superseded by #1102, which landed the same emit-trace + default-model code. Closing to consolidate.
auto-merge was automatically disabled April 28, 2026 10:07 (pull request was closed)
noahgift added a commit that referenced this pull request on Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` (F-QW3-MOE-C22214-001), an integration test that invokes the user-facing `apr` binary as a subprocess and asserts:

1. exit 0
2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI / regression-prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently: there'd be no signal at the `apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ... F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
stdout (first 200B):
=== APR Run ===
Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Output:
.
Completed in 130.83s (cached)
stderr (first 200B):
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
F-QW3-MOE-C22214-001: PASS
ok
test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0, the discharge gate for FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

    F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive).

## Contract chain status

| Milestone | Deliverable | Status |
| --- | --- | --- |
| M32a | qwen3-moe-forward-v1 contract scaffold | SHIPPED (#1099) |
| M32b | arch-aware FFN load refuses qwen3_moe | SHIPPED (#1100) |
| M32c.1+ | MoE descriptor load + per-expert byte slicer | SHIPPED |
| M32c.2.2.2.1.1 | forward_qwen3_moe method | SHIPPED (#1124) |
| M32c.2.2.2.1.2 | run_qwen3_moe_generate function | SHIPPED (#1125) |
| M32c.2.2.2.1.3 | dispatch flip + Q4_K_M qtype dispatch | SHIPPED (#1126) |
| M32c.2.2.2.1.4 | live `apr run` falsifier | THIS PR |
| M32d | numerical parity vs llama.cpp | PENDING |

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
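The skip-or-assert shape of the falsifier described in this commit could look roughly like the sketch below. It is not the contents of `qwen3_moe_apr_run_live_falsifier.rs`; `Outcome`, `first_existing`, `has_non_whitespace`, and `check_run` are hypothetical names, and since the message does not show how the real test builds its `apr run` invocation, the caller supplies the command through a closure.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Result of one falsifier attempt (hypothetical type).
pub enum Outcome {
    /// No cached model was found, so nothing was executed.
    Skipped,
    /// The subprocess exited 0 and printed at least one non-whitespace char.
    Passed,
}

/// Return the first candidate path that exists as a regular file.
pub fn first_existing(candidates: &[PathBuf]) -> Option<&PathBuf> {
    candidates.iter().find(|p| p.is_file())
}

/// True when the text holds at least one non-whitespace character.
pub fn has_non_whitespace(s: &str) -> bool {
    s.chars().any(|c| !c.is_whitespace())
}

/// Skip when no model is cached; otherwise run the command and check
/// the two assertions from the commit message.
pub fn check_run(
    candidates: &[PathBuf],
    make_cmd: impl FnOnce(&Path) -> Command,
) -> Result<Outcome, String> {
    let Some(model) = first_existing(candidates) else {
        // Wording is illustrative; the real SKIP message is quoted in the commit.
        eprintln!("SKIP: no cached model at any of {candidates:?}");
        return Ok(Outcome::Skipped);
    };
    let output = make_cmd(model)
        .output()
        .map_err(|e| format!("failed to spawn subprocess: {e}"))?;
    if !output.status.success() {
        return Err(format!("non-zero exit: {}", output.status));
    }
    let stdout = String::from_utf8_lossy(&output.stdout);
    if !has_non_whitespace(&stdout) {
        return Err("stdout contained no non-whitespace characters".to_string());
    }
    Ok(Outcome::Passed)
}
```

Returning `Skipped` as a success keeps CI runners without the 17.3 GB file green, while hosts that do have it exercise the full user-facing path.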
## Summary

Adds `apr code --emit-trace <path>` flag. When set, the runtime writes a 4-record `ccpa-trace.jsonl` file describing the run after the agent loop completes.

Format mirrors the schema at paiml/claude-code-parity-apr / contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo `ccpa measure` subcommand (paiml/claude-code-parity-apr#31, M26) consumes this file to score apr-code against canonical Claude Code reference fixtures.

## Records emitted

- `session_start`
- `user_prompt`
- `assistant_turn`
- `session_end`

Tool dispatch / hook / skill records are M29+ enrichment follow-ups.

## Plumbing

- `commands_enum.rs`: new `emit_trace: Option<PathBuf>` field on the `Code` variant
- `dispatch.rs`: threads it into `batuta::agent::code::cmd_code`
- `code.rs cmd_code`: accepts the new param + plumbs to `run_single_prompt`
- `code.rs run_single_prompt`: captures `Instant::now()` at start; after the agent loop returns Ok(r), if `--emit-trace` was set, calls the new helper. Write failures `eprintln!` a warning but do NOT fail the agent run.
- `code.rs emit_ccpa_trace`: new ~85 LOC helper that hand-rolls JSONL via `serde_json::json!` macros (no new dependency on ccpa_trace types).

## Tests

4 new in `code_tests.rs::emit_trace_tests` (50 → 54 passing in `agent::code`):

- `emit_writes_4_jsonl_records_with_correct_kinds`
- `emit_carries_prompt_and_response_text`
- `emit_carries_token_counts_and_elapsed`
- `emit_each_record_has_v1_envelope` (per-record back-compat invariant from ccpa-trace v2)

## Live dogfood

Real `elapsed_ms` / token counts populated correctly. The response text was gibberish in the dogfood because Qwen3-1.7B is hitting `<think>`-loop issues (PMAT-190, a pre-existing aprender concern and a separate workstream). The trace format is correct; the model behavior is unrelated.

## What I checked

- `cargo build -p apr-cli --features code`: clean
- `cargo fmt --check`: clean on changed crates
- `cargo test -p aprender-orchestrate --lib agent::code`: 54 passing

## Why now

paiml/claude-code-parity-apr#31 (M26) ships a `ccpa measure` subcommand that drives `apr code -p` and synthesizes a student trace from stdout. That synthesis is text-only: tool dispatch, hooks, and skill invocations are invisible. M28 establishes the API contract for faithful trace emission so future M29+ enrichment can fill in tool/hook/skill records and produce a non-tautological FALSIFY-CCPA-013 discharge.

🤖 Generated with Claude Code
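The best-effort behavior noted under Plumbing (time the run, write the trace afterwards, warn on write failure without failing the run) might be sketched as below. `write_jsonl` and `run_and_trace` are hypothetical names and the closure-based shape is an assumption made to keep the example self-contained; the PR's `run_single_prompt` is not reproduced here.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::Instant;

/// Write one record per line, with a trailing newline.
pub fn write_jsonl(path: &Path, records: &[String]) -> io::Result<()> {
    let mut body = records.join("\n");
    body.push('\n');
    fs::write(path, body)
}

/// Run `run`, and on success optionally emit a trace built by `to_records`.
/// A failed trace write only logs a warning; the run's result is returned unchanged.
pub fn run_and_trace<T, E>(
    run: impl FnOnce() -> Result<T, E>,
    trace_path: Option<&Path>,
    to_records: impl FnOnce(&T, u128) -> Vec<String>,
) -> Result<T, E> {
    let started = Instant::now();
    let result = run()?; // a failed run returns early; no trace is written
    let elapsed_ms = started.elapsed().as_millis();
    if let Some(path) = trace_path {
        if let Err(e) = write_jsonl(path, &to_records(&result, elapsed_ms)) {
            eprintln!("warning: could not write trace to {}: {e}", path.display());
        }
    }
    Ok(result)
}
```

Keeping the write outside the `?` chain is what lets an unwritable path degrade to a warning rather than a failed run.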