
feat(apr-code): add --emit-trace flag (M28 — ccpa-trace.jsonl emission)#1100

Closed
noahgift wants to merge 2 commits into main from feat/apr-code-emit-trace-m28

Conversation

@noahgift

Contributor

Summary

Adds an --emit-trace <path> flag to apr code. When the flag is set, the runtime writes a 4-record ccpa-trace.jsonl file to <path> after the agent loop completes, describing the run.

Format mirrors the schema at paiml/claude-code-parity-apr / contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo ccpa measure subcommand (paiml/claude-code-parity-apr#31, M26) consumes this file to score apr-code against canonical Claude Code reference fixtures.

Records emitted

  #  kind            payload
  1  session_start   synthetic UUIDv7-shaped session_id, ts, actor=apr-code, model path, cwd_sha256 placeholder
  2  user_prompt     turn 0, verbatim text
  3  assistant_turn  turn 1, single Block::Text with agent's final response
  4  session_end     real elapsed_ms + tokens_in/tokens_out from AgentLoopResult.usage

Tool dispatch / hook / skill records are M29+ enrichment follow-ups.
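
As an illustration of the four-record layout above, here is a minimal std-only sketch. It is not the PR's emit_ccpa_trace helper (which uses serde_json::json!). The session_end field names are taken from the dogfood output in this PR; the RunSummary struct, the `blocks` shape of assistant_turn, the field names on session_start, and the all-zero cwd_sha256 placeholder are assumptions made for the example.

```rust
use std::io::{self, Write};

/// Minimal JSON string escaping, enough for this sketch.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical summary of one run; not a type from the aprender codebase.
struct RunSummary<'a> {
    session_id: &'a str,
    ts: &'a str,
    model: &'a str,
    prompt: &'a str,
    response: &'a str,
    elapsed_ms: u128,
    tokens_in: u64,
    tokens_out: u64,
}

/// Write the four records, one compact JSON object per line.
fn write_trace<W: Write>(mut w: W, r: &RunSummary) -> io::Result<()> {
    // Assumption: an all-zero 64-char string stands in for the cwd hash.
    let cwd_placeholder = "0".repeat(64);
    writeln!(
        w,
        r#"{{"v":1,"kind":"session_start","session_id":"{}","ts":"{}","actor":"apr-code","model":"{}","cwd_sha256":"{}"}}"#,
        json_escape(r.session_id),
        json_escape(r.ts),
        json_escape(r.model),
        cwd_placeholder
    )?;
    writeln!(
        w,
        r#"{{"v":1,"kind":"user_prompt","turn":0,"text":"{}"}}"#,
        json_escape(r.prompt)
    )?;
    // Assumption: a single text block is serialized as a one-element array.
    writeln!(
        w,
        r#"{{"v":1,"kind":"assistant_turn","turn":1,"blocks":[{{"type":"text","text":"{}"}}]}}"#,
        json_escape(r.response)
    )?;
    writeln!(
        w,
        r#"{{"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn","elapsed_ms":{},"tokens_in":{},"tokens_out":{}}}"#,
        r.elapsed_ms, r.tokens_in, r.tokens_out
    )?;
    Ok(())
}
```

Writing to any `Write` target (a `Vec<u8>` in tests, a `File` in production) keeps the format logic separate from the file handling.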

Plumbing

  • commands_enum.rs — new emit_trace: Option<PathBuf> field on the Code variant
  • dispatch.rs — threads it into batuta::agent::code::cmd_code
  • code.rs cmd_code — accepts the new param + plumbs to run_single_prompt
  • code.rs run_single_prompt — captures Instant::now() at start; after the agent loop returns Ok(r), if --emit-trace was set, calls the new helper. Write failures eprintln! a warning but do NOT fail the agent run.
  • code.rs emit_ccpa_trace — new ~85 LOC helper that hand-rolls JSONL via serde_json::json! macros (no new dependency on ccpa_trace types).
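
The control flow described for run_single_prompt (start a timer, run the agent, then write the trace as a best-effort step) can be sketched as follows. This is an assumption-laden illustration, not the code in this PR: `LoopResult`, `run_with_optional_trace`, and the closure-based agent are hypothetical stand-ins, and only a single abbreviated record is written.

```rust
use std::fs;
use std::path::Path;
use std::time::Instant;

/// Hypothetical stand-in for the agent loop's result type.
struct LoopResult {
    text: String,
}

/// Run the agent, then write a trace if a path was given.
/// A failed trace write only produces a warning; the agent result is still returned.
fn run_with_optional_trace<F>(
    prompt: &str,
    emit_trace: Option<&Path>,
    agent: F,
) -> Result<LoopResult, String>
where
    F: FnOnce(&str) -> Result<LoopResult, String>,
{
    // Capture the start time before the agent loop so elapsed_ms covers the whole run.
    let started = Instant::now();
    let result = agent(prompt)?;

    if let Some(path) = emit_trace {
        let elapsed_ms = started.elapsed().as_millis();
        // Abbreviated record for the sketch; the real trace has four records.
        let body = format!(
            "{{\"v\":1,\"kind\":\"session_end\",\"elapsed_ms\":{}}}\n",
            elapsed_ms
        );
        if let Err(e) = fs::write(path, body) {
            // Warn on stderr but do not turn a trace failure into a run failure.
            eprintln!("warning: could not write trace to {}: {}", path.display(), e);
        }
    }
    Ok(result)
}
```

Keeping the write failure non-fatal means a bad trace path (for example, a missing parent directory) cannot discard an otherwise successful agent response.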

Tests

4 new in code_tests.rs::emit_trace_tests (50 → 54 passing in agent::code):

  • emit_writes_4_jsonl_records_with_correct_kinds
  • emit_carries_prompt_and_response_text
  • emit_carries_token_counts_and_elapsed
  • emit_each_record_has_v1_envelope (per-record back-compat invariant from ccpa-trace v2)
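
The first and last test names suggest checks on record order and on the per-record envelope. A rough sketch of such a check, using only std string handling, might look like the following. It is not the code in code_tests.rs; `kind_of` and `check_trace` are hypothetical helpers, and the naive scan assumes compact JSON with "v" as the first key, as in the dogfood output below.

```rust
/// Extract the value of "kind" from one compact JSONL line.
/// Naive scan: assumes no whitespace around the colon and that the first
/// occurrence of the key is the top-level one.
fn kind_of(line: &str) -> Option<&str> {
    let key = "\"kind\":\"";
    let start = line.find(key)? + key.len();
    let len = line[start..].find('"')?;
    Some(&line[start..start + len])
}

/// Check that a trace has exactly the four expected kinds, in order,
/// and that every record starts with the v1 envelope marker.
fn check_trace(trace: &str) -> Result<(), String> {
    let expected = ["session_start", "user_prompt", "assistant_turn", "session_end"];
    let lines: Vec<&str> = trace.lines().filter(|l| !l.trim().is_empty()).collect();
    if lines.len() != expected.len() {
        return Err(format!("expected {} records, got {}", expected.len(), lines.len()));
    }
    for (i, (line, want)) in lines.iter().zip(expected.iter()).enumerate() {
        if !line.starts_with("{\"v\":1,") {
            return Err(format!("record {}: missing v1 envelope", i + 1));
        }
        match kind_of(line) {
            Some(k) if k == *want => {}
            other => {
                return Err(format!("record {}: expected kind {}, got {:?}", i + 1, want, other))
            }
        }
    }
    Ok(())
}
```

A real test would more likely parse each line with a JSON library; the string scan here only keeps the sketch dependency-free.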

Live dogfood

$ apr code --emit-trace /tmp/measured.jsonl \
    -p "Show me which CLAUDE.md takes precedence right now"
$ jq -r '.kind' /tmp/measured.jsonl
session_start
user_prompt
assistant_turn
session_end
$ jq 'select(.kind=="session_end")' /tmp/measured.jsonl
{"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
 "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}

elapsed_ms and the token counts are real measured values and are populated correctly. The response text in this dogfood run was gibberish because Qwen3-1.7B is hitting <think>-loop issues (PMAT-190, a pre-existing aprender concern tracked in a separate workstream). The trace format is correct; the model behavior is unrelated.

What I checked

  • cargo build -p apr-cli --features code — clean
  • cargo fmt --check — clean on changed crates
  • cargo test -p aprender-orchestrate --lib agent::code — 54 passing
  • Live dogfood produces a valid 4-record JSONL

Why now

paiml/claude-code-parity-apr#31 (M26) ships a ccpa measure subcommand that drives apr code -p and synthesizes a student trace from stdout. That synthesis is text-only — tool dispatch, hooks, and skill invocations are invisible. M28 establishes the API contract for faithful trace emission so future M29+ enrichment can fill in tool/hook/skill records and produce a non-tautological FALSIFY-CCPA-013 discharge.

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 28, 2026 08:29
noahgift added a commit that referenced this pull request Apr 28, 2026
Records the M28 launch in status_history. Apr-side feature lives
on a separate branch (feat/apr-code-emit-trace-m28, #1100).

Refs: #1100 (M28 — apr code --emit-trace)
      paiml/claude-code-parity-apr@feat/m28-record-aprender-pr

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request Apr 28, 2026
Companion-side bookkeeping for the M28 upstream feature.

The apr-cli feature itself lives on a separate aprender branch
(feat/apr-code-emit-trace-m28, paiml/aprender#1100). This PR
records the launch in the companion contract's status_history
audit trail.

What landed upstream (paiml/aprender#1100):
  - new `--emit-trace <path>` flag on `apr code`
  - 4-record ccpa-trace.jsonl emission after every -p run
    (session_start + user_prompt + assistant_turn + session_end)
  - real elapsed_ms + token counts from AgentLoopResult.usage
  - 4 new unit tests; 50 → 54 passing in agent::code

Live dogfood (verified before this PR):
  $ apr code --emit-trace /tmp/measured.jsonl -p "..."
  → 4-line valid ccpa-trace.jsonl
  → elapsed_ms=3295, tokens_in=44, tokens_out=1024 populated correctly

Tool dispatch / hook event / skill invocation records remain M29+
enrichment follow-ups (text-only path is what M28 ships).

Contract bump v1.15.0 → v1.16.0:
  - status field annotated with the M28 launch
  - status_history M28 entry detailing what shipped, dogfood result,
    and what remains for M29+
  - aprender contract-mirror at byte-identical commit 8549cdc69
  - pin.lock refreshed (sha256 e979ddfd...)

Gates (all green locally):
  pv validate / pv lint                       PASS
  pmat comply check (is_compliant)            true, 0 Fail, 12 advisory Warn
  cargo test --workspace                      all pass (0 new tests companion-side)
  scripts/pin-check.sh                        sha256 matches
  scripts/pin-check-roundtrip.sh              byte-identical to aprender@8549cdc69

Refs: paiml/aprender#1100 (M28 upstream PR)
      contracts/claude-code-parity-apr-v1.yaml § status_history (M28)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift and others added 2 commits April 28, 2026 12:00
Adds `apr code --emit-trace <path>` flag — when set, after the agent
loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file
to `<path>` describing the run.

Format mirrors the schema at
https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml
§ trace_schema. The companion-repo `ccpa measure` subcommand (M26)
consumes this file to score apr-code against canonical Claude Code
reference fixtures.

Records emitted:

  1. session_start  — synthetic UUIDv7-shaped session_id derived from
                      the start ts; ts is a timestamp string;
                      cwd_sha256 is a 64-char placeholder (the
                      companion-repo differ normalizes these at compare
                      time).
  2. user_prompt    — turn 0, verbatim text.
  3. assistant_turn — turn 1, single Block::Text carrying the agent's
                      final response text. Tool dispatch / hook /
                      skill records are M29+ enrichment follow-ups.
  4. session_end    — real elapsed_ms + token counts from
                      AgentLoopResult.usage (input_tokens /
                      output_tokens). Real metadata, not stubbed.

Plumbing:
  - commands_enum.rs   — new `emit_trace: Option<PathBuf>` field on
                         the Code variant.
  - dispatch.rs        — threads it into batuta::agent::code::cmd_code.
  - code.rs cmd_code   — accepts the new param + plumbs to
                         run_single_prompt.
  - code.rs run_single_prompt — captures `Instant::now()` at start;
                         after the agent loop returns Ok(r), if the
                         caller passed --emit-trace, calls the new
                         emit_ccpa_trace helper. On write-failure
                         eprintln! a warning but DO NOT fail the
                         agent run.
  - code.rs emit_ccpa_trace — new helper (~85 LOC) that hand-rolls
                         JSONL via serde_json::json! macros (no new
                         dependency on ccpa_trace types).

Tests (4 new in code_tests.rs::emit_trace_tests):
  - emit_writes_4_jsonl_records_with_correct_kinds
  - emit_carries_prompt_and_response_text
  - emit_carries_token_counts_and_elapsed
  - emit_each_record_has_v1_envelope (per-record back-compat
    invariant from the ccpa-trace v2 schema)

Total in agent::code: 50 → 54 tests passing.

Live dogfood:
  $ apr code --emit-trace /tmp/measured.jsonl \
      -p "Show me which CLAUDE.md takes precedence right now"
  $ cat /tmp/measured.jsonl | jq -r '.kind'
    session_start
    user_prompt
    assistant_turn
    session_end
  $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
    {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
     "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}

Real elapsed_ms / token counts populated correctly.

Note: the response text from Qwen3-1.7B in the dogfood was gibberish
(<think>-loop pre-existing aprender concern, see PMAT-190). The trace
format is correct; the model behavior is a separate workstream. The
emit-trace flag works regardless of model quality.

Refs:
  - paiml/claude-code-parity-apr#31 (M26 — ccpa measure subcommand
    that consumes this file)
  - paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml
    § trace_schema (the canonical schema)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the
default model for `apr code` when present. Aligned with the
research write-up at paiml/claude-code-parity-apr / 2026-04-28.

What ships:

  configs/aliases.yaml
    + new short name `qwen3-coder` →
      hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
    Now `apr pull qwen3-coder` works.

  crates/aprender-registry/src/aliases.rs
    + matching entry in the in-memory AliasRegistry
      (kept in sync with configs/aliases.yaml).

  crates/aprender-orchestrate/src/agent/manifest.rs
    + `~/.cache/pacha/models/` added to model_search_dirs so
      `apr pull`-cached files (content-hashed names) are visible
      to discovery; pair with a friendly symlink in
      `~/.apr/models/` for the preferred-name filter to recognize.
    + new module-level helper `is_preferred_default_model(path)`:
      case-insensitive substring match against a short list of
      recommended-default model names. Order:
        1. qwen3-coder-30b-a3b
        2. qwen3-coder-next
        3. qwen2.5-coder-32b
        4. qwen2.5-coder-14b
    + discover_model + sort_candidates updated to insert
      preferred-name as a sort key BETWEEN validity (still wins
      overall) and newest-mtime. So when a small recently-pulled
      model exists alongside the recommended default, the
      recommended default is selected — fixing the failure mode
      where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits
      gibberish) was being auto-picked over a known-good 30B model.

Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
  - preferred_default_recognises_qwen3_coder_30b_a3b
    (any-case, any-quant matching)
  - preferred_default_rejects_small_fallbacks
    (1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is
    still useful but we don't anchor it as the recommended-default
    family for 24 GB GPUs)
  - sort_candidates_promotes_preferred_over_newer
    (preferred-name beats newer-but-smaller mtime)
  - sort_candidates_newer_preferred_beats_older_preferred
    (within preferred-names, mtime still tiebreaks)
  - sort_candidates_validity_outranks_preference
    (Jidoka — invalid preferred loses to valid non-preferred)
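
The selection order described above (validity first, then preferred name, then newest mtime) can be sketched with a tuple sort key. This is an illustration under assumptions, not the code in manifest.rs: the `Candidate` struct is hypothetical, the match is applied to the file name only, and preference is treated as a boolean rather than ranked by position in the list.

```rust
use std::cmp::Reverse;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Recommended-default name fragments, as listed in this commit message.
const PREFERRED: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

/// Case-insensitive substring match against the preferred list.
/// Assumption: only the file name is inspected, not the full path.
fn is_preferred_default_model(path: &Path) -> bool {
    let name = path
        .file_name()
        .map(|n| n.to_string_lossy().to_lowercase())
        .unwrap_or_default();
    PREFERRED.iter().any(|p| name.contains(*p))
}

/// Hypothetical discovery candidate.
struct Candidate {
    path: PathBuf,
    valid: bool,
    mtime: SystemTime,
}

/// Sort best-first: valid before invalid, then preferred before
/// non-preferred, then newest mtime first.
fn sort_candidates(candidates: &mut [Candidate]) {
    candidates.sort_by_key(|c| {
        Reverse((c.valid, is_preferred_default_model(&c.path), c.mtime))
    });
}
```

Because tuples compare lexicographically, placing `valid` first guarantees an invalid preferred file never outranks a valid non-preferred one, which is the property the last test above names.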

Live verification (this PR):

  $ apr pull qwen3-coder
    ✓ Downloaded successfully
      Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
      Size: 17.3 GB

  $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
         /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf

  $ apr code -p "ping" --max-turns 1
    Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
    ↑ default-model preference picked correctly.

Known gap (NOT addressed by this PR):

  After auto-discovery picks the model, both apr-serve subprocess
  and embedded inference fail with:

    Error: driver error: inference failed:
           Invalid shape: Tensor 'blk.0.ffn_up.weight' not found

  Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses
  per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.),
  not the dense `ffn_up.weight` the current realizar GGUF loader
  expects. qwen3moe architecture support is upstream realizar
  work — separate from this PR. The discovery / alias / preferred-
  name selection mechanism is fully ready for when that lands.

  In the interim users hitting the inference error should fall
  back to a dense model — either Qwen2.5-Coder-32B-Instruct
  (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.
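
To make the dense-versus-MoE naming difference concrete, here is a small sketch that classifies a layer by which FFN tensor name is present. It is not realizar's loader. The dense name comes from the error message above; the full MoE name `blk.N.ffn_up_exps.weight` is an assumption built from the `ffn_up_exps` fragment, and `FfnLayout` / `ffn_layout` are hypothetical names.

```rust
/// FFN layout of one transformer block, judged by tensor names alone.
#[derive(Debug, PartialEq)]
enum FfnLayout {
    Dense,
    MoE,
    Unknown,
}

/// Classify a layer from the list of tensor names found in the model file.
/// Assumption: MoE tensors follow the pattern `blk.<layer>.ffn_up_exps.weight`.
fn ffn_layout(tensor_names: &[&str], layer: usize) -> FfnLayout {
    let dense = format!("blk.{layer}.ffn_up.weight");
    let moe = format!("blk.{layer}.ffn_up_exps.weight");
    if tensor_names.iter().any(|n| *n == moe) {
        FfnLayout::MoE
    } else if tensor_names.iter().any(|n| *n == dense) {
        FfnLayout::Dense
    } else {
        FfnLayout::Unknown
    }
}
```

A loader could use a check like this to report an unsupported MoE architecture up front instead of failing later on a missing dense tensor.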

Refs:
  - Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
  - Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  - aprender CLAUDE.md § Claude Messages-API proxy spec — same model
    is already declared as the default for `apr serve anthropic`

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/apr-code-emit-trace-m28 branch from da4ada6 to b8a6495 Compare April 28, 2026 10:01
@noahgift

Contributor Author

Superseded by #1102, which landed the same emit-trace + default-model code. Closing to consolidate.

@noahgift noahgift closed this Apr 28, 2026
auto-merge was automatically disabled April 28, 2026 10:07

Pull request was closed

noahgift added a commit that referenced this pull request Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` —
F-QW3-MOE-C22214-001, an integration test that invokes the user-facing
`apr` binary as a subprocess and asserts:

  1. exit 0
  2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126,
squash a902eea) in CI / regression-prevention. Without it, a
future regression that re-routed qwen3_moe back to the dense
`run_gguf_generate` path (which produces garbage on MoE weights)
would slip through CI silently — there'd be no signal at the
`apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ...
F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
  stdout (first 200B): === APR Run ===

Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf

Output:
.

Completed in 130.83s (cached)

  stderr (first 200B): [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'

F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This
test asserts ONLY emit/exit-0 — the discharge gate for
FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

  F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as
`crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs`
(M32c.2.2.2.1.1 in-process forward primitive).
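
The skip-or-run shape described above can be sketched as below. This is not the contents of qwen3_moe_apr_run_live_falsifier.rs: the helper names are hypothetical, and the caller supplies the `Command` through a closure because the exact `apr run` argument layout is not shown here.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Return the first candidate path that exists as a regular file.
fn first_existing(candidates: &[PathBuf]) -> Option<&Path> {
    candidates.iter().map(PathBuf::as_path).find(|p| p.is_file())
}

/// True if the captured stdout has at least one non-whitespace character.
fn has_non_whitespace(stdout: &[u8]) -> bool {
    String::from_utf8_lossy(stdout)
        .chars()
        .any(|c| !c.is_whitespace())
}

/// Run the live check, or skip when no cached model is present.
/// Returns Ok(true) if the check ran and passed, Ok(false) if it was skipped.
fn run_or_skip(
    candidates: &[PathBuf],
    build: impl FnOnce(&Path) -> Command,
) -> Result<bool, String> {
    let Some(model) = first_existing(candidates) else {
        // Skipping counts as success so hosts without the model stay green.
        eprintln!("SKIP: no cached GGUF at any of {:?}", candidates);
        return Ok(false);
    };
    let output = build(model).output().map_err(|e| e.to_string())?;
    if !output.status.success() {
        return Err(format!("non-zero exit: {}", output.status));
    }
    if !has_non_whitespace(&output.stdout) {
        return Err("stdout had no non-whitespace characters".to_string());
    }
    Ok(true)
}
```

One trade-off of this pattern is that a skipped run and a passing run both report success, so the printed SKIP line is the only signal that the check did not execute.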

## Contract chain status

  M32a    qwen3-moe-forward-v1 contract scaffold        SHIPPED (#1099)
  M32b    arch-aware FFN load refuses qwen3_moe          SHIPPED (#1100)
  M32c.1+ MoE descriptor load + per-expert byte slicer   SHIPPED
  M32c.2.2.2.1.1 forward_qwen3_moe method                SHIPPED (#1124)
  M32c.2.2.2.1.2 run_qwen3_moe_generate function         SHIPPED (#1125)
  M32c.2.2.2.1.3 dispatch flip + Q4_K_M qtype dispatch   SHIPPED (#1126)
  M32c.2.2.2.1.4 live `apr run` falsifier               THIS PR
  M32d           numerical parity vs llama.cpp           PENDING

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which
unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch
parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>