fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF #1819

Merged: noahgift merged 2 commits into main from fix/1789-v1-001-integration-test on May 19, 2026

@noahgift

Contributor

Summary

Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001 + V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 → v1.1.1). Empirical evidence previously came only from a direct curl smoke test; this PR pins the invariant into CI as an opt-in integration test.

What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:

  • Loads a real Qwen3-MoE GGUF (via `QWEN3_MOE_GGUF_PATH` env var)
  • Builds AppState via `with_quantized_model_and_vocab` + `with_mapped_gguf_model` (Option B path)
  • Creates the router via `realizar::api::create_router`
  • POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
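  • Asserts:
      • HTTP 200 (V1_001: dispatch returns non-error)
      • Non-empty `choices[0].message.content` (V1_001: actual generation)
      • Body does NOT contain "InvalidShape" or "matmul weight has EMPTY data buffer" (V1_003: the #1790 defensive guard did not fire, which shows the MoE path was taken, not dense)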

Gated `#[ignore]` by default. CI-safe: skips with eprintln when env var missing.
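
A minimal sketch of the gating pattern described above, not the actual test file. `resolve_model_path` and `dispatch_smoke` are hypothetical names introduced for illustration; only the `QWEN3_MOE_GGUF_PATH` variable and the `#[ignore]` + eprintln behaviour come from this PR.

```rust
use std::path::PathBuf;

/// Hypothetical helper: turn the raw env value into a usable model path.
/// An unset or empty variable is treated as "no model on this host".
fn resolve_model_path(raw: Option<String>) -> Option<PathBuf> {
    raw.filter(|s| !s.trim().is_empty()).map(PathBuf::from)
}

#[test]
#[ignore = "needs a real Qwen3-MoE GGUF; run with -- --ignored"]
fn dispatch_smoke() {
    let Some(path) = resolve_model_path(std::env::var("QWEN3_MOE_GGUF_PATH").ok()) else {
        // Returning early counts as a pass, so hosts without the model stay green.
        eprintln!("SKIP: QWEN3_MOE_GGUF_PATH not set");
        return;
    };
    // The real test would load the model and drive the router here.
    eprintln!("would load {}", path.display());
}
```

The two layers are independent: `#[ignore]` keeps the test out of a plain `cargo test`, and the early return covers the case where someone passes `--ignored` on a host without the model.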

Empirical evidence (this PR)

QWEN3_MOE_GGUF_PATH=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture

Passed in 7.84s wall:

{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}

Contract bump

v1.1.0 → v1.1.1 — V1_001 + V1_003 evidence fields updated with the new cargo command + recorded pass time; `status_history` appends v1.1.1 entry.

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains independently BLOCKED on M32d KV cache work — see paiml/claude-code-parity-apr `evidence/phase-6/30b-moe-empirical-2026-05-19.md`.

Test plan

  • `cargo check --test qwen3_moe_serve_dispatch_v1` — clean
  • `cargo test --test qwen3_moe_serve_dispatch_v1` without env var — skipped via #[ignore]
  • `QWEN3_MOE_GGUF_PATH=... cargo test ... --release -- --ignored` — PASS in 7.84s
  • CI (sovereign-ci full workflow; test stays #[ignore] on CI)

🤖 Generated with Claude Code

noahgift and others added 2 commits May 19, 2026 18:36
…" key (PMAT-698j)

THE root-cause bug behind the entire Phase 3 cuda dispatch cascade
(PMAT-698e..i, 6 prior PRs). Discovered by PMAT-698i's [FWD-CACHE]
diagnostic logging.

The `warm!` macro in pre_warm_for_model:

  macro_rules! warm {
      ($key:expr, $kernel:expr) => {{
          let ptx = $kernel.emit_ptx_for_target(&target);
          self.get_or_compile("silu_forward", &ptx)?;  // <-- HARDCODED
          count += 1;
      }};
  }

Every single `warm!()` call stored its compiled module under the
hashmap key "silu_forward", colliding on the first call:

  1. warm!("batched_rmsnorm_fwd_896", BatchedVectorizedRmsNormKernel...)
     → cache["silu_forward"] = BatchedVectorizedRmsNorm PTX
  2. warm!("gemm_forward_...", ...)
     → cache["silu_forward"] already Occupied → returns existing entry,
       new PTX silently discarded
  3-23. same — all subsequent kernels never actually pre-warm.

At runtime, every kernel looks up its real cache key:

  let key = format!("batched_rmsnorm_fwd_{hidden_size}_eps{eps_bits:08x}");
  match cache.get_cached(&key) { Some(m) => m, None => JIT }

— and cache-MISSES because the cache contains exactly one entry
under "silu_forward". JIT fires for every "pre-warmed" kernel during
the first forward pass — exactly when Blackwell sm_121's CUDA driver
crashes on cuModuleLoadData during active GPU work.

PMAT-698i's [FWD-CACHE] logging surfaced this: every kernel that was
"supposed to be pre-warmed" emitted [FWD-CACHE] Compiling at runtime,
proving the cache had nothing in it under those keys.

Fix: pass $key through to get_or_compile. One-token change
("silu_forward" → $key).

This explains the entire PMAT-698e..i cascade:
- PMAT-698e (workspace cap) — legit independent bug
- PMAT-698f (APR magic) — legit independent bug
- PMAT-698g (non-LoRA backward pre-warm) — would have been fine IF
  forward pre-warm worked; the backward kernels were correctly stored
  under their real keys (backward macro doesn't have the typo).
  Defense-in-depth, still valuable.
- PMAT-698h (rms_norm_gamma_reduce) — same defense-in-depth.
- PMAT-698j (THIS) — the root cause.

The previous PMAT-698g/h fixes are still correct (they covered backward
gaps that exist independently). This PR addresses the forward cache,
which was the dominant source of post-pre-warm JIT events.
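
The collision described in this commit can be reproduced with a toy cache. This is a sketch under stated assumptions: `ModuleCache`, `pre_warm_buggy`, and `pre_warm_fixed` are illustrative stand-ins, the "PTX" values are plain strings, and the kernel keys used in the example are made up. Only the hardcoded "silu_forward" key and the occupied-entry-wins behaviour come from the commit message.

```rust
use std::collections::HashMap;

/// Toy stand-in for the module cache: maps a cache key to compiled "PTX".
struct ModuleCache {
    modules: HashMap<String, String>,
}

impl ModuleCache {
    fn new() -> Self {
        Self { modules: HashMap::new() }
    }

    /// Returns the cached module for `key`, compiling only on a miss.
    /// An occupied entry wins, so a second PTX under the same key is dropped.
    fn get_or_compile(&mut self, key: &str, ptx: &str) -> &String {
        self.modules
            .entry(key.to_string())
            .or_insert_with(|| ptx.to_string())
    }
}

/// Buggy pre-warm: every kernel is stored under one hardcoded key.
fn pre_warm_buggy(kernels: &[(&str, &str)]) -> ModuleCache {
    let mut cache = ModuleCache::new();
    for (_key, ptx) in kernels {
        cache.get_or_compile("silu_forward", ptx);
    }
    cache
}

/// Fixed pre-warm: each kernel is stored under its own key.
fn pre_warm_fixed(kernels: &[(&str, &str)]) -> ModuleCache {
    let mut cache = ModuleCache::new();
    for (key, ptx) in kernels {
        cache.get_or_compile(key, ptx);
    }
    cache
}
```

With three kernels, the buggy variant ends up with one entry holding the first kernel's PTX, while the fixed variant holds three entries, so runtime lookups by real key hit instead of falling through to JIT.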

Test plan:
- [x] cargo check --features cuda — clean build
- [x] 366 autograd lib tests pass
- [ ] Live gx10 dispatch (post-merge) shows ZERO [FWD-CACHE] Compiling
      events post-pre-warm (all 23 forward kernels now actually cached)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001
+ V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 →
v1.1.1). Pins the chat-completions MoE dispatch invariant into CI as
an opt-in integration test.

## What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:
- Loads a real Qwen3-MoE GGUF (via QWEN3_MOE_GGUF_PATH env var)
- Builds AppState with `with_quantized_model_and_vocab` + attaches
  retained mmap via `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001: dispatch returns non-error)
  - Non-empty `choices[0].message.content` (V1_001: actual generation)
  - Body does NOT contain "InvalidShape" or "matmul weight has EMPTY
    data buffer" (V1_003: #1790 defensive guard did not fire — proves
    MoE path was taken, not dense)
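
The three assertions above can be sketched as a pure function over the status code and body. This is an illustration only: `check_dispatch` and `first_content` are hypothetical names, and the substring-based content extraction is a naive stand-in for a real JSON parser (it does not handle escaped quotes).

```rust
/// Substrings that indicate the dense-path guard fired (taken from the list above).
const FORBIDDEN: [&str; 2] = ["InvalidShape", "matmul weight has EMPTY data buffer"];

/// Naive extraction of the first `"content":"..."` value in the body.
fn first_content(body: &str) -> Option<&str> {
    let marker = "\"content\":\"";
    let start = body.find(marker)? + marker.len();
    let len = body[start..].find('"')?;
    Some(&body[start..start + len])
}

/// Ok(()) when all three checks hold, Err with the failing check otherwise.
fn check_dispatch(status: u16, body: &str) -> Result<(), String> {
    if status != 200 {
        return Err(format!("V1_001: expected HTTP 200, got {status}"));
    }
    if let Some(bad) = FORBIDDEN.iter().find(|s| body.contains(**s)) {
        return Err(format!("V1_003: body contains {bad:?}"));
    }
    match first_content(body) {
        Some(c) if !c.is_empty() => Ok(()),
        _ => Err("V1_001: empty or missing content".to_string()),
    }
}
```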

Gated `#[ignore]` by default. Activated by:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

If `QWEN3_MOE_GGUF_PATH` is unset, test prints a SKIP message and
passes — does not block CI on hosts without a real qwen3_moe GGUF.

## Empirical evidence (this PR)

Test passed against `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`
in 7.84s wall. Response body:
```json
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
```

Non-empty `content` + no `InvalidShape` → V1_001 + V1_003 cargo-test
discharged.

## Contract bump

`qwen3-moe-serve-dispatch-v1.yaml` v1.1.0 → v1.1.1:
- V1_001 evidence updated with new cargo-test command + empirical run record
- V1_003 evidence updated to same
- status_history appends v1.1.1 entry noting formal discharge

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains
BLOCKED on M32d KV cache work — independent contract gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 207dfde into main May 19, 2026
11 checks passed
@noahgift noahgift deleted the fix/1789-v1-001-integration-test branch May 19, 2026 17:15
noahgift added a commit that referenced this pull request May 20, 2026
Implements M32d KV cache support on the qwen3_moe inference path.
Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on
  every per-turn budget — 5 timeout-class dispatches recorded in
  paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-
  2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation
  (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-
  prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms;
  ~1.8× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still
  passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache`
in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`.
Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:
~165) step-for-step EXCEPT at the FFN block, where it calls
`moe_ffn_forward_layer` (router → top-k expert select → per-expert
SwiGLU → weighted sum → down projection) instead of the dense
gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm,
RoPE at `position`, GQA-aware cached attention, output projection,
residual) is byte-identical to the dense reference.
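
For readers unfamiliar with the MoE FFN step, the routing part (router, top-k select, weighted sum) can be sketched as below. This is not `moe_ffn_forward_layer`: the function names are hypothetical, the toy experts `double` and `negate` stand in for per-expert SwiGLU blocks, and renormalizing the top-k weights so they sum to 1 is an assumption about the routing convention.

```rust
/// Numerically stable softmax over router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Indices and renormalized weights of the `k` highest-probability experts.
fn top_k(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx.truncate(k);
    let total: f32 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / total)).collect()
}

// Toy experts standing in for per-expert SwiGLU blocks.
fn double(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * 2.0).collect()
}
fn negate(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| -v).collect()
}

/// Weighted sum of the selected experts' outputs for one token.
fn moe_ffn(
    hidden: &[f32],
    router_logits: &[f32],
    k: usize,
    experts: &[fn(&[f32]) -> Vec<f32>],
) -> Vec<f32> {
    let mut out = vec![0.0; hidden.len()];
    for (e, w) in top_k(&softmax(router_logits), k) {
        for (o, y) in out.iter_mut().zip(experts[e](hidden)) {
            *o += w * y;
        }
    }
    out
}
```

Because the router only looks at the current token's hidden state, nothing in this step depends on earlier positions, which is why it can be dropped into a cache-aware forward pass unchanged.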

**Generate loop rewrite**: `crates/aprender-serve/src/infer/
qwen3_moe_generate.rs::run_qwen3_moe_generate` now:
  1. Allocates `OwnedQuantizedKVCache` sized to
     `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
  2. Prefill: per prompt token, calls
     `forward_single_qwen3_moe_with_cache` (cache fills incrementally;
     final iteration's logits seed decode)
  3. Decode: greedy-argmax → append → next cache-aware forward
  4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full
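
The four steps above can be sketched with a toy model. Everything here is illustrative: `ToyCache`, `forward_with_cache`, and `generate` are hypothetical stand-ins, the "logits" are a deterministic function of the cached tokens, and dropping the stop token from the output is an assumption. Only the capacity formula and the prefill/decode/stop structure come from the description.

```rust
/// Toy per-token cache: one entry per processed position.
struct ToyCache {
    tokens: Vec<u32>,
    capacity: usize,
}

impl ToyCache {
    fn is_full(&self) -> bool {
        self.tokens.len() >= self.capacity
    }
}

/// Capacity rule from step 1: max(context_length, prompt_len + max_tokens + 8).
fn cache_capacity(context_length: usize, prompt_len: usize, max_tokens: usize) -> usize {
    context_length.max(prompt_len + max_tokens + 8)
}

/// Toy cache-aware forward: appends the token, returns one-hot logits over 8 ids.
fn forward_with_cache(token: u32, position: usize, cache: &mut ToyCache) -> Vec<f32> {
    cache.tokens.push(token);
    let vocab = 8;
    let hot = (cache.tokens.iter().sum::<u32>() as usize + position) % vocab;
    (0..vocab).map(|i| if i == hot { 1.0 } else { 0.0 }).collect()
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, v) in logits.iter().enumerate() {
        if *v > logits[best] {
            best = i;
        }
    }
    best as u32
}

fn generate(prompt: &[u32], max_tokens: usize, stop_tokens: &[u32], context_length: usize) -> Vec<u32> {
    let mut cache = ToyCache {
        tokens: Vec::new(),
        capacity: cache_capacity(context_length, prompt.len(), max_tokens),
    };
    // Prefill: one forward per prompt token; the last logits seed decode.
    let mut logits: Vec<f32> = Vec::new();
    for (pos, &tok) in prompt.iter().enumerate() {
        logits = forward_with_cache(tok, pos, &mut cache);
    }
    let mut out = Vec::new();
    if logits.is_empty() {
        return out; // empty prompt: nothing to seed decode
    }
    // Decode: greedy argmax, append, next cache-aware forward.
    while out.len() < max_tokens {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break;
        }
        out.push(next);
        if out.len() == max_tokens || cache.is_full() {
            break;
        }
        logits = forward_with_cache(next, prompt.len() + out.len() - 1, &mut cache);
    }
    out
}
```

Each decode step runs exactly one forward over one token, which is where the speedup over re-running the full prefix per token comes from.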

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs`
bumped to `pub(crate)` so the MoE function can reuse the dense final-
norm + LM head path unchanged. Same edit applied to the orphan
`debug.rs` duplicate for hygiene (it's not in the build graph but mirrors
ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` —
  Generates 4 tokens via M32d cache-on path AND a legacy full-prefill
  loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms
  perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs` —
  Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s.
  Floor pinned via `M32D_TPS_FLOOR` constant. Catches future
  KV-cache regressions.
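
A sketch of how a throughput floor like the one in `m32d_perf.rs` can be checked. The constant name `M32D_TPS_FLOOR` and the 5 tok/s value come from this commit message; its `f64` type and the helper functions `tokens_per_second` and `check_throughput` are assumptions made for illustration.

```rust
use std::time::Duration;

/// Throughput floor in tokens per second (type and definition are assumed).
const M32D_TPS_FLOOR: f64 = 5.0;

/// Sustained throughput for `tokens` generated over `elapsed` wall time.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

/// Ok(tps) when the run meets the floor, Err with a readable message otherwise.
fn check_throughput(tokens: usize, elapsed: Duration) -> Result<f64, String> {
    let tps = tokens_per_second(tokens, elapsed);
    if tps >= M32D_TPS_FLOOR {
        Ok(tps)
    } else {
        Err(format!("{tps:.2} tok/s is below the {M32D_TPS_FLOOR} tok/s floor"))
    }
}
```

At the reported 9.62 tok/s, 32 tokens take roughly 3.3 s, so the floor leaves about a 2× margin before the test fails.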

Activation:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md`
were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale
   logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache`
   not touched (only its sibling `single_cache_final_output` visibility
   bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed
   to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as
   dense reference); first-token edge case (empty cache) explicitly
   handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected — router reads
   from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into
   the chat handler (separate follow-up contract).
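
Item 4 is easier to follow with a concrete head layout. The sketch below assumes the usual GQA convention (consecutive Q heads share one KV head) and uses hypothetical names `AttnConfig`, `kv_head_for`, and `expand_v_first_token`; only `kv_dim()` and the "expand V across Q heads on an empty cache" behaviour are taken from the text.

```rust
/// Minimal attention shape config (field names are illustrative).
struct AttnConfig {
    num_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
}

impl AttnConfig {
    /// Width of one K or V vector: KV heads times per-head dimension.
    fn kv_dim(&self) -> usize {
        self.num_kv_heads * self.head_dim
    }

    /// KV head serving a given Q head, assuming consecutive Q heads form a group.
    fn kv_head_for(&self, q_head: usize) -> usize {
        q_head / (self.num_heads / self.num_kv_heads)
    }
}

/// First-token case: with a single position the softmax weight is 1, so each
/// Q head's attention output is simply its group's slice of V.
fn expand_v_first_token(cfg: &AttnConfig, v: &[f32]) -> Vec<f32> {
    assert_eq!(v.len(), cfg.kv_dim());
    let mut out = Vec::with_capacity(cfg.num_heads * cfg.head_dim);
    for q in 0..cfg.num_heads {
        let kv = cfg.kv_head_for(q);
        out.extend_from_slice(&v[kv * cfg.head_dim..(kv + 1) * cfg.head_dim]);
    }
    out
}
```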

## Contract bump

v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d
  shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s
sustained. Acceptance: `evidence/under-contract/scores.json` with
`student_pass_rate > 0` discharges V1_004 + lifts CCPA M280
suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status
formalization) — operator flipped from Option (b) to Option (a)
in-session; this PR delivers the actual implementation. #1829 can be
closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
… C-prep) (#1840)

* feat(M32d): KV cache for qwen3_moe inference path — 19× speedup

* chore(distill): add DATASET_DIR env var → --dataset flag (Phase 4 Stage C-prep)

Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag through
the gx10 dispatch script. When `DATASET_DIR` is set, the script passes
the directory to apr distill, which drives training from real-corpus
.bin shards via ShardBatchSource. When unset (default), the pipeline
falls back to SyntheticBatchSource (Phase 3 smoke semantics).

Preamble now surfaces which mode the dispatch is in:
  dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
  dataset: (synthetic — Phase 3 smoke semantics)

Validates the directory exists on gx10 before launching apr distill;
fails fast with a clear message otherwise.

Usage:
  STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
    bash scripts/dispatch-distill-phase-3-gx10.sh

Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset`
exists on the rebuilt gx10 binary. With #1839 unmerged, the script's
invocation fails with a clap error.

Phase 4 ladder progress:
  Stage A (#1833)                 ✅ MERGED + verified
  Stage B-1 (#1836)               ✅ MERGED
  Stage B-2 (#1839)               🟡 in CI
  Stage C-prep (THIS)             dispatch script + 10 pre-staged shards
  Stage C                          run on gx10 with --dataset
  Stage D                          50K-step Phase 4 dispatch
  Stage E                          HumanEval pass@1
  Stage F                          publish v2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>