fix(aprender-serve): prepare_tokens_apr — no chat-wrap on base models (closes #1623 part 2)#1658
Merged
Merged
Conversation
…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)
§74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's
stage-bisection scaffold (CPU vs GPU per-stage statistics analysis).
The F32 GEMV PTX kernel was reading weights with TRANSPOSED layout
interpretation:
Bug: kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j),
but actual ML weights are stored [output_dim=N, input_dim=K]
row-major (A[i,j] at i*K+j per PyTorch/SafeTensors/GGUF convention
and PMAT-333 F32 dequantization output).
Symptom: GPU read transposed weights → computed y = A^T @ x instead
of y = A @ x → systematically anti-correlated logits
(cos=-0.005190 vs CPU, top-10 divergences all sign-flipped,
CPU mean=-2.42 vs GPU mean=0.013).
Fix: rewrite the inner loop to iterate along the K dimension within
row block_id:
row_base = a_ptr + block_id * K * 4
thread reads A[block_id, t], A[block_id, t+32], ...
instead of:
col_base = a_ptr + block_id * 4
thread reads A[t, block_id], A[t+32, block_id], ...
Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090,
default graphed path):
PARITY-GATE: PASS (no error from forward_gpu_resident)
Throughput @ 128-tok 5-iter decode: 124.6 tok/s
AC-SHIP1-007 floor: 30 tok/s
Headroom: 4.15× over floor
TTFT: 8.39 ms
p50 latency: 1016 ms
Before PR-E:
PARITY-GATE FAILED cos=-0.005190
Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73)
GPU CANNOT serve this model
After PR-E:
PARITY-GATE PASS, default path, NO workarounds
124.6 tok/s, 4.15× over floor
Ship-% impact:
MODEL-1 ship %: **99% → 100%**
10 of 10 AC-SHIP1-* LIVE-DISCHARGED:
SHIP-001 (§72) SHIP-002 (§61) SHIP-003 (§72)
SHIP-004 (§72) SHIP-005 (§71) SHIP-006 (§61.8)
SHIP-007 (this PR) SHIP-008 (§61) SHIP-009 (§72)
SHIP-010 (§72)
MODEL-2 ship %: unchanged at 57% (independent track).
Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649)
→ §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's
'3-5 PR / 3-5 day' estimate.
Auxiliary change: logits.rs adds APR_LM_HEAD_FORCE_QTYPE env-var
probe kept as a diagnostic tool (zero behavior change when unset).
Test plan:
- [x] cargo build --release -p apr-cli --bin apr --features cuda → clean
- [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true
- [x] apr parity → PARITY-GATE PASS
- [ ] CI tests (workspace-test on per-PR runner)
Refs:
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract)
- PR #1649 (PR-B GPU stage dump scaffold)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…07 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)
The env-var bisection probe added in PR-E (this branch) introduced a
`_ =>` catch-all inside a `match` expression that referenced
`WeightQuantType` in its arm values. The `falsify_007_no_catch_all_
in_dispatch_sites` contract test's 30-line walk-back heuristic flagged
this as a violation, even though the match was on `&str` (env var
value), not on `WeightQuantType`.
The probe was a bisection tool used to identify the bug location
during §74. Now that §75 has shipped the actual fix and the probe is
no longer needed, removing it cleans up the contract violation.
The remaining PR-E change is solely the F32 GEMV PTX kernel layout
fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs` — that's the
actual bug fix.
Test verified:
cargo test -p aprender-serve --lib \
quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites
→ 1 passed
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…closes #1623 part 2) `prepare_tokens_apr` was auto-wrapping ALL APR models with a chat template when the model: - had a known architecture (qwen2 / llama / mistral / phi), OR - had `<|im_start|>` in vocab (ChatML special tokens), OR - had `instruct` in filename That's too broad. Base completion models like qwen2.5-coder-0.5b (base, not instruct) carry the Qwen tokenizer — which includes ChatML special tokens in vocab — but should NOT be chat-wrapped. The over-trigger produced garbage-looking output for base models. Fix mirrors the GGUF path (GH-278): only wrap when the model actually has a `tokenizer.chat_template` in metadata, OR when filename hints `instruct` / `-chat`. Architecture and vocab-token heuristics removed. Reported in #1623 (the Coursera capstone investigation) — confirmed `apr run ... '2+2=' --temperature 0 --no-gpu` produces coherent output on base qwen2.5-coder-0.5b after this fix. All 6 prepare_tokens tests still pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the second half of #1623. `prepare_tokens_apr` was over-triggering chat-template wrapping on base completion models because the detection logic OR'd three signals — known architecture, ChatML tokens in vocab, filename hint — and base models that share a tokenizer with their instruct sibling (e.g. `qwen2.5-coder-0.5b` base) inherited `<|im_start|>` in vocab and got incorrectly wrapped.
The bug
```rust
let is_instruct_arch = matches!(apr_arch.to_lowercase().as_str(),
"qwen2" | "qwen" | "llama" | "mistral" | "phi" | "phi3"
);
let has_chatml = vocab.iter().any(|t| t == "<|im_start|>");
let filename_instruct = model_name.to_lowercase().contains("instruct");
let is_instruct = is_instruct_arch || has_chatml_tokens || filename_instruct; // ← too broad
```
`apr run qwen2.5-coder-0.5b.apr '2+2=' --temperature 0 --no-gpu` produced garbage instead of `4`.
Fix
Mirror the GGUF path (GH-278): read the actual `tokenizer.chat_template` from APR metadata. Drop the architecture/vocab-token heuristics. Keep filename as a fallback signal.
```rust
let has_chat_template = meta.extra
.get("tokenizer.chat_template")
.and_then(|v| v.as_str())
.is_some_and(|s| !s.is_empty());
let filename_instruct = model_name.to_lowercase().contains("instruct")
|| model_name.to_lowercase().contains("-chat");
let is_instruct = has_chat_template || filename_instruct;
```
PMAT-237 reconciliation
The previous comment said "Detect instruct from MODEL DATA, not filename — filename heuristic silently skips chat template for hash-named APR files." That intent is preserved: if a hash-named APR has a real `tokenizer.chat_template` in its metadata, it gets wrapped. The change is that vocab-presence ("contains `<|im_start|>`") is no longer treated as instruct-evidence, because base models share tokenizers with their instruct siblings.
Test plan
Reporter
#1623 from a Coursera ML capstone — confirmed fix on base `qwen2.5-coder-0.5b`.
🤖 Generated with Claude Code