Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559

## Summary

`apr run` and `apr qa` on `qwen2.5-coder-7b-instruct-q4_k_m.gguf` produce gibberish output via every GPU path, while CPU produces the correct answer. This is the same defect class as the previously-closed [#374](https://github.com/paiml/aprender/issues/374) (Q4K CUDA kernel, hidden_dim=3584) and [#559](https://github.com/paiml/aprender/issues/559) (GPU parity cosine=-0.005 on Qwen2.5-7B GQA hidden=3584, heads=28, kv=4), surfaced during v0.35.0 release dogfooding.

## Reproducer 1: `apr run` (wgpu/Vulkan, RTX 4090)

```
$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf "What is 2+2?" --max-tokens 16
Backend: wgpu (Vulkan)
[PMAT-333] Dequantizing 28 layers (hidden=3584, heads=28/4, intermediate=18944)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[wgpu] Skipping weight 'lm_head' (2180.0 MB > 2147.5 MB limit) — CPU fallback

Output:
ampiezza

ampiezza

Completed in 40.49s (cached)
exit=0
```

## Reproducer 2: `apr qa` (cuBLAS / FP8 path)

```
$ apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf
✓ PASS Capability Match
✓ PASS Tensor Contract  (339 tensors, all PMAT-235 gates)
✓ PASS Metadata Plausibility  (arch=qwen2, rope_theta=1000000, max_pos=131072)
[GH-129] Early kernel preload: 49 modules compiled
[PMAT-082] cuBLASLt FP8 JIT warmed (3584×16×3584)
[PMAT-053] FP8 weight cache: 197 matrices cached (6742.8 MB) in 212.7ms
✗ FAIL Golden Output  GPU output failed (CPU passed): golden_output_gpu: gibberish (fragment "<|im_start|>" repeats 3+ times)
```

## Reproducer 3: 1.5B Q4K APR works (safety net firing)

```
$ apr qa /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr
[GH-480] F2 validation FAILED — falling back to CPU
[GH-480] F2 validation FAILED — falling back to CPU
✓ PASS Golden Output  (2 golden test cases passed)
```

1.5B model: F2 validation rejects the GPU output, CPU fallback runs, test passes.
7B model: NO F2 validation rejection — GPU result is emitted directly. Gate is either size-dependent or bypassed for 7B's code path.

## Observations

1. **Both GPU backends fail differently:**
   - wgpu produces Italian "ampiezza" — likely a layout/offset corruption in CPU fallback for the single skipped `lm_head`.
   - cuBLAS FP8 produces "<|im_start|>" repeats — a different corruption mode (looks like an attention/sampling failure mid-pipeline).

2. **wgpu's `max_storage_buffer_binding_size` ≈ 2147.5 MB.** This is `u32::MAX / 2 ≈ 2^31`, the standard wgpu `MAX_STORAGE_BUFFER_BINDING_SIZE`. 7B's lm_head dequantized to F32 is 152064 × 3584 × 4 = 2180 MB, exceeding the limit. The fallback at `wgsl_forward.rs:567` silently uploads the weight to CPU; the CPU fallback path on `lm_head` alone appears to corrupt output.

3. **F2 / parity gates protect 1.5B but not 7B.** The `apr-cpu-vs-gpu-output-parity-v1` contract refuses GPU output when cosine vs CPU < 0.99 on 1.5B. For 7B with partial CPU fallback (one weight only), no such gate fires — bad output reaches stdout.

4. **`apr run` exit=0 on gibberish.** Per `apr-cpu-vs-gpu-output-parity-v1` and the lessons in [#374 / #559](https://github.com/paiml/aprender/issues/374), parity-rejected output should not be silently emitted with exit 0. This is a secondary defect that lets the primary bug ship to end users via `apr serve` and `apr code`.

## Severity

**P0 — release-blocker for v0.35.0.** The performance table in README.md claims "Qwen2.5-Coder 7B Q4_K 225+ tok/s RTX 4090", which is the canonical demo configuration for the project. `apr qa` already detects the bug as a Golden Output failure; shipping v0.35.0 to crates.io would push the regression to every user who tries the headline 7B model on GPU.

## Suggested next steps

1. **Bisect** v0.34.0 (2026-05-18 tag, 7e3c9cfa5) → HEAD (0d8d52b25, 71 commits) for the first commit that fails `apr qa /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf` Golden Output gate. Candidates from `git log v0.34.0..HEAD -- crates/aprender-serve/src/api/cuda_chat_backend.rs`:
   - 6bff4ce2c qwen3-moe streaming SSE (#1854)
   - 65f4af005 try_qwen3_moe_backend EOS stop_tokens (#1852)
   - 1910e9eb6 3-knob HTTP wire-up (#1846)
   - 9c974524f qwen3_moe arch guard at /v1/chat/completions (#1806)
2. **Tighten the parity gate** to cover the partial-CPU-fallback case (currently only triggers on full GPU run).
3. **Make `apr run` exit non-zero** when the parity gate rejects the GPU output — gibberish should never ship with exit 0.

## Artifacts

- Host: noah-Lambda-Vector (RTX 4090, sm_89, wgpu via Vulkan, CUDA 12.x, FP8 cuBLASLt available)
- Model: qwen2.5-coder-7b-instruct-q4_k_m.gguf (4.7 GB)
- Build: HEAD = 0d8d52b25 (release/v0.35.0 worktree)

Surfaced by v0.35.0 release dogfood, 2026-05-22.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Summary

Reproducer 1: `apr run` (wgpu/Vulkan, RTX 4090)

Reproducer 2: `apr qa` (cuBLAS / FP8 path)

Reproducer 3: 1.5B Q4K APR works (safety net firing)

Observations

Severity

Suggested next steps

Artifacts

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Description

Summary

Reproducer 1: apr run (wgpu/Vulkan, RTX 4090)

Reproducer 2: apr qa (cuBLAS / FP8 path)

Reproducer 3: 1.5B Q4K APR works (safety net firing)

Observations

Severity

Suggested next steps

Artifacts

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Reproducer 1: `apr run` (wgpu/Vulkan, RTX 4090)

Reproducer 2: `apr qa` (cuBLAS / FP8 path)