Summary
apr run and apr qa on qwen2.5-coder-7b-instruct-q4_k_m.gguf produce gibberish output via every GPU path, while CPU produces the correct answer. This is the same defect class as the previously-closed #374 (Q4K CUDA kernel, hidden_dim=3584) and #559 (GPU parity cosine=-0.005 on Qwen2.5-7B GQA hidden=3584, heads=28, kv=4), surfaced during v0.35.0 release dogfooding.
Reproducer 1: apr run (wgpu/Vulkan, RTX 4090)
$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf "What is 2+2?" --max-tokens 16
Backend: wgpu (Vulkan)
[PMAT-333] Dequantizing 28 layers (hidden=3584, heads=28/4, intermediate=18944)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[wgpu] Skipping weight 'lm_head' (2180.0 MB > 2147.5 MB limit) — CPU fallback
Output:
ampiezza
ampiezza
Completed in 40.49s (cached)
exit=0
Reproducer 2: apr qa (cuBLAS / FP8 path)
$ apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf
✓ PASS Capability Match
✓ PASS Tensor Contract (339 tensors, all PMAT-235 gates)
✓ PASS Metadata Plausibility (arch=qwen2, rope_theta=1000000, max_pos=131072)
[GH-129] Early kernel preload: 49 modules compiled
[PMAT-082] cuBLASLt FP8 JIT warmed (3584×16×3584)
[PMAT-053] FP8 weight cache: 197 matrices cached (6742.8 MB) in 212.7ms
✗ FAIL Golden Output GPU output failed (CPU passed): golden_output_gpu: gibberish (fragment "<|im_start|>" repeats 3+ times)
Reproducer 3: 1.5B Q4K APR works (safety net firing)
$ apr qa /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr
[GH-480] F2 validation FAILED — falling back to CPU
[GH-480] F2 validation FAILED — falling back to CPU
✓ PASS Golden Output (2 golden test cases passed)
1.5B model: F2 validation rejects the GPU output, CPU fallback runs, test passes.
7B model: NO F2 validation rejection — GPU result is emitted directly. Gate is either size-dependent or bypassed for 7B's code path.
Observations
-
Both GPU backends fail differently:
- wgpu produces Italian "ampiezza" — likely a layout/offset corruption in CPU fallback for the single skipped
lm_head.
- cuBLAS FP8 produces "<|im_start|>" repeats — a different corruption mode (looks like an attention/sampling failure mid-pipeline).
-
wgpu's max_storage_buffer_binding_size ≈ 2147.5 MB. This is u32::MAX / 2 ≈ 2^31, the standard wgpu MAX_STORAGE_BUFFER_BINDING_SIZE. 7B's lm_head dequantized to F32 is 152064 × 3584 × 4 = 2180 MB, exceeding the limit. The fallback at wgsl_forward.rs:567 silently uploads the weight to CPU; the CPU fallback path on lm_head alone appears to corrupt output.
-
F2 / parity gates protect 1.5B but not 7B. The apr-cpu-vs-gpu-output-parity-v1 contract refuses GPU output when cosine vs CPU < 0.99 on 1.5B. For 7B with partial CPU fallback (one weight only), no such gate fires — bad output reaches stdout.
-
apr run exit=0 on gibberish. Per apr-cpu-vs-gpu-output-parity-v1 and the lessons in #374 / #559, parity-rejected output should not be silently emitted with exit 0. This is a secondary defect that lets the primary bug ship to end users via apr serve and apr code.
Severity
P0 — release-blocker for v0.35.0. The performance table in README.md claims "Qwen2.5-Coder 7B Q4_K 225+ tok/s RTX 4090", which is the canonical demo configuration for the project. apr qa already detects the bug as a Golden Output failure; shipping v0.35.0 to crates.io would push the regression to every user who tries the headline 7B model on GPU.
Suggested next steps
- Bisect v0.34.0 (2026-05-18 tag, 7e3c9cfa5) → HEAD (0d8d52b, 71 commits) for the first commit that fails
apr qa /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf Golden Output gate. Candidates from git log v0.34.0..HEAD -- crates/aprender-serve/src/api/cuda_chat_backend.rs:
- Tighten the parity gate to cover the partial-CPU-fallback case (currently only triggers on full GPU run).
- Make
apr run exit non-zero when the parity gate rejects the GPU output — gibberish should never ship with exit 0.
Artifacts
- Host: noah-Lambda-Vector (RTX 4090, sm_89, wgpu via Vulkan, CUDA 12.x, FP8 cuBLASLt available)
- Model: qwen2.5-coder-7b-instruct-q4_k_m.gguf (4.7 GB)
- Build: HEAD = 0d8d52b (release/v0.35.0 worktree)
Surfaced by v0.35.0 release dogfood, 2026-05-22.
Summary
apr runandapr qaonqwen2.5-coder-7b-instruct-q4_k_m.ggufproduce gibberish output via every GPU path, while CPU produces the correct answer. This is the same defect class as the previously-closed #374 (Q4K CUDA kernel, hidden_dim=3584) and #559 (GPU parity cosine=-0.005 on Qwen2.5-7B GQA hidden=3584, heads=28, kv=4), surfaced during v0.35.0 release dogfooding.Reproducer 1:
apr run(wgpu/Vulkan, RTX 4090)Reproducer 2:
apr qa(cuBLAS / FP8 path)Reproducer 3: 1.5B Q4K APR works (safety net firing)
1.5B model: F2 validation rejects the GPU output, CPU fallback runs, test passes.
7B model: NO F2 validation rejection — GPU result is emitted directly. Gate is either size-dependent or bypassed for 7B's code path.
Observations
Both GPU backends fail differently:
lm_head.wgpu's
max_storage_buffer_binding_size≈ 2147.5 MB. This isu32::MAX / 2 ≈ 2^31, the standard wgpuMAX_STORAGE_BUFFER_BINDING_SIZE. 7B's lm_head dequantized to F32 is 152064 × 3584 × 4 = 2180 MB, exceeding the limit. The fallback atwgsl_forward.rs:567silently uploads the weight to CPU; the CPU fallback path onlm_headalone appears to corrupt output.F2 / parity gates protect 1.5B but not 7B. The
apr-cpu-vs-gpu-output-parity-v1contract refuses GPU output when cosine vs CPU < 0.99 on 1.5B. For 7B with partial CPU fallback (one weight only), no such gate fires — bad output reaches stdout.apr runexit=0 on gibberish. Perapr-cpu-vs-gpu-output-parity-v1and the lessons in #374 / #559, parity-rejected output should not be silently emitted with exit 0. This is a secondary defect that lets the primary bug ship to end users viaapr serveandapr code.Severity
P0 — release-blocker for v0.35.0. The performance table in README.md claims "Qwen2.5-Coder 7B Q4_K 225+ tok/s RTX 4090", which is the canonical demo configuration for the project.
apr qaalready detects the bug as a Golden Output failure; shipping v0.35.0 to crates.io would push the regression to every user who tries the headline 7B model on GPU.Suggested next steps
apr qa /path/to/qwen2.5-coder-7b-instruct-q4_k_m.ggufGolden Output gate. Candidates fromgit log v0.34.0..HEAD -- crates/aprender-serve/src/api/cuda_chat_backend.rs:apr runexit non-zero when the parity gate rejects the GPU output — gibberish should never ship with exit 0.Artifacts
Surfaced by v0.35.0 release dogfood, 2026-05-22.