Skip to content

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

@noahgift

Description

@noahgift

Summary

apr run and apr qa on qwen2.5-coder-7b-instruct-q4_k_m.gguf produce gibberish output via every GPU path, while CPU produces the correct answer. This is the same defect class as the previously-closed #374 (Q4K CUDA kernel, hidden_dim=3584) and #559 (GPU parity cosine=-0.005 on Qwen2.5-7B GQA hidden=3584, heads=28, kv=4), surfaced during v0.35.0 release dogfooding.

Reproducer 1: apr run (wgpu/Vulkan, RTX 4090)

$ apr run /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf "What is 2+2?" --max-tokens 16
Backend: wgpu (Vulkan)
[PMAT-333] Dequantizing 28 layers (hidden=3584, heads=28/4, intermediate=18944)
[PMAT-333] Dequantized 337 weights, 28282.5 MB F32
[wgpu] Skipping weight 'lm_head' (2180.0 MB > 2147.5 MB limit) — CPU fallback

Output:
ampiezza

ampiezza

Completed in 40.49s (cached)
exit=0

Reproducer 2: apr qa (cuBLAS / FP8 path)

$ apr qa /home/noah/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf
✓ PASS Capability Match
✓ PASS Tensor Contract  (339 tensors, all PMAT-235 gates)
✓ PASS Metadata Plausibility  (arch=qwen2, rope_theta=1000000, max_pos=131072)
[GH-129] Early kernel preload: 49 modules compiled
[PMAT-082] cuBLASLt FP8 JIT warmed (3584×16×3584)
[PMAT-053] FP8 weight cache: 197 matrices cached (6742.8 MB) in 212.7ms
✗ FAIL Golden Output  GPU output failed (CPU passed): golden_output_gpu: gibberish (fragment "<|im_start|>" repeats 3+ times)

Reproducer 3: 1.5B Q4K APR works (safety net firing)

$ apr qa /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr
[GH-480] F2 validation FAILED — falling back to CPU
[GH-480] F2 validation FAILED — falling back to CPU
✓ PASS Golden Output  (2 golden test cases passed)

1.5B model: F2 validation rejects the GPU output, CPU fallback runs, test passes.
7B model: NO F2 validation rejection — GPU result is emitted directly. Gate is either size-dependent or bypassed for 7B's code path.

Observations

  1. Both GPU backends fail differently:

    • wgpu produces Italian "ampiezza" — likely a layout/offset corruption in CPU fallback for the single skipped lm_head.
    • cuBLAS FP8 produces "<|im_start|>" repeats — a different corruption mode (looks like an attention/sampling failure mid-pipeline).
  2. wgpu's max_storage_buffer_binding_size ≈ 2147.5 MB. This is u32::MAX / 2 ≈ 2^31, the standard wgpu MAX_STORAGE_BUFFER_BINDING_SIZE. 7B's lm_head dequantized to F32 is 152064 × 3584 × 4 = 2180 MB, exceeding the limit. The fallback at wgsl_forward.rs:567 silently uploads the weight to CPU; the CPU fallback path on lm_head alone appears to corrupt output.

  3. F2 / parity gates protect 1.5B but not 7B. The apr-cpu-vs-gpu-output-parity-v1 contract refuses GPU output when cosine vs CPU < 0.99 on 1.5B. For 7B with partial CPU fallback (one weight only), no such gate fires — bad output reaches stdout.

  4. apr run exit=0 on gibberish. Per apr-cpu-vs-gpu-output-parity-v1 and the lessons in #374 / #559, parity-rejected output should not be silently emitted with exit 0. This is a secondary defect that lets the primary bug ship to end users via apr serve and apr code.

Severity

P0 — release-blocker for v0.35.0. The performance table in README.md claims "Qwen2.5-Coder 7B Q4_K 225+ tok/s RTX 4090", which is the canonical demo configuration for the project. apr qa already detects the bug as a Golden Output failure; shipping v0.35.0 to crates.io would push the regression to every user who tries the headline 7B model on GPU.

Suggested next steps

  1. Bisect v0.34.0 (2026-05-18 tag, 7e3c9cfa5) → HEAD (0d8d52b, 71 commits) for the first commit that fails apr qa /path/to/qwen2.5-coder-7b-instruct-q4_k_m.gguf Golden Output gate. Candidates from git log v0.34.0..HEAD -- crates/aprender-serve/src/api/cuda_chat_backend.rs:
  2. Tighten the parity gate to cover the partial-CPU-fallback case (currently only triggers on full GPU run).
  3. Make apr run exit non-zero when the parity gate rejects the GPU output — gibberish should never ship with exit 0.

Artifacts

  • Host: noah-Lambda-Vector (RTX 4090, sm_89, wgpu via Vulkan, CUDA 12.x, FP8 cuBLASLt available)
  • Model: qwen2.5-coder-7b-instruct-q4_k_m.gguf (4.7 GB)
  • Build: HEAD = 0d8d52b (release/v0.35.0 worktree)

Surfaced by v0.35.0 release dogfood, 2026-05-22.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions