fix(gguf): Q5_0/Q5_1 dequant layout matches GGML reference (closes #1623) by noahgift · Pull Request #1656 · paiml/aprender

noahgift · 2026-05-13T12:06:21Z

Summary

Closes #1623. Fixes a real Q5_0/Q5_1 dequantization bug in `crates/aprender-core/src/format/gguf/dequant.rs`.

The bug

For each block (32 elements), the code emitted values in interleaved order `[v0, v1, v0, v1, ...]` and indexed the high-bits register with `i2` / `i2+1`. GGML / llama.cpp instead use two halves:

```
for j in 0..16:
y[j] = low0 (qs[j] & 0x0F) | (qh bit j << 4)
y[j + 16] = low1 (qs[j] >> 4) | (qh bit j+16 << 4)
```

Why tests didn't catch it

The pre-existing `test_dequantize_q5_0_basic` / `_q5_1_basic` tests used all-zero inputs and only asserted result length and `is_finite()`. The interleaved emit produced 32 zeros either way.

Fix

Q5_0: corrected layout (block_start + j and block_start + j + 16), corrected high-bit indices (qh >> j and qh >> (j + 16))
Q5_1: same fix

Both follow llama.cpp's `dequantize_row_q5_0` / `dequantize_row_q5_1` exactly.

Tests

Two new layout-verifying tests:

```
test_dequantize_q5_0_ggml_layout
test_dequantize_q5_1_ggml_layout
```

Both set qh bits at positions 0 and 16, set qs[0] = 0x0F (low0=15, low1=0), and assert that:

element 0 receives the bit-0 high + low0 nibble
element 16 receives the bit-16 high + low1 nibble
intermediate positions receive default zeros

These tests fail under the buggy code and pass under the fix.

Reporter: a Coursera capstone investigation that found the bug while running mixed-quant GGUF through apr run.

Test plan

cargo test -p aprender-core --lib dequantize_q5 — 14/14 pass locally
CI: workspace-test

Note

The issue also called out `prepare_tokens_apr` in aprender-serve. That is a separate prompt-templating bug — out of scope for this PR.

🤖 Generated with Claude Code

…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX) §74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's stage-bisection scaffold (CPU vs GPU per-stage statistics analysis). The F32 GEMV PTX kernel was reading weights with TRANSPOSED layout interpretation: Bug: kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j), but actual ML weights are stored [output_dim=N, input_dim=K] row-major (A[i,j] at i*K+j per PyTorch/SafeTensors/GGUF convention and PMAT-333 F32 dequantization output). Symptom: GPU read transposed weights → computed y = A^T @ x instead of y = A @ x → systematically anti-correlated logits (cos=-0.005190 vs CPU, top-10 divergences all sign-flipped, CPU mean=-2.42 vs GPU mean=0.013). Fix: rewrite the inner loop to iterate along the K dimension within row block_id: row_base = a_ptr + block_id * K * 4 thread reads A[block_id, t], A[block_id, t+32], ... instead of: col_base = a_ptr + block_id * 4 thread reads A[t, block_id], A[t+32, block_id], ... Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090, default graphed path): PARITY-GATE: PASS (no error from forward_gpu_resident) Throughput @ 128-tok 5-iter decode: 124.6 tok/s AC-SHIP1-007 floor: 30 tok/s Headroom: 4.15× over floor TTFT: 8.39 ms p50 latency: 1016 ms Before PR-E: PARITY-GATE FAILED cos=-0.005190 Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73) GPU CANNOT serve this model After PR-E: PARITY-GATE PASS, default path, NO workarounds 124.6 tok/s, 4.15× over floor Ship-% impact: MODEL-1 ship %: **99% → 100%** 10 of 10 AC-SHIP1-* LIVE-DISCHARGED: SHIP-001 (§72) SHIP-002 (§61) SHIP-003 (§72) SHIP-004 (§72) SHIP-005 (§71) SHIP-006 (§61.8) SHIP-007 (this PR) SHIP-008 (§61) SHIP-009 (§72) SHIP-010 (§72) MODEL-2 ship %: unchanged at 57% (independent track). Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649) → §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's '3-5 PR / 3-5 day' estimate. Auxiliary change: logits.rs adds APR_LM_HEAD_FORCE_QTYPE env-var probe kept as a diagnostic tool (zero behavior change when unset). Test plan: - [x] cargo build --release -p apr-cli --bin apr --features cuda → clean - [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true - [x] apr parity → PARITY-GATE PASS - [ ] CI tests (workspace-test on per-PR runner) Refs: - §74 SHIP-007 bug localized (PR #1650) - §73 SHIP-007 cascade reduction (PR #1647) - contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract) - PR #1649 (PR-B GPU stage dump scaffold) - AC-SHIP1-007 (spec §5) - evidence/section-75-ship-007-discharged-2026-05-13/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…07 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN) The env-var bisection probe added in PR-E (this branch) introduced a `_ =>` catch-all inside a `match` expression that referenced `WeightQuantType` in its arm values. The `falsify_007_no_catch_all_ in_dispatch_sites` contract test's 30-line walk-back heuristic flagged this as a violation, even though the match was on `&str` (env var value), not on `WeightQuantType`. The probe was a bisection tool used to identify the bug location during §74. Now that §75 has shipped the actual fix and the probe is no longer needed, removing it cleans up the contract violation. The remaining PR-E change is solely the F32 GEMV PTX kernel layout fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs` — that's the actual bug fix. Test verified: cargo test -p aprender-serve --lib \ quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites → 1 passed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

) The Q5_0 and Q5_1 dequantizers in aprender-core were emitting values in interleaved order [v0, v1, v0, v1, ...] and using wrong high-bit indices (i*2 / i*2+1). GGML / llama.cpp layout is: for j in 0..16: y[j] = low0 (qs[j] & 0x0F) | (qh bit j << 4) y[j + 16] = low1 (qs[j] >> 4) | (qh bit j+16 << 4) Two halves, NOT interleaved. High bit for element j uses bit j; for element j+16 uses bit j+16. Existing tests only checked length and finite-ness — never the layout. Adds two GGML-reference layout tests (`test_dequantize_q5_0_ggml_layout`, `test_dequantize_q5_1_ggml_layout`) that fail under the buggy code and pass under the fix. Reported in #1623 from a Coursera capstone using mixed-quant GGUF.

noahgift and others added 11 commits May 13, 2026 09:33

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

413971c

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

cbaf08a

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

ce8be77

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

5a8cd3c

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

8a0f0f6

ci: trigger fresh workflow run for flake-class test re-execution

476eeaf

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

5e56a10

Merge branch 'main' into fix/ship-007-pr-e-f32-gemv-layout

0a5fe58

noahgift enabled auto-merge (squash) May 13, 2026 12:06

noahgift added 10 commits May 13, 2026 14:24

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

d4895fc

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

2b5eab1

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

0212218

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

3b4e444

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

931a0ba

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

13e53b5

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

bb2a9d9

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

39bba40

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

c1fae4c

Merge branch 'main' into fix/1623-q5-dequant-ggml-layout

d2a6e4c

noahgift merged commit 8ca02aa into main May 14, 2026
10 checks passed

noahgift deleted the fix/1623-q5-dequant-ggml-layout branch May 14, 2026 01:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gguf): Q5_0/Q5_1 dequant layout matches GGML reference (closes #1623)#1656

fix(gguf): Q5_0/Q5_1 dequant layout matches GGML reference (closes #1623)#1656
noahgift merged 21 commits into
mainfrom
fix/1623-q5-dequant-ggml-layout

noahgift commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 13, 2026

Summary

The bug

Why tests didn't catch it

Fix

Tests

Test plan

Note

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant