test: document QJL regression in test_turboquant_improves_over_polarquant by brosequist · Pull Request #62 · TheTom/turboquant_plus

brosequist · 2026-04-01T20:24:13Z

Summary

test_turboquant_improves_over_polarquant had no assertion — it only printed values. This adds a regression assertion documenting the empirical finding that QJL increases inner-product distortion (TQ 2-bit avg ≈ 0.091 vs PQ 2-bit avg ≈ 0.041).
The assertion is written as a regression guard: if QJL is ever fixed to actually improve things, the test will fail with a clear message prompting re-evaluation.

Test plan

pytest tests/test_distortion.py::TestDistortionScaling::test_turboquant_improves_over_polarquant

🤖 Generated with Claude Code

…uant The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will alert if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vulkan: fix and complete turbo3 KV cache support

Origin's April upstream-sync rebase interleaved two changes that left the Vulkan turbo3 KV path broken: * ggml-org/llama.cpp upstream PR #21572 (1f30ac0) moved fp16 RTE rounding to a runtime SPIR-V patch and dropped the _rte shader variants plus rte.glsl itself. * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV support against a base that still had those variants. After the rebase, the tree had dangling cpy_f32_*_rte_len / _data references, a two-arg SET_ROWS macro called with one arg, a #include "rte.glsl" in a shader whose header no longer exists, and MMQ shader variants generated for turbo3_0 even though the flash_attn MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp failed to compile on a clean checkout (spirv-headers + all of the above) and the shader-gen emitted garbage variants. Separately, turbo3 flash-attn pipelines were only wired up for FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0}) and tripped the Br == wg_denoms[0] assertion as soon as a prefill ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3 aborted on the first real forward pass. Changes: * Drop the if (float_controls_rte_fp16) / else branches around cpy_f32_quant pipeline creation and collapse SET_ROWS to a single variant, matching upstream post-1f30ac0ce. * Remove the #include "rte.glsl" from copy_to_quant.comp. * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader generator (no MMQ code path for it). * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1) and the _cm2 counterpart alongside the other quant types. Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan 1.4.341, spirv-headers 1.4.341.0): * Full HIP+Vulkan build is clean with no shader compile errors. * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147 * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 : 530 cases pass (previously aborted on case 3). * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 : still green (no HIP regression). * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3 -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128 correctness issue on the Vulkan turbo3 decode path is pre-existing and orthogonal to this change. llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend: F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17 Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81

…b9190 upstream sync) The b9190 upstream-sync merge (e30bbcf) clobbered the Vulkan turbo KV cache support that landed in PR TheTom#62 (ff8bb73). Shaders survived, but the C++ dispatch path in ggml-vulkan.cpp was dropped during merge conflict resolution. This restores the lost wiring, adapts it to the post-sync Vulkan API, and adds two fixes that surface on RDNA4. Closes TheTom#88, TheTom#89. Tracking: #159. Restored / added: - pipeline_turbo_wht member + creation, ggml_vk_turbo_wht dispatch fn, GGML_OP_TURBO_WHT op handler. Reuses existing turbo_wht.comp shader. Explicit ggml_vk_sync_buffers barriers around dispatch for compute-to-compute RAW/WAW ordering. - CREATE_FA(GGML_TYPE_TURBO3_0, ...) SPIR-V for scalar + cm1 coopmat FA paths (cm2 NV-coopmat skipped, fp32-only for turbo3). - fa_kv_ok extended to accept turbo2/3/4 in supports_op(FA). - flash_attn_dequant.glsl: turbo K/V SSBO bindings with restrict + std430 to avoid alias-stride issues, FA_DEQUANT4_TURBO{2,3,4}_0 macros for per-element centroid lookup (no iWHT in kernel — graph applies WHT to Q pre-attention and inverse to output post-attn). - Dedicated DATA_A_TURBO3_0 SPIR-V variant gates away the generic f16/q4/q5/q8 binding aliases — eliminates SSBO alias collision on RDNA where mismatched stride aliases at the same binding caused driver mis-strided loads. - memoryBarrierShared() added in flash_attn_cm1.comp at shared-mem load sites for kvsh to fix the cm1 coopmat read-before-write race. - flash_attn.comp + flash_attn_cm1.comp gate f16 K/V aliases under #if !defined(DATA_A_TURBO3_0). - fa_block_elems extended with turbo2/3/4 cases (types 42/43/44). - SET_ROWS pipelines registered for turbo2/turbo3/turbo4, with the required head_dim%128 guard in supports_op. - op_f32 dispatch math special-cases turbo: 128 threads per WG handling one full QK=128 block, so ne = ne/128 rather than the generic CEIL_DIV(ne, 32 * blck_size). Two root causes (out of ~8 hypotheses): 1. requiredSubgroupSize=32 on turbo set_rows pipelines. RDNA4 driver picks wave64 by default for compute pipelines unless explicitly constrained. The turbo set_rows shader uses subgroupBallot to pack 32 sign bits into a single uvec4.x word; on wave64 the ballot covered 64 lanes and the sign packing was off by 32, silently producing garbage in cache writes. Pinning to wave32 via VkPipelineShaderStageRequiredSubgroupSizeCreateInfo makes the ballot semantics deterministic. This is THE root cause of TheTom#88's incoherent decode on RDNA4. 2. ne = ne/128 dispatch math. turbo set_rows runs one full QK=128 block per workgroup with 128 threads. Generic quant path uses 32 threads per WG with CEIL_DIV(ne, 32 * blck_size). Without the special case, the turbo path was under-dispatched by 32x, leaving most of the cache uninitialized. This was the missing line from the original April implementation in ff8bb73 — present then, dropped during b9190 sync. Verification on RX 9070 XT (RDNA4, gfx1201), Qwen2.5-7B Q4_K_M: Config pp512 tg128 Status --cache-type-k f16 --cache-type-v f16 2636 t/s 48.4 no regression --cache-type-k q8_0 --cache-type-v turbo4 1951 t/s 49.2 coherent --cache-type-k q8_0 --cache-type-v turbo3 446 t/s 45.6 coherent --cache-type-k q8_0 --cache-type-v turbo2 453 t/s 45.2 coherent All decode against HIP turbo3 baseline produces coherent output. Prefill on Vulkan now lands at ~3.3x the previously-measured HIP throughput on the same hardware. Decode is ~56% of HIP; tuning headroom for follow-up. #137 (RDNA 3.5 GFX1151 case) not retested — requires Strix Halo hardware. Likely fixed by the same requiredSubgroupSize=32 fix since RDNA 3.5 has the same wave-flexible scheduling, but unverified. Co-Authored-By: Simon Gardling <titaniumtown@proton.me> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clonekang pushed a commit to clonekang/turboquant_llama that referenced this pull request Apr 12, 2026

Merge pull request TheTom#62 from Titaniumtown/pr/turboquant-vulkan-work

8590cbf

vulkan: fix and complete turbo3 KV cache support

brosequist closed this May 9, 2026

brosequist deleted the test/qjl-regression-assertion branch May 9, 2026 08:32

brosequist mentioned this pull request May 9, 2026

Accepted fixes & tests: V/K-norm size accounting, SeedSequence cleanup, QJL regression guard, rotation + ruff #90

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: document QJL regression in test_turboquant_improves_over_polarquant#62

test: document QJL regression in test_turboquant_improves_over_polarquant#62
brosequist wants to merge 1 commit into
TheTom:mainfrom
brosequist:test/qjl-regression-assertion

brosequist commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brosequist commented Apr 1, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant