test: document QJL regression in test_turboquant_improves_over_polarquant#62
Closed
brosequist wants to merge 1 commit into
Closed
test: document QJL regression in test_turboquant_improves_over_polarquant#62brosequist wants to merge 1 commit into
brosequist wants to merge 1 commit into
Conversation
…uant The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will alert if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clonekang
pushed a commit
to clonekang/turboquant_llama
that referenced
this pull request
Apr 12, 2026
vulkan: fix and complete turbo3 KV cache support
clonekang
pushed a commit
to clonekang/turboquant_llama
that referenced
this pull request
May 3, 2026
Origin's April upstream-sync rebase interleaved two changes that left the Vulkan turbo3 KV path broken: * ggml-org/llama.cpp upstream PR #21572 (1f30ac0) moved fp16 RTE rounding to a runtime SPIR-V patch and dropped the _rte shader variants plus rte.glsl itself. * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV support against a base that still had those variants. After the rebase, the tree had dangling cpy_f32_*_rte_len / _data references, a two-arg SET_ROWS macro called with one arg, a #include "rte.glsl" in a shader whose header no longer exists, and MMQ shader variants generated for turbo3_0 even though the flash_attn MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp failed to compile on a clean checkout (spirv-headers + all of the above) and the shader-gen emitted garbage variants. Separately, turbo3 flash-attn pipelines were only wired up for FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0}) and tripped the Br == wg_denoms[0] assertion as soon as a prefill ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3 aborted on the first real forward pass. Changes: * Drop the if (float_controls_rte_fp16) / else branches around cpy_f32_quant pipeline creation and collapse SET_ROWS to a single variant, matching upstream post-1f30ac0ce. * Remove the #include "rte.glsl" from copy_to_quant.comp. * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader generator (no MMQ code path for it). * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1) and the _cm2 counterpart alongside the other quant types. Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan 1.4.341, spirv-headers 1.4.341.0): * Full HIP+Vulkan build is clean with no shader compile errors. * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147 * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 : 530 cases pass (previously aborted on case 3). * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 : still green (no HIP regression). * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3 -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128 correctness issue on the Vulkan turbo3 decode path is pre-existing and orthogonal to this change. llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend: F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17 Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
clonekang
pushed a commit
to clonekang/turboquant_llama
that referenced
this pull request
May 3, 2026
Origin's April upstream-sync rebase interleaved two changes that left the Vulkan turbo3 KV path broken: * ggml-org/llama.cpp upstream PR #21572 (1f30ac0) moved fp16 RTE rounding to a runtime SPIR-V patch and dropped the _rte shader variants plus rte.glsl itself. * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV support against a base that still had those variants. After the rebase, the tree had dangling cpy_f32_*_rte_len / _data references, a two-arg SET_ROWS macro called with one arg, a #include "rte.glsl" in a shader whose header no longer exists, and MMQ shader variants generated for turbo3_0 even though the flash_attn MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp failed to compile on a clean checkout (spirv-headers + all of the above) and the shader-gen emitted garbage variants. Separately, turbo3 flash-attn pipelines were only wired up for FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0}) and tripped the Br == wg_denoms[0] assertion as soon as a prefill ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3 aborted on the first real forward pass. Changes: * Drop the if (float_controls_rte_fp16) / else branches around cpy_f32_quant pipeline creation and collapse SET_ROWS to a single variant, matching upstream post-1f30ac0ce. * Remove the #include "rte.glsl" from copy_to_quant.comp. * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader generator (no MMQ code path for it). * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1) and the _cm2 counterpart alongside the other quant types. Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan 1.4.341, spirv-headers 1.4.341.0): * Full HIP+Vulkan build is clean with no shader compile errors. * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147 * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 : 530 cases pass (previously aborted on case 3). * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 : still green (no HIP regression). * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3 -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128 correctness issue on the Vulkan turbo3 decode path is pre-existing and orthogonal to this change. llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend: F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17 Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
6 tasks
markfietje
pushed a commit
to markfietje/turboquant_plus
that referenced
this pull request
Jun 8, 2026
…b9190 upstream sync) The b9190 upstream-sync merge (e30bbcf) clobbered the Vulkan turbo KV cache support that landed in PR TheTom#62 (ff8bb73). Shaders survived, but the C++ dispatch path in ggml-vulkan.cpp was dropped during merge conflict resolution. This restores the lost wiring, adapts it to the post-sync Vulkan API, and adds two fixes that surface on RDNA4. Closes TheTom#88, TheTom#89. Tracking: #159. Restored / added: - pipeline_turbo_wht member + creation, ggml_vk_turbo_wht dispatch fn, GGML_OP_TURBO_WHT op handler. Reuses existing turbo_wht.comp shader. Explicit ggml_vk_sync_buffers barriers around dispatch for compute-to-compute RAW/WAW ordering. - CREATE_FA(GGML_TYPE_TURBO3_0, ...) SPIR-V for scalar + cm1 coopmat FA paths (cm2 NV-coopmat skipped, fp32-only for turbo3). - fa_kv_ok extended to accept turbo2/3/4 in supports_op(FA). - flash_attn_dequant.glsl: turbo K/V SSBO bindings with restrict + std430 to avoid alias-stride issues, FA_DEQUANT4_TURBO{2,3,4}_0 macros for per-element centroid lookup (no iWHT in kernel — graph applies WHT to Q pre-attention and inverse to output post-attn). - Dedicated DATA_A_TURBO3_0 SPIR-V variant gates away the generic f16/q4/q5/q8 binding aliases — eliminates SSBO alias collision on RDNA where mismatched stride aliases at the same binding caused driver mis-strided loads. - memoryBarrierShared() added in flash_attn_cm1.comp at shared-mem load sites for kvsh to fix the cm1 coopmat read-before-write race. - flash_attn.comp + flash_attn_cm1.comp gate f16 K/V aliases under #if !defined(DATA_A_TURBO3_0). - fa_block_elems extended with turbo2/3/4 cases (types 42/43/44). - SET_ROWS pipelines registered for turbo2/turbo3/turbo4, with the required head_dim%128 guard in supports_op. - op_f32 dispatch math special-cases turbo: 128 threads per WG handling one full QK=128 block, so ne = ne/128 rather than the generic CEIL_DIV(ne, 32 * blck_size). Two root causes (out of ~8 hypotheses): 1. requiredSubgroupSize=32 on turbo set_rows pipelines. RDNA4 driver picks wave64 by default for compute pipelines unless explicitly constrained. The turbo set_rows shader uses subgroupBallot to pack 32 sign bits into a single uvec4.x word; on wave64 the ballot covered 64 lanes and the sign packing was off by 32, silently producing garbage in cache writes. Pinning to wave32 via VkPipelineShaderStageRequiredSubgroupSizeCreateInfo makes the ballot semantics deterministic. This is THE root cause of TheTom#88's incoherent decode on RDNA4. 2. ne = ne/128 dispatch math. turbo set_rows runs one full QK=128 block per workgroup with 128 threads. Generic quant path uses 32 threads per WG with CEIL_DIV(ne, 32 * blck_size). Without the special case, the turbo path was under-dispatched by 32x, leaving most of the cache uninitialized. This was the missing line from the original April implementation in ff8bb73 — present then, dropped during b9190 sync. Verification on RX 9070 XT (RDNA4, gfx1201), Qwen2.5-7B Q4_K_M: Config pp512 tg128 Status --cache-type-k f16 --cache-type-v f16 2636 t/s 48.4 no regression --cache-type-k q8_0 --cache-type-v turbo4 1951 t/s 49.2 coherent --cache-type-k q8_0 --cache-type-v turbo3 446 t/s 45.6 coherent --cache-type-k q8_0 --cache-type-v turbo2 453 t/s 45.2 coherent All decode against HIP turbo3 baseline produces coherent output. Prefill on Vulkan now lands at ~3.3x the previously-measured HIP throughput on the same hardware. Decode is ~56% of HIP; tuning headroom for follow-up. #137 (RDNA 3.5 GFX1151 case) not retested — requires Strix Halo hardware. Likely fixed by the same requiredSubgroupSize=32 fix since RDNA 3.5 has the same wave-flexible scheduling, but unverified. Co-Authored-By: Simon Gardling <titaniumtown@proton.me> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
test_turboquant_improves_over_polarquanthad no assertion — it only printed values. This adds a regression assertion documenting the empirical finding that QJL increases inner-product distortion (TQ 2-bit avg ≈ 0.091 vs PQ 2-bit avg ≈ 0.041).Test plan
pytest tests/test_distortion.py::TestDistortionScaling::test_turboquant_improves_over_polarquant🤖 Generated with Claude Code