Skip to content

test: document QJL regression in test_turboquant_improves_over_polarquant#62

Closed
brosequist wants to merge 1 commit into
TheTom:mainfrom
brosequist:test/qjl-regression-assertion
Closed

test: document QJL regression in test_turboquant_improves_over_polarquant#62
brosequist wants to merge 1 commit into
TheTom:mainfrom
brosequist:test/qjl-regression-assertion

Conversation

@brosequist

Copy link
Copy Markdown

Summary

  • test_turboquant_improves_over_polarquant had no assertion — it only printed values. This adds a regression assertion documenting the empirical finding that QJL increases inner-product distortion (TQ 2-bit avg ≈ 0.091 vs PQ 2-bit avg ≈ 0.041).
  • The assertion is written as a regression guard: if QJL is ever fixed to actually improve things, the test will fail with a clear message prompting re-evaluation.

Test plan

  • pytest tests/test_distortion.py::TestDistortionScaling::test_turboquant_improves_over_polarquant

🤖 Generated with Claude Code

…uant

The existing test ended with a print() and no assertion, silently allowing QJL
to be worse than PolarQuant. This updates the test to assert the known finding:
QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same
bit budget. The assertion will alert if QJL is ever fixed and starts winning,
prompting re-evaluation of the production path. See turbo4-resurrection.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clonekang pushed a commit to clonekang/turboquant_llama that referenced this pull request Apr 12, 2026
vulkan: fix and complete turbo3 KV cache support
clonekang pushed a commit to clonekang/turboquant_llama that referenced this pull request May 3, 2026
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
#include "rte.glsl" in a shader whose header no longer exists, and
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
clonekang pushed a commit to clonekang/turboquant_llama that referenced this pull request May 3, 2026
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
#include "rte.glsl" in a shader whose header no longer exists, and
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
@brosequist brosequist closed this May 9, 2026
@brosequist brosequist deleted the test/qjl-regression-assertion branch May 9, 2026 08:32
markfietje pushed a commit to markfietje/turboquant_plus that referenced this pull request Jun 8, 2026
…b9190 upstream sync)

The b9190 upstream-sync merge (e30bbcf) clobbered the Vulkan turbo
KV cache support that landed in PR TheTom#62 (ff8bb73). Shaders survived,
but the C++ dispatch path in ggml-vulkan.cpp was dropped during merge
conflict resolution. This restores the lost wiring, adapts it to the
post-sync Vulkan API, and adds two fixes that surface on RDNA4.

Closes TheTom#88, TheTom#89. Tracking: #159.

Restored / added:
- pipeline_turbo_wht member + creation, ggml_vk_turbo_wht dispatch
  fn, GGML_OP_TURBO_WHT op handler. Reuses existing turbo_wht.comp
  shader. Explicit ggml_vk_sync_buffers barriers around dispatch for
  compute-to-compute RAW/WAW ordering.
- CREATE_FA(GGML_TYPE_TURBO3_0, ...) SPIR-V for scalar + cm1 coopmat
  FA paths (cm2 NV-coopmat skipped, fp32-only for turbo3).
- fa_kv_ok extended to accept turbo2/3/4 in supports_op(FA).
- flash_attn_dequant.glsl: turbo K/V SSBO bindings with restrict +
  std430 to avoid alias-stride issues, FA_DEQUANT4_TURBO{2,3,4}_0
  macros for per-element centroid lookup (no iWHT in kernel — graph
  applies WHT to Q pre-attention and inverse to output post-attn).
- Dedicated DATA_A_TURBO3_0 SPIR-V variant gates away the generic
  f16/q4/q5/q8 binding aliases — eliminates SSBO alias collision on
  RDNA where mismatched stride aliases at the same binding caused
  driver mis-strided loads.
- memoryBarrierShared() added in flash_attn_cm1.comp at shared-mem
  load sites for kvsh to fix the cm1 coopmat read-before-write race.
- flash_attn.comp + flash_attn_cm1.comp gate f16 K/V aliases under
  #if !defined(DATA_A_TURBO3_0).
- fa_block_elems extended with turbo2/3/4 cases (types 42/43/44).
- SET_ROWS pipelines registered for turbo2/turbo3/turbo4, with the
  required head_dim%128 guard in supports_op.
- op_f32 dispatch math special-cases turbo: 128 threads per WG
  handling one full QK=128 block, so ne = ne/128 rather than the
  generic CEIL_DIV(ne, 32 * blck_size).

Two root causes (out of ~8 hypotheses):

1. requiredSubgroupSize=32 on turbo set_rows pipelines. RDNA4 driver
   picks wave64 by default for compute pipelines unless explicitly
   constrained. The turbo set_rows shader uses subgroupBallot to pack
   32 sign bits into a single uvec4.x word; on wave64 the ballot
   covered 64 lanes and the sign packing was off by 32, silently
   producing garbage in cache writes. Pinning to wave32 via
   VkPipelineShaderStageRequiredSubgroupSizeCreateInfo makes the
   ballot semantics deterministic. This is THE root cause of TheTom#88's
   incoherent decode on RDNA4.

2. ne = ne/128 dispatch math. turbo set_rows runs one full QK=128
   block per workgroup with 128 threads. Generic quant path uses 32
   threads per WG with CEIL_DIV(ne, 32 * blck_size). Without the
   special case, the turbo path was under-dispatched by 32x, leaving
   most of the cache uninitialized. This was the missing line from
   the original April implementation in ff8bb73 — present then,
   dropped during b9190 sync.

Verification on RX 9070 XT (RDNA4, gfx1201), Qwen2.5-7B Q4_K_M:

  Config                                  pp512    tg128   Status
  --cache-type-k f16 --cache-type-v f16   2636 t/s 48.4    no regression
  --cache-type-k q8_0 --cache-type-v turbo4  1951 t/s 49.2 coherent
  --cache-type-k q8_0 --cache-type-v turbo3   446 t/s 45.6 coherent
  --cache-type-k q8_0 --cache-type-v turbo2   453 t/s 45.2 coherent

All decode against HIP turbo3 baseline produces coherent output.
Prefill on Vulkan now lands at ~3.3x the previously-measured HIP
throughput on the same hardware. Decode is ~56% of HIP; tuning
headroom for follow-up.

#137 (RDNA 3.5 GFX1151 case) not retested — requires Strix Halo
hardware. Likely fixed by the same requiredSubgroupSize=32 fix
since RDNA 3.5 has the same wave-flexible scheduling, but unverified.

Co-Authored-By: Simon Gardling <titaniumtown@proton.me>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant