fix: CUDA warp-to-block mapping for block_size=128 #32
Merged

TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache on Mar 30, 2026
Conversation
The block_size=128 change (adac2c6) broke CUDA quantization: with QK=128, blocks_per_group=1, but the warp-cooperative packing still used blk_base+warp_id, causing warps 1-3 to write OOB. Fix: compute elem_in_block = j % QK_TURBO_N and use it for block pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs, elem_in_block / 8 for signs). Works for both QK=32 and QK=128. Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3: PPL = 7.587 (matches QK=32 baseline exactly).
Owner: Looking now

Owner: Thank you for the contribution! Apologies for the regression
mihai-chiorean pushed a commit to mihai-chiorean/turbo3-cuda that referenced this pull request (Mar 31, 2026):

…n data

Part of TheTom#32: turbo3 prefill degrades relative to q8_0 with context length. Changes so far:

- Skip ggml_cont when tensors already contiguous (+1%, minimal)
- Generated 32x32 rotation matrices (turbo-rotation-data-32.h) for reduced group size approach (16x less matmul compute)
- Fixed V un-rotation to check v->type not k->type

Next: update QK_TURBO3_GROUP, Metal WHT kernel, and KV cache for d=32.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
terrysimons pushed a commit to terrysimons/llama-cpp-turboquant that referenced this pull request (Mar 31, 2026).
TheTom added a commit that referenced this pull request (Apr 2, 2026).
Summary
The block_size=128 change (adac2c6) broke CUDA quantization in `set-rows.cu`. With `QK_TURBO3=128`, `blocks_per_group = 1`, but the warp-cooperative packing still computed `blk = blk_base + warp_id`, so warps 1-3 wrote `qs`, `signs`, and `norm` out of bounds, corrupting adjacent KV cache memory.

Short `llama-bench` runs (pp512, tg128) could appear to pass because the OOB writes don't immediately affect the active attention window; `llama-perplexity` over full WikiText-2 produces all-NaN output or segfaults.

Fix
Compute the element position within the block generically:

- `elem_in_block / 4` for the `qs` byte offset (range 0..31 for QK=128, 0..7 for QK=32)
- `elem_in_block / 8` for the `signs` byte offset (range 0..15 for QK=128, 0..3 for QK=32)
- `elem_in_block == 0` for the norm write gate (one per block)

Backward compatible: produces identical results with QK=32. The same fix is applied to `k_set_rows_turbo2`.

Validation
RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3, WikiText-2, 512 context: zero deviation across all three conditions (PPL = 7.587, matching the QK=32 baseline exactly). The 5.12x compression ratio is now validated on CUDA Ampere.
Context
@seanrasch independently reported the same crash on SM 86 in Discussion ggml-org#20969.