
fix: stack overflow, non-128-aligned head dims, constant coupling#18

Closed
seanrasch wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from seanrasch:fix/issue-29-v2

Conversation

@seanrasch

Resubmission of the non-breaking fixes from PR #4, rebased on current HEAD (turbo4 4-bit PolarQuant).

Dropped the turbo3/turbo4 kernel split that caused the regression. The turbo4 fixes are obsolete now that turbo4 has been redesigned. What's left:

Changes

  1. Stack overflow in turbo_init_rotation() — the 64KB stack-local float G[] segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate directly into turbo_rotation, eliminating both the stack allocation and the memcpy.

  2. Non-128-aligned head dims (fixes #13: CUDA SET_ROWS turbo3 GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576, e.g. GLM-4.7 Flash / deepseek2 K heads) — n_groups = args.nk0 / blocks_per_group silently drops tail blocks for models like GLM-4.7 Flash (head_dim=576, 18 blocks → 4.5 groups) and deepseek2 K heads (head_dim=192, 6 blocks → 1.5 groups). Fix: ceiling division + tail handling in the shared kernel_set_rows_turbo template. All loop bounds remain at compile-time QK_TURBO3_GROUP (128) with (j < grp_elems) conditionals for zero-padding — avoids the dynamic loop bounds that may have caused the v1 regression.
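The ceiling-division and zero-padding scheme can be sketched as below; BLOCK_ELEMS = 32 is assumed from the 576 → 18-block example above, and process_group is a hypothetical stand-in for the inner loop of kernel_set_rows_turbo:

```c
#define QK_TURBO3_GROUP 128
#define BLOCK_ELEMS      32  /* elements per quant block (assumed) */

/* Ceiling division keeps the tail group that floor division dropped. */
int n_groups_for(int row_width) {
    const int blocks           = row_width / BLOCK_ELEMS;
    const int blocks_per_group = QK_TURBO3_GROUP / BLOCK_ELEMS;  /* 4 */
    return (blocks + blocks_per_group - 1) / blocks_per_group;
}

/* Loop bound stays compile-time; the tail is masked, not shortened. */
void process_group(const float * src, float * dst, int grp_elems) {
    for (int j = 0; j < QK_TURBO3_GROUP; ++j) {
        dst[j] = (j < grp_elems) ? src[j] : 0.0f;  /* zero-pad the tail */
    }
}
```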

  3. Constant coupling — standalone TURBO_D 128 replaced with TURBO_ROT_DIM = QK_TURBO3_GROUP + static_assert(QK_TURBO4 == QK_TURBO3_GROUP) guard.
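The constant-coupling guard, sketched with the header values assumed from context:

```c
/* Values as defined in ggml-common.h (assumed here for illustration). */
#define QK_TURBO3_GROUP 128
#define QK_TURBO4       128

/* Alias instead of a free-standing `#define TURBO_D 128`, which could
 * silently diverge from the group size it duplicates. */
#define TURBO_ROT_DIM QK_TURBO3_GROUP

/* Guard turbo4 code that assumes block size == rotation group size. */
_Static_assert(QK_TURBO4 == QK_TURBO3_GROUP,
               "turbo4 block size must equal the rotation group size");
```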

  4. WHT dispatch assert — GGML_ASSERT(n_elements % 128 == 0) in ggml_metal_op_turbo_wht for early failure with a clear error message.

What's NOT included (vs PR #4)

  • No turbo3/turbo4 kernel split (caused the PPL regression)
  • No turbo4 SET_ROWS fixes (turbo4 completely redesigned)

Testing notes

For 128-aligned models (the common case), grp_elems == 128 and grp_blocks == blocks_per_group — the conditionals are always true and the break never fires. Zero behavior change for existing models.

🤖 Generated with Claude Code

seanrasch and others added 2 commits March 28, 2026 18:40
Two fixes in ggml-turbo-quant.c:

1. Stack overflow: turbo_init_rotation() allocated a 64KB float G[] on
   the stack then memcpy'd to turbo_rotation. On llama.cpp worker threads
   with reduced stack sizes (512KB macOS, 64KB some Linux), this segfaults.
   Fix: generate directly into turbo_rotation, eliminating both the stack
   allocation and the memcpy.

2. Constant coupling: TURBO_D was a standalone #define 128, duplicating
   QK_TURBO3_GROUP from ggml-common.h. Replace with TURBO_ROT_DIM aliased
   to QK_TURBO3_GROUP. Add static_assert(QK_TURBO4 == QK_TURBO3_GROUP) to
   guard turbo4 code that assumes block size == rotation group size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models with head_dim not divisible by 128 (e.g. GLM-4.7 Flash head_dim=576
→ 576/32=18 blocks, 18/4=4.5 groups; deepseek2 K head_dim=192 → 6 blocks,
1.5 groups) silently dropped tail blocks because n_groups used floor
division.

Fix: use ceiling division for n_groups, add grp_blocks/grp_elems to handle
partial tail groups. For the common 128-aligned case, the tail-handling
conditionals are always true and the break never fires — zero behavior
change for existing models.

Design note: keep all loop bounds at compile-time QK_TURBO3_GROUP (128)
and use (j < grp_elems) conditionals for zero-padding, rather than
variable-bound loops that could affect Metal compiler codegen.

Also adds GGML_ASSERT in WHT dispatch to catch non-128-aligned elements
early with a clear error message.

Fixes: TheTom#13 (GGML_ASSERT for non-128-aligned head dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner

TheTom commented Mar 29, 2026

Status check against current HEAD (172fc85)

Reviewed PR #18's three fixes against tip-of-tree after merging signalnine's CUDA port (PR #3).

1. Stack overflow in turbo_init_rotation() — ⚠️ STILL PRESENT

float G[128*128] (64KB) is still allocated on the stack at ggml-turbo-quant.c:68. This will segfault on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux configs). The fix to generate directly into turbo_rotation and eliminate the stack allocation + memcpy is still needed.

2. Non-128-aligned head dims — ✅ FIXED (different approach)

signalnine's PR #3 addressed this with:

  • q8_0 fallback in llama-kv-cache.cpp when n_embd_head_k % 128 != 0
  • 64-element WHT groups for non-standard head dims
  • ggml_turbo_wht auto-detects group size (64 or 128) from tensor dimensions

Different fix than PR #18's tail-handling approach in the Metal SET_ROWS kernel, but the problem is resolved.
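A hypothetical sketch of the group-size auto-detection described above (wht_group_size is an illustrative name, not the actual function):

```c
/* Prefer 128-element WHT groups, fall back to 64 for non-standard head
 * dims, otherwise signal the caller to use the q8_0 cache fallback. */
int wht_group_size(int ne0) {
    if (ne0 % 128 == 0) return 128;
    if (ne0 % 64  == 0) return 64;
    return 0;  /* unsupported head dim */
}
```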

3. Constant coupling (TURBO_D 128) — ⚠️ MINOR, CPU-ONLY

TURBO_D is still a standalone #define in ggml-turbo-quant.c:22, not tied to QK_TURBO3_GROUP. Metal shader no longer uses TURBO_D (uses QK_TURBO3_GROUP directly). Low risk since it's only in the CPU reference path and both are 128, but the static_assert guard would be nice.

4. WHT dispatch assert — ✅ COVERED

ggml_turbo_wht in ggml.c now asserts a->ne[0] % group_size == 0 at graph construction time (from signalnine's changes). Covers the same case as the Metal-side assert, just earlier in the pipeline.

Recommendation

This PR needs a rebase onto current HEAD. The only fix still needed is #1 (stack overflow). The constant coupling fix (#3) is a nice-to-have. #2 and #4 are already resolved.

@seanrasch — if you want to resubmit, a minimal PR with just the turbo_init_rotation() stack fix would merge cleanly.

seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 29, 2026
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack
   then memcpy'd into the static turbo_rotation array. This segfaults on
   llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB
   some Linux). Fix: generate the Gaussian matrix directly into
   turbo_rotation, eliminating both the stack allocation and the memcpy.

2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always
   match (both represent the rotation group size). Add static_assert to
   catch silent divergence between CPU reference and GPU kernels.

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@seanrasch
Author

Closing — superseded by PR #23 (minimal resubmission per your review feedback). Only the stack overflow fix + static_assert remain relevant, and those are in #23.

@seanrasch seanrasch closed this Mar 31, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (TheTom#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (TheTom#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (TheTom#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (TheTom#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (TheTom#12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (TheTom#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (TheTom#17)

* meta : formatting, naming, indentation (TheTom#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
oussamaahmia pushed a commit to oussamaahmia/llama-cpp-turboquant-gemma4 that referenced this pull request Apr 13, 2026
(same commit message as above)
aminya pushed a commit to aminya/llama-cpp-turboquant that referenced this pull request Apr 25, 2026
…eTom#18)

The turbo KV cache code uses CUDA __device__ constant memory for
runtime-initialized values (InnerQ channel scales, calibration
accumulators, Q² statistics, extraction buffers). These symbols are
per-device in CUDA — each GPU gets its own copy.

All initialization functions (turbo_innerq_init, turbo_innerq_init_fattn,
turbo_innerq_start_calibration, turbo_innerq_finalize_calibration,
turbo_innerq_update_fattn_scales, turbo_q_calibrate_init,
turbo_q_calibrate_finalize, turbo_extract_init, turbo_extract_check_done)
used cudaMemcpyToSymbol without specifying which device to target, so
they only wrote to whichever GPU happened to be active. On multi-GPU
setups with --tensor-split, the other GPU(s) would read uninitialized
constant memory, causing garbage output or illegal memory access crashes.

Standard cache types like q8_0 are unaffected because they use either
compile-time constants or read directly from global memory.

Fix: save/restore the current device and loop over all devices for every
cudaMemcpyToSymbol call in the turbo initialization and calibration paths.
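The save/restore-and-loop pattern can be sketched as follows (turbo_innerq_scales and the helper name are illustrative; real code would repeat this for each symbol, and error handling is simplified):

```cuda
#include <cuda_runtime.h>

__constant__ float turbo_innerq_scales[128];  /* illustrative symbol */

/* Broadcast host data into a __constant__ symbol on every device, then
 * restore the caller's current device. Constant memory is per-GPU, so a
 * single cudaMemcpyToSymbol only reaches the currently active device. */
static cudaError_t memcpy_to_symbol_all_devices(const void * src, size_t bytes) {
    int prev_dev = 0, n_dev = 0;
    cudaError_t err = cudaGetDevice(&prev_dev);
    if (err != cudaSuccess) return err;
    err = cudaGetDeviceCount(&n_dev);
    if (err != cudaSuccess) return err;
    for (int d = 0; d < n_dev && err == cudaSuccess; ++d) {
        err = cudaSetDevice(d);
        if (err == cudaSuccess) {
            err = cudaMemcpyToSymbol(turbo_innerq_scales, src, bytes);
        }
    }
    cudaSetDevice(prev_dev);  /* restore regardless of earlier errors */
    return err;
}
```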

Fixes TheTom#17
Relates to TheTom#12

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
