
fix: stack overflow, non-128-aligned head dims, constant coupling#18

Closed
seanrasch wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from seanrasch:fix/issue-29-v2

Conversation

@seanrasch

Resubmission of the non-breaking fixes from PR #4, rebased on current HEAD (turbo4 4-bit PolarQuant).

Dropped the turbo3/turbo4 kernel split that caused the regression. The turbo4 fixes are obsolete now that turbo4 has been redesigned. What's left:

Changes

  1. Stack overflow in turbo_init_rotation() — the 64KB stack-local float G[] segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate directly into turbo_rotation, eliminating both the stack allocation and the memcpy.

  2. Non-128-aligned head dims (fixes #13: CUDA SET_ROWS turbo3 GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576, e.g. GLM-4.7 Flash / deepseek2 K heads) — n_groups = args.nk0 / blocks_per_group silently drops tail blocks for models like GLM-4.7 Flash (head_dim=576, 18 blocks → 4.5 groups) and deepseek2 K heads (head_dim=192, 6 blocks → 1.5 groups). Fix: ceiling division + tail handling in the shared kernel_set_rows_turbo template. All loop bounds remain at compile-time QK_TURBO3_GROUP (128) with (j < grp_elems) conditionals for zero-padding — avoids the dynamic loop bounds that may have caused the v1 regression.
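The ceiling-division and zero-padding scheme can be sketched as below; BLOCK_ELEMS = 32 is assumed from the 576 → 18-block example above, and process_group is a hypothetical stand-in for the inner loop of kernel_set_rows_turbo:

```c
#define QK_TURBO3_GROUP 128
#define BLOCK_ELEMS      32  /* elements per quant block (assumed) */

/* Ceiling division keeps the tail group that floor division dropped. */
int n_groups_for(int row_width) {
    const int blocks           = row_width / BLOCK_ELEMS;
    const int blocks_per_group = QK_TURBO3_GROUP / BLOCK_ELEMS;  /* 4 */
    return (blocks + blocks_per_group - 1) / blocks_per_group;
}

/* Loop bound stays compile-time; the tail is masked, not shortened. */
void process_group(const float * src, float * dst, int grp_elems) {
    for (int j = 0; j < QK_TURBO3_GROUP; ++j) {
        dst[j] = (j < grp_elems) ? src[j] : 0.0f;  /* zero-pad the tail */
    }
}
```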

  3. Constant coupling — standalone TURBO_D 128 replaced with TURBO_ROT_DIM = QK_TURBO3_GROUP + static_assert(QK_TURBO4 == QK_TURBO3_GROUP) guard.
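The constant-coupling guard, sketched with the header values assumed from context:

```c
/* Values as defined in ggml-common.h (assumed here for illustration). */
#define QK_TURBO3_GROUP 128
#define QK_TURBO4       128

/* Alias instead of a free-standing `#define TURBO_D 128`, which could
 * silently diverge from the group size it duplicates. */
#define TURBO_ROT_DIM QK_TURBO3_GROUP

/* Guard turbo4 code that assumes block size == rotation group size. */
_Static_assert(QK_TURBO4 == QK_TURBO3_GROUP,
               "turbo4 block size must equal the rotation group size");
```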

  4. WHT dispatch assert — GGML_ASSERT(n_elements % 128 == 0) in ggml_metal_op_turbo_wht for early failure with a clear error message.

What's NOT included (vs PR #4)

  • No turbo3/turbo4 kernel split (caused the PPL regression)
  • No turbo4 SET_ROWS fixes (turbo4 completely redesigned)

Testing notes

For 128-aligned models (the common case), grp_elems == 128 and grp_blocks == blocks_per_group — the conditionals are always true and the break never fires. Zero behavior change for existing models.

🤖 Generated with Claude Code

seanrasch and others added 2 commits March 28, 2026 18:40
Two fixes in ggml-turbo-quant.c:

1. Stack overflow: turbo_init_rotation() allocated a 64KB float G[] on
   the stack then memcpy'd to turbo_rotation. On llama.cpp worker threads
   with reduced stack sizes (512KB macOS, 64KB some Linux), this segfaults.
   Fix: generate directly into turbo_rotation, eliminating both the stack
   allocation and the memcpy.

2. Constant coupling: TURBO_D was a standalone #define 128, duplicating
   QK_TURBO3_GROUP from ggml-common.h. Replace with TURBO_ROT_DIM aliased
   to QK_TURBO3_GROUP. Add static_assert(QK_TURBO4 == QK_TURBO3_GROUP) to
   guard turbo4 code that assumes block size == rotation group size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models with head_dim not divisible by 128 (e.g. GLM-4.7 Flash head_dim=576
→ 576/32=18 blocks, 18/4=4.5 groups; deepseek2 K head_dim=192 → 6 blocks,
1.5 groups) silently dropped tail blocks because n_groups used floor
division.

Fix: use ceiling division for n_groups, add grp_blocks/grp_elems to handle
partial tail groups. For the common 128-aligned case, the tail-handling
conditionals are always true and the break never fires — zero behavior
change for existing models.

Design note: keep all loop bounds at compile-time QK_TURBO3_GROUP (128)
and use (j < grp_elems) conditionals for zero-padding, rather than
variable-bound loops that could affect Metal compiler codegen.

Also adds GGML_ASSERT in WHT dispatch to catch non-128-aligned elements
early with a clear error message.

Fixes: TheTom#13 (GGML_ASSERT for non-128-aligned head dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner

TheTom commented Mar 29, 2026

Status check against current HEAD (172fc85)

Reviewed PR #18's three fixes against tip-of-tree after merging signalnine's CUDA port (PR #3).

1. Stack overflow in turbo_init_rotation() — ⚠️ STILL PRESENT

float G[128*128] (64KB) is still allocated on the stack at ggml-turbo-quant.c:68. This will segfault on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux configs). The fix to generate directly into turbo_rotation and eliminate the stack allocation + memcpy is still needed.

2. Non-128-aligned head dims — ✅ FIXED (different approach)

signalnine's PR #3 addressed this with:

  • q8_0 fallback in llama-kv-cache.cpp when n_embd_head_k % 128 != 0
  • 64-element WHT groups for non-standard head dims
  • ggml_turbo_wht auto-detects group size (64 or 128) from tensor dimensions

Different fix than PR #18's tail-handling approach in the Metal SET_ROWS kernel, but the problem is resolved.
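A hypothetical sketch of the group-size auto-detection described above (wht_group_size is an illustrative name, not the actual function):

```c
/* Prefer 128-element WHT groups, fall back to 64 for non-standard head
 * dims, otherwise signal the caller to use the q8_0 cache fallback. */
int wht_group_size(int ne0) {
    if (ne0 % 128 == 0) return 128;
    if (ne0 % 64  == 0) return 64;
    return 0;  /* unsupported head dim */
}
```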

3. Constant coupling (TURBO_D 128) — ⚠️ MINOR, CPU-ONLY

TURBO_D is still a standalone #define in ggml-turbo-quant.c:22, not tied to QK_TURBO3_GROUP. Metal shader no longer uses TURBO_D (uses QK_TURBO3_GROUP directly). Low risk since it's only in the CPU reference path and both are 128, but the static_assert guard would be nice.

4. WHT dispatch assert — ✅ COVERED

ggml_turbo_wht in ggml.c now asserts a->ne[0] % group_size == 0 at graph construction time (from signalnine's changes). Covers the same case as the Metal-side assert, just earlier in the pipeline.

Recommendation

This PR needs a rebase onto current HEAD. The only fix still needed is #1 (stack overflow). The constant coupling fix (#3) is a nice-to-have. #2 and #4 are already resolved.

@seanrasch — if you want to resubmit, a minimal PR with just the turbo_init_rotation() stack fix would merge cleanly.

seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 29, 2026
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack
   then memcpy'd into the static turbo_rotation array. This segfaults on
   llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB
   some Linux). Fix: generate the Gaussian matrix directly into
   turbo_rotation, eliminating both the stack allocation and the memcpy.

2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always
   match (both represent the rotation group size). Add static_assert to
   catch silent divergence between CPU reference and GPU kernels.

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@seanrasch
Author

Closing — superseded by PR #23 (minimal resubmission per your review feedback). Only the stack overflow fix + static_assert remain relevant, and those are in #23.

@seanrasch seanrasch closed this Mar 31, 2026
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Apr 10, 2026

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (TheTom#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (TheTom#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (TheTom#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (TheTom#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (TheTom#12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (TheTom#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (TheTom#17)

* meta : formatting, naming, indentation (TheTom#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
oussamaahmia pushed a commit to oussamaahmia/llama-cpp-turboquant-gemma4 that referenced this pull request Apr 13, 2026
(same commit message as above)
aminya pushed a commit to aminya/llama-cpp-turboquant that referenced this pull request Apr 25, 2026
…eTom#18)

The turbo KV cache code uses CUDA __device__ constant memory for
runtime-initialized values (InnerQ channel scales, calibration
accumulators, Q² statistics, extraction buffers). These symbols are
per-device in CUDA — each GPU gets its own copy.

All initialization functions (turbo_innerq_init, turbo_innerq_init_fattn,
turbo_innerq_start_calibration, turbo_innerq_finalize_calibration,
turbo_innerq_update_fattn_scales, turbo_q_calibrate_init,
turbo_q_calibrate_finalize, turbo_extract_init, turbo_extract_check_done)
used cudaMemcpyToSymbol without specifying which device to target, so
they only wrote to whichever GPU happened to be active. On multi-GPU
setups with --tensor-split, the other GPU(s) would read uninitialized
constant memory, causing garbage output or illegal memory access crashes.

Standard cache types like q8_0 are unaffected because they use either
compile-time constants or read directly from global memory.

Fix: save/restore the current device and loop over all devices for every
cudaMemcpyToSymbol call in the turbo initialization and calibration paths.
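The save/restore-and-loop pattern can be sketched as follows (turbo_innerq_scales and the helper name are illustrative; real code would repeat this for each symbol, and error handling is simplified):

```cuda
#include <cuda_runtime.h>

__constant__ float turbo_innerq_scales[128];  /* illustrative symbol */

/* Broadcast host data into a __constant__ symbol on every device, then
 * restore the caller's current device. Constant memory is per-GPU, so a
 * single cudaMemcpyToSymbol only reaches the currently active device. */
static cudaError_t memcpy_to_symbol_all_devices(const void * src, size_t bytes) {
    int prev_dev = 0, n_dev = 0;
    cudaError_t err = cudaGetDevice(&prev_dev);
    if (err != cudaSuccess) return err;
    err = cudaGetDeviceCount(&n_dev);
    if (err != cudaSuccess) return err;
    for (int d = 0; d < n_dev && err == cudaSuccess; ++d) {
        err = cudaSetDevice(d);
        if (err == cudaSuccess) {
            err = cudaMemcpyToSymbol(turbo_innerq_scales, src, bytes);
        }
    }
    cudaSetDevice(prev_dev);  /* restore regardless of earlier errors */
    return err;
}
```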

Fixes TheTom#17
Relates to TheTom#12

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
