
fix: stack overflow, non-128-aligned head dims, constant coupling#18

Closed
seanrasch wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from seanrasch:fix/issue-29-v2

Conversation

@seanrasch

Resubmission of the non-breaking fixes from PR #4, rebased on current HEAD (turbo4 4-bit PolarQuant).

Dropped the turbo3/turbo4 kernel split that caused the regression. The turbo4 fixes are obsolete now that turbo4 has been redesigned. What's left:

Changes

  1. Stack overflow in turbo_init_rotation() — the 64KB stack-local float G[] segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate directly into turbo_rotation, eliminating both the stack allocation and the memcpy.
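The fix pattern can be sketched as follows. This is a minimal illustration, not the PR's actual code: `TURBO_ROT_DIM` and `turbo_rotation` mirror names from the PR, but the matrix-fill step is a placeholder for the real Gaussian generation.

```c
#include <assert.h>
#include <stdlib.h>

#define TURBO_ROT_DIM 128

/* Static storage lives in .bss, not on the calling thread's stack. */
static float turbo_rotation[TURBO_ROT_DIM * TURBO_ROT_DIM];

/* Before: a 64KB stack-local `float G[TURBO_ROT_DIM*TURBO_ROT_DIM]` was
 * filled and then memcpy'd into turbo_rotation, overflowing small worker
 * stacks. After: write each entry directly into the static array. */
static void turbo_init_rotation(void) {
    for (int i = 0; i < TURBO_ROT_DIM * TURBO_ROT_DIM; i++) {
        /* placeholder for the Gaussian draw used in the real code */
        turbo_rotation[i] = (float) rand() / (float) RAND_MAX;
    }
}
```

Because nothing larger than loop counters lives on the stack, the function is safe even under a 64KB thread stack.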

  2. Non-128-aligned head dims (fixes #13: CUDA SET_ROWS turbo3 GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576, e.g. GLM-4.7 Flash / deepseek2 K heads) — n_groups = args.nk0 / blocks_per_group silently drops tail blocks for models like GLM-4.7 Flash (head_dim=576, 18 blocks → 4.5 groups) and deepseek2 K heads (head_dim=192, 6 blocks → 1.5 groups). Fix: ceiling division + tail handling in the shared kernel_set_rows_turbo template. All loop bounds remain at compile-time QK_TURBO3_GROUP (128), with (j < grp_elems) conditionals for zero-padding — this avoids the dynamic loop bounds that may have caused the v1 regression.
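The index math behind this fix can be illustrated with a small helper. The function name and signature are hypothetical; the constants mirror the PR (32-element blocks, 4 blocks per 128-element group), and `tail_elems` plays the role of `grp_elems` for the last group.

```c
#include <assert.h>

#define QK_TURBO3_GROUP  128
#define BLOCK_SIZE        32
#define BLOCKS_PER_GROUP (QK_TURBO3_GROUP / BLOCK_SIZE)  /* 4 */

/* Hypothetical helper mirroring the kernel's group math: for a row of
 * nk0 elements, compute the group count (ceiling division, so partial
 * tail groups are kept) and the element count of the last group. */
static void turbo_group_counts(int nk0, int *n_groups, int *tail_elems) {
    int n_blocks = nk0 / BLOCK_SIZE;
    /* old code: n_blocks / BLOCKS_PER_GROUP -- floor division drops the tail */
    *n_groups   = (n_blocks + BLOCKS_PER_GROUP - 1) / BLOCKS_PER_GROUP;
    int rem     = nk0 % QK_TURBO3_GROUP;
    *tail_elems = (rem == 0) ? QK_TURBO3_GROUP : rem;
}
```

For head_dim=576 this yields 5 groups with a 64-element tail (versus 4 groups under floor division); for head_dim=192, 2 groups with a 64-element tail; for any 128-aligned width the tail group is full, so the `(j < grp_elems)` conditionals are always true and behavior is unchanged.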

  3. Constant coupling — the standalone TURBO_D 128 define is replaced with TURBO_ROT_DIM = QK_TURBO3_GROUP, plus a static_assert(QK_TURBO4 == QK_TURBO3_GROUP) guard.
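A minimal sketch of this coupling fix, assuming the constants shown (the real QK_TURBO3_GROUP and QK_TURBO4 come from ggml-common.h):

```c
/* Illustrative stand-ins for the real definitions in ggml-common.h. */
#define QK_TURBO3_GROUP 128
#define QK_TURBO4       128

/* Replaces the standalone `#define TURBO_D 128`: the rotation dimension
 * is now an alias, so it can never silently diverge from the group size. */
#define TURBO_ROT_DIM QK_TURBO3_GROUP

/* Guard the turbo4 assumption that block size == rotation group size;
 * a mismatch now fails at compile time instead of corrupting data. */
_Static_assert(QK_TURBO4 == QK_TURBO3_GROUP,
               "turbo4 assumes block size == rotation group size");
```

The `_Static_assert` costs nothing at runtime; if someone later changes one constant without the other, the build breaks with a clear message.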

  4. WHT dispatch assert — GGML_ASSERT(n_elements % 128 == 0) added in ggml_metal_op_turbo_wht for early failure with a clear error message.

What's NOT included (vs PR #4)

  • No turbo3/turbo4 kernel split (caused the PPL regression)
  • No turbo4 SET_ROWS fixes (turbo4 completely redesigned)

Testing notes

For 128-aligned models (the common case), grp_elems == 128 and grp_blocks == blocks_per_group — the conditionals are always true and the break never fires. Zero behavior change for existing models.

🤖 Generated with Claude Code

seanrasch and others added 2 commits March 28, 2026 18:40
Two fixes in ggml-turbo-quant.c:

1. Stack overflow: turbo_init_rotation() allocated a 64KB float G[] on
   the stack then memcpy'd to turbo_rotation. On llama.cpp worker threads
   with reduced stack sizes (512KB macOS, 64KB some Linux), this segfaults.
   Fix: generate directly into turbo_rotation, eliminating both the stack
   allocation and the memcpy.

2. Constant coupling: TURBO_D was a standalone #define 128, duplicating
   QK_TURBO3_GROUP from ggml-common.h. Replace with TURBO_ROT_DIM aliased
   to QK_TURBO3_GROUP. Add static_assert(QK_TURBO4 == QK_TURBO3_GROUP) to
   guard turbo4 code that assumes block size == rotation group size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models with head_dim not divisible by 128 (e.g. GLM-4.7 Flash head_dim=576
→ 576/32=18 blocks, 18/4=4.5 groups; deepseek2 K head_dim=192 → 6 blocks,
1.5 groups) silently dropped tail blocks because n_groups used floor
division.

Fix: use ceiling division for n_groups, add grp_blocks/grp_elems to handle
partial tail groups. For the common 128-aligned case, the tail-handling
conditionals are always true and the break never fires — zero behavior
change for existing models.

Design note: keep all loop bounds at compile-time QK_TURBO3_GROUP (128)
and use (j < grp_elems) conditionals for zero-padding, rather than
variable-bound loops that could affect Metal compiler codegen.

Also adds GGML_ASSERT in WHT dispatch to catch non-128-aligned elements
early with a clear error message.

Fixes: TheTom#13 (GGML_ASSERT for non-128-aligned head dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner

TheTom commented Mar 29, 2026

Status check against current HEAD (172fc85)

Reviewed PR #18's three fixes against tip-of-tree after merging signalnine's CUDA port (PR #3).

1. Stack overflow in turbo_init_rotation() — ⚠️ STILL PRESENT

float G[128*128] (64KB) is still allocated on the stack at ggml-turbo-quant.c:68. This will segfault on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux configs). The fix to generate directly into turbo_rotation and eliminate the stack allocation + memcpy is still needed.

2. Non-128-aligned head dims — ✅ FIXED (different approach)

signalnine's PR #3 addressed this with:

  • q8_0 fallback in llama-kv-cache.cpp when n_embd_head_k % 128 != 0
  • 64-element WHT groups for non-standard head dims
  • ggml_turbo_wht auto-detects group size (64 or 128) from tensor dimensions

Different fix than PR #18's tail-handling approach in the Metal SET_ROWS kernel, but the problem is resolved.
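The auto-detection described above can be sketched as a small predicate. The function name and the -1 sentinel are hypothetical; per the review, widths that fit neither group size fall back to q8_0 in llama-kv-cache.cpp.

```c
#include <assert.h>

/* Sketch of ggml_turbo_wht's group-size auto-detection as described in
 * the review: prefer full 128-element WHT groups, fall back to 64 for
 * non-standard head dims, and signal unsupported widths (caller then
 * uses the q8_0 fallback). */
static int turbo_wht_group_size(int ne0) {
    if (ne0 % 128 == 0) return 128;
    if (ne0 %  64 == 0) return  64;
    return -1; /* hypothetical sentinel: not WHT-compatible */
}
```

GLM-4.7 Flash's 576-wide rows are not 128-aligned but are 64-aligned, so they select 64-element groups instead of hitting the old assert.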

3. Constant coupling (TURBO_D 128) — ⚠️ MINOR, CPU-ONLY

TURBO_D is still a standalone #define in ggml-turbo-quant.c:22, not tied to QK_TURBO3_GROUP. Metal shader no longer uses TURBO_D (uses QK_TURBO3_GROUP directly). Low risk since it's only in the CPU reference path and both are 128, but the static_assert guard would be nice.

4. WHT dispatch assert — ✅ COVERED

ggml_turbo_wht in ggml.c now asserts a->ne[0] % group_size == 0 at graph construction time (from signalnine's changes). Covers the same case as the Metal-side assert, just earlier in the pipeline.

Recommendation

This PR needs a rebase onto current HEAD. The only fix still needed is #1 (stack overflow). The constant coupling fix (#3) is a nice-to-have. #2 and #4 are already resolved.

@seanrasch — if you want to resubmit, a minimal PR with just the turbo_init_rotation() stack fix would merge cleanly.

seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 29, 2026
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack
   then memcpy'd into the static turbo_rotation array. This segfaults on
   llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB
   some Linux). Fix: generate the Gaussian matrix directly into
   turbo_rotation, eliminating both the stack allocation and the memcpy.

2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always
   match (both represent the rotation group size). Add static_assert to
   catch silent divergence between CPU reference and GPU kernels.

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 31, 2026
@seanrasch
Author

Closing — superseded by PR #23 (minimal resubmission per your review feedback). Only the stack overflow fix + static_assert remain relevant, and those are in #23.

@seanrasch seanrasch closed this Mar 31, 2026