
fix: stack overflow, non-128-aligned head dims, constant coupling#18

Closed
seanrasch wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from seanrasch:fix/issue-29-v2

Conversation

@seanrasch

Resubmission of the non-breaking fixes from PR #4, rebased on current HEAD (turbo4 4-bit PolarQuant).

Dropped the turbo3/turbo4 kernel split that caused the regression. The turbo4 fixes are obsolete now that turbo4 has been redesigned. What's left:

Changes

  1. Stack overflow in turbo_init_rotation() — the 64KB stack-local float G[] segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate directly into turbo_rotation, eliminating both the stack allocation and the memcpy.
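The fix pattern can be sketched as follows. This is a minimal illustration, not the PR's actual code: `TURBO_ROT_DIM` and `turbo_rotation` mirror names from the PR, but the matrix-fill step is a placeholder for the real Gaussian generation.

```c
#include <assert.h>
#include <stdlib.h>

#define TURBO_ROT_DIM 128

/* Static storage lives in .bss, not on the calling thread's stack. */
static float turbo_rotation[TURBO_ROT_DIM * TURBO_ROT_DIM];

/* Before: a 64KB stack-local `float G[TURBO_ROT_DIM*TURBO_ROT_DIM]` was
 * filled and then memcpy'd into turbo_rotation, overflowing small worker
 * stacks. After: write each entry directly into the static array. */
static void turbo_init_rotation(void) {
    for (int i = 0; i < TURBO_ROT_DIM * TURBO_ROT_DIM; i++) {
        /* placeholder for the Gaussian draw used in the real code */
        turbo_rotation[i] = (float) rand() / (float) RAND_MAX;
    }
}
```

Because nothing larger than loop counters lives on the stack, the function is safe even under a 64KB thread stack.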

  2. Non-128-aligned head dims (fixes #13: CUDA SET_ROWS turbo3 GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576, e.g. GLM-4.7 Flash / deepseek2 K heads) — n_groups = args.nk0 / blocks_per_group silently drops tail blocks for models like GLM-4.7 Flash (head_dim=576, 18 blocks → 4.5 groups) and deepseek2 K heads (head_dim=192, 6 blocks → 1.5 groups). Fix: ceiling division + tail handling in the shared kernel_set_rows_turbo template. All loop bounds remain at compile-time QK_TURBO3_GROUP (128), with (j < grp_elems) conditionals for zero-padding — this avoids the dynamic loop bounds that may have caused the v1 regression.
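The index math behind this fix can be illustrated with a small helper. The function name and signature are hypothetical; the constants mirror the PR (32-element blocks, 4 blocks per 128-element group), and `tail_elems` plays the role of `grp_elems` for the last group.

```c
#include <assert.h>

#define QK_TURBO3_GROUP  128
#define BLOCK_SIZE        32
#define BLOCKS_PER_GROUP (QK_TURBO3_GROUP / BLOCK_SIZE)  /* 4 */

/* Hypothetical helper mirroring the kernel's group math: for a row of
 * nk0 elements, compute the group count (ceiling division, so partial
 * tail groups are kept) and the element count of the last group. */
static void turbo_group_counts(int nk0, int *n_groups, int *tail_elems) {
    int n_blocks = nk0 / BLOCK_SIZE;
    /* old code: n_blocks / BLOCKS_PER_GROUP -- floor division drops the tail */
    *n_groups   = (n_blocks + BLOCKS_PER_GROUP - 1) / BLOCKS_PER_GROUP;
    int rem     = nk0 % QK_TURBO3_GROUP;
    *tail_elems = (rem == 0) ? QK_TURBO3_GROUP : rem;
}
```

For head_dim=576 this yields 5 groups with a 64-element tail (versus 4 groups under floor division); for head_dim=192, 2 groups with a 64-element tail; for any 128-aligned width the tail group is full, so the `(j < grp_elems)` conditionals are always true and behavior is unchanged.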

  3. Constant coupling — the standalone TURBO_D 128 define is replaced with TURBO_ROT_DIM = QK_TURBO3_GROUP, plus a static_assert(QK_TURBO4 == QK_TURBO3_GROUP) guard.
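A minimal sketch of this coupling fix, assuming the constants shown (the real QK_TURBO3_GROUP and QK_TURBO4 come from ggml-common.h):

```c
/* Illustrative stand-ins for the real definitions in ggml-common.h. */
#define QK_TURBO3_GROUP 128
#define QK_TURBO4       128

/* Replaces the standalone `#define TURBO_D 128`: the rotation dimension
 * is now an alias, so it can never silently diverge from the group size. */
#define TURBO_ROT_DIM QK_TURBO3_GROUP

/* Guard the turbo4 assumption that block size == rotation group size;
 * a mismatch now fails at compile time instead of corrupting data. */
_Static_assert(QK_TURBO4 == QK_TURBO3_GROUP,
               "turbo4 assumes block size == rotation group size");
```

The `_Static_assert` costs nothing at runtime; if someone later changes one constant without the other, the build breaks with a clear message.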

  4. WHT dispatch assert — GGML_ASSERT(n_elements % 128 == 0) added in ggml_metal_op_turbo_wht for early failure with a clear error message.

What's NOT included (vs PR #4)

  • No turbo3/turbo4 kernel split (caused the PPL regression)
  • No turbo4 SET_ROWS fixes (turbo4 completely redesigned)

Testing notes

For 128-aligned models (the common case), grp_elems == 128 and grp_blocks == blocks_per_group — the conditionals are always true and the break never fires. Zero behavior change for existing models.

🤖 Generated with Claude Code

seanrasch and others added 2 commits March 28, 2026 18:40
Two fixes in ggml-turbo-quant.c:

1. Stack overflow: turbo_init_rotation() allocated a 64KB float G[] on
   the stack then memcpy'd to turbo_rotation. On llama.cpp worker threads
   with reduced stack sizes (512KB macOS, 64KB some Linux), this segfaults.
   Fix: generate directly into turbo_rotation, eliminating both the stack
   allocation and the memcpy.

2. Constant coupling: TURBO_D was a standalone #define 128, duplicating
   QK_TURBO3_GROUP from ggml-common.h. Replace with TURBO_ROT_DIM aliased
   to QK_TURBO3_GROUP. Add static_assert(QK_TURBO4 == QK_TURBO3_GROUP) to
   guard turbo4 code that assumes block size == rotation group size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models with head_dim not divisible by 128 (e.g. GLM-4.7 Flash head_dim=576
→ 576/32=18 blocks, 18/4=4.5 groups; deepseek2 K head_dim=192 → 6 blocks,
1.5 groups) silently dropped tail blocks because n_groups used floor
division.

Fix: use ceiling division for n_groups, add grp_blocks/grp_elems to handle
partial tail groups. For the common 128-aligned case, the tail-handling
conditionals are always true and the break never fires — zero behavior
change for existing models.

Design note: keep all loop bounds at compile-time QK_TURBO3_GROUP (128)
and use (j < grp_elems) conditionals for zero-padding, rather than
variable-bound loops that could affect Metal compiler codegen.

Also adds GGML_ASSERT in WHT dispatch to catch non-128-aligned elements
early with a clear error message.

Fixes: TheTom#13 (GGML_ASSERT for non-128-aligned head dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner

TheTom commented Mar 29, 2026

Status check against current HEAD (172fc85)

Reviewed PR #18's three fixes against tip-of-tree after merging signalnine's CUDA port (PR #3).

1. Stack overflow in turbo_init_rotation() — ⚠️ STILL PRESENT

float G[128*128] (64KB) is still allocated on the stack at ggml-turbo-quant.c:68. This will segfault on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux configs). The fix to generate directly into turbo_rotation and eliminate the stack allocation + memcpy is still needed.

2. Non-128-aligned head dims — ✅ FIXED (different approach)

signalnine's PR #3 addressed this with:

  • q8_0 fallback in llama-kv-cache.cpp when n_embd_head_k % 128 != 0
  • 64-element WHT groups for non-standard head dims
  • ggml_turbo_wht auto-detects group size (64 or 128) from tensor dimensions

Different fix than PR #18's tail-handling approach in the Metal SET_ROWS kernel, but the problem is resolved.
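The auto-detection described above can be sketched as a small predicate. The function name and the -1 sentinel are hypothetical; per the review, widths that fit neither group size fall back to q8_0 in llama-kv-cache.cpp.

```c
#include <assert.h>

/* Sketch of ggml_turbo_wht's group-size auto-detection as described in
 * the review: prefer full 128-element WHT groups, fall back to 64 for
 * non-standard head dims, and signal unsupported widths (caller then
 * uses the q8_0 fallback). */
static int turbo_wht_group_size(int ne0) {
    if (ne0 % 128 == 0) return 128;
    if (ne0 %  64 == 0) return  64;
    return -1; /* hypothetical sentinel: not WHT-compatible */
}
```

GLM-4.7 Flash's 576-wide rows are not 128-aligned but are 64-aligned, so they select 64-element groups instead of hitting the old assert.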

3. Constant coupling (TURBO_D 128) — ⚠️ MINOR, CPU-ONLY

TURBO_D is still a standalone #define in ggml-turbo-quant.c:22, not tied to QK_TURBO3_GROUP. Metal shader no longer uses TURBO_D (uses QK_TURBO3_GROUP directly). Low risk since it's only in the CPU reference path and both are 128, but the static_assert guard would be nice.

4. WHT dispatch assert — ✅ COVERED

ggml_turbo_wht in ggml.c now asserts a->ne[0] % group_size == 0 at graph construction time (from signalnine's changes). Covers the same case as the Metal-side assert, just earlier in the pipeline.

Recommendation

This PR needs a rebase onto current HEAD. The only fix still needed is #1 (stack overflow). The constant coupling fix (#3) is a nice-to-have. #2 and #4 are already resolved.

@seanrasch — if you want to resubmit, a minimal PR with just the turbo_init_rotation() stack fix would merge cleanly.

seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 29, 2026
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack
   then memcpy'd into the static turbo_rotation array. This segfaults on
   llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB
   some Linux). Fix: generate the Gaussian matrix directly into
   turbo_rotation, eliminating both the stack allocation and the memcpy.

2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always
   match (both represent the rotation group size). Add static_assert to
   catch silent divergence between CPU reference and GPU kernels.

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 31, 2026
@seanrasch
Author

Closing — superseded by PR #23 (minimal resubmission per your review feedback). Only the stack overflow fix + static_assert remain relevant, and those are in #23.

@seanrasch seanrasch closed this Mar 31, 2026