fix: stack overflow, non-128-aligned head dims, constant coupling #18
seanrasch wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from
Conversation
Two fixes in `ggml-turbo-quant.c`:

1. **Stack overflow**: `turbo_init_rotation()` allocated a 64KB `float G[]` on the stack, then memcpy'd it into `turbo_rotation`. On llama.cpp worker threads with reduced stack sizes (512KB on macOS, 64KB on some Linux configurations), this segfaults. Fix: generate directly into `turbo_rotation`, eliminating both the stack allocation and the memcpy.
2. **Constant coupling**: `TURBO_D` was a standalone `#define 128`, duplicating `QK_TURBO3_GROUP` from `ggml-common.h`. Replace it with `TURBO_ROT_DIM`, aliased to `QK_TURBO3_GROUP`, and add `static_assert(QK_TURBO4 == QK_TURBO3_GROUP)` to guard turbo4 code that assumes block size == rotation group size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models with `head_dim` not divisible by 128 (e.g. GLM-4.7 Flash `head_dim=576` → 576/32 = 18 blocks, 18/4 = 4.5 groups; deepseek2 K `head_dim=192` → 6 blocks, 1.5 groups) silently dropped tail blocks because `n_groups` used floor division. Fix: use ceiling division for `n_groups`, and add `grp_blocks`/`grp_elems` to handle partial tail groups. For the common 128-aligned case, the tail-handling conditionals are always true and the break never fires, so there is zero behavior change for existing models.

Design note: keep all loop bounds at the compile-time `QK_TURBO3_GROUP` (128) and use `(j < grp_elems)` conditionals for zero-padding, rather than variable-bound loops that could affect Metal compiler codegen.

Also adds a `GGML_ASSERT` in the WHT dispatch to catch non-128-aligned element counts early with a clear error message.

Fixes: TheTom#13 (GGML_ASSERT for non-128-aligned head dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
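The arithmetic in the commit message above can be sketched in plain C. Names (`blocks_per_group`, `grp_blocks`, `grp_elems`) follow the PR text, but the exact kernel code is an assumption; this just demonstrates the ceiling division and the fixed-bound zero-padding pattern.

```c
#include <assert.h>

#define QK_TURBO3_GROUP 128   /* rotation group size, in elements */
#define TURBO_BLOCK      32   /* elements per quant block (per the PR math) */

enum { blocks_per_group = QK_TURBO3_GROUP / TURBO_BLOCK };  /* 4 */

/* Before: n_groups = n_blocks / blocks_per_group (floor) dropped the tail.
 * After: ceiling division keeps a partial tail group. */
static int turbo_n_groups(int n_blocks) {
    return (n_blocks + blocks_per_group - 1) / blocks_per_group;
}

/* Blocks actually present in group g; full groups get blocks_per_group. */
static int turbo_grp_blocks(int n_blocks, int g) {
    int rem = n_blocks - g * blocks_per_group;
    return rem < blocks_per_group ? rem : blocks_per_group;
}

/* Zero-padding with a compile-time loop bound: the bound stays at
 * QK_TURBO3_GROUP so codegen is unaffected by the head dim; the
 * (j < grp_elems) conditional handles the partial tail and is always
 * true when the head dim is 128-aligned. */
static void turbo_load_group(float dst[QK_TURBO3_GROUP],
                             const float *src, int grp_elems) {
    for (int j = 0; j < QK_TURBO3_GROUP; j++) {
        dst[j] = (j < grp_elems) ? src[j] : 0.0f;
    }
}
```

With floor division, GLM-4.7 Flash's 18 blocks yield 4 groups and the last 2 blocks vanish; with ceiling division they yield 5 groups, the fifth holding 2 real blocks and zero-padding for the rest.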
Status check against current HEAD (
1. `turbo_init_rotation()` allocated `float G[128*128]` (64KB) on the stack, then memcpy'd it into the static `turbo_rotation` array. This segfaults on llama.cpp worker threads with reduced stack sizes (512KB on macOS, 64KB on some Linux configurations). Fix: generate the Gaussian matrix directly into `turbo_rotation`, eliminating both the stack allocation and the memcpy.
2. `TURBO_D` and `QK_TURBO3_GROUP` are defined separately but must always match (both represent the rotation group size). Add a `static_assert` to catch silent divergence between the CPU reference and the GPU kernels.

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
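The constant-coupling guard from point 2 can be sketched as follows. The values (128) come from the PR description; in the real tree `QK_TURBO3_GROUP` and `QK_TURBO4` live in `ggml-common.h`, so the defines here are stand-ins for illustration only.

```c
#include <assert.h>

/* Stand-ins for the real constants from ggml-common.h. */
#define QK_TURBO3_GROUP 128
#define QK_TURBO4       128

/* Before: #define TURBO_D 128 — a silent duplicate of QK_TURBO3_GROUP.
 * After: an alias, so the two names can never drift apart. */
#define TURBO_ROT_DIM QK_TURBO3_GROUP

/* Fails the build, not at runtime, if the constants ever diverge. */
_Static_assert(QK_TURBO4 == QK_TURBO3_GROUP,
               "turbo4 assumes block size == rotation group size");
```

Aliasing plus `_Static_assert` (C11) turns a cross-file invariant into a compile error instead of a subtle numerical mismatch between the CPU reference and the GPU kernels.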
Resubmission of the non-breaking fixes from PR #4, rebased on current HEAD (turbo4 4-bit PolarQuant).
Dropped the turbo3/turbo4 kernel split that caused the regression. The turbo4 fixes are obsolete now that turbo4 has been redesigned. What's left:
Changes
1. **Stack overflow in `turbo_init_rotation()`** — the 64KB stack-local `float G[]` segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate directly into `turbo_rotation`, eliminating both the stack allocation and the memcpy.
2. **Non-128-aligned head dims** (fixes #13: "CUDA `SET_ROWS` turbo3: `GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0)` fails when row width is 576 (e.g. GLM-4.7 Flash / deepseek2 K heads)") — `n_groups = args.nk0 / blocks_per_group` silently drops tail blocks for models like GLM-4.7 Flash (head_dim=576, 18 blocks → 4.5 groups) and deepseek2 K heads (head_dim=192, 6 blocks → 1.5 groups). Fix: ceiling division + tail handling in the shared `kernel_set_rows_turbo` template. All loop bounds remain at the compile-time `QK_TURBO3_GROUP` (128) with `(j < grp_elems)` conditionals for zero-padding, avoiding the dynamic loop bounds that may have caused the v1 regression.
3. **Constant coupling** — the standalone `TURBO_D 128` is replaced with `TURBO_ROT_DIM = QK_TURBO3_GROUP` plus a `static_assert(QK_TURBO4 == QK_TURBO3_GROUP)` guard.
4. **WHT dispatch assert** — `GGML_ASSERT(n_elements % 128 == 0)` in `ggml_metal_op_turbo_wht` for early failure with a clear error message.

What's NOT included (vs PR #4)
Testing notes
For 128-aligned models (the common case), `grp_elems == 128` and `grp_blocks == blocks_per_group`: the conditionals are always true and the break never fires. Zero behavior change for existing models.

🤖 Generated with Claude Code