fix: stack overflow, non-128-aligned head dims, constant coupling #18

seanrasch wants to merge 2 commits into TheTom:feature/turboquant-kv-cache from
Conversation
Two fixes in ggml-turbo-quant.c:

1. Stack overflow: turbo_init_rotation() allocated a 64KB float G[] on the stack, then memcpy'd it into turbo_rotation. On llama.cpp worker threads with reduced stack sizes (512KB on macOS, 64KB on some Linux systems), this segfaults. Fix: generate directly into turbo_rotation, eliminating both the stack allocation and the memcpy.

2. Constant coupling: TURBO_D was a standalone #define 128, duplicating QK_TURBO3_GROUP from ggml-common.h. Replace it with TURBO_ROT_DIM aliased to QK_TURBO3_GROUP, and add static_assert(QK_TURBO4 == QK_TURBO3_GROUP) to guard turbo4 code that assumes block size == rotation group size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
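For reference, a minimal sketch of the reworked initializer's shape. This is not the PR's code: `gaussian_f32()` is a hypothetical stand-in for whatever Gaussian sampler ggml-turbo-quant.c actually uses.

```c
#include <math.h>
#include <stdlib.h>

#define TURBO_ROT_DIM 128  /* aliased to QK_TURBO3_GROUP in the PR */

static float turbo_rotation[TURBO_ROT_DIM * TURBO_ROT_DIM];

/* Hypothetical Box-Muller sampler; the real file may use a different RNG. */
static float gaussian_f32(void) {
    float u1 = ((float) rand() + 1.0f) / ((float) RAND_MAX + 2.0f);
    float u2 = ((float) rand() + 1.0f) / ((float) RAND_MAX + 2.0f);
    return sqrtf(-2.0f * logf(u1)) * cosf(6.28318531f * u2);
}

static void turbo_init_rotation(void) {
    /* Before: float G[TURBO_ROT_DIM * TURBO_ROT_DIM] (64KB) on the stack,
     * then memcpy into turbo_rotation - overflows small worker-thread stacks.
     * After: write straight into the static array; no stack buffer, no memcpy. */
    for (int i = 0; i < TURBO_ROT_DIM * TURBO_ROT_DIM; ++i) {
        turbo_rotation[i] = gaussian_f32();
    }
}
```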
Models with head_dim not divisible by 128 (e.g. GLM-4.7 Flash head_dim=576 → 576/32 = 18 blocks, 18/4 = 4.5 groups; deepseek2 K head_dim=192 → 6 blocks, 1.5 groups) silently dropped tail blocks because n_groups used floor division. Fix: use ceiling division for n_groups and add grp_blocks/grp_elems to handle partial tail groups. For the common 128-aligned case, the tail-handling conditionals are always true and the break never fires — zero behavior change for existing models.

Design note: keep all loop bounds at the compile-time QK_TURBO3_GROUP (128) and use (j < grp_elems) conditionals for zero-padding, rather than variable-bound loops that could affect Metal compiler codegen.

Also adds a GGML_ASSERT in the WHT dispatch to catch non-128-aligned element counts early with a clear error message.

Fixes: TheTom#13 (GGML_ASSERT for non-128-aligned head dims)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
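The group-count arithmetic, as a self-contained sketch. Names like `nk0` and `blocks_per_group` mirror the commit message; the block size of 32 is taken from the worked examples above (576/32 = 18 blocks).

```c
#include <stdio.h>

enum { BLOCK_ELEMS = 32, QK_TURBO3_GROUP = 128,
       BLOCKS_PER_GROUP = QK_TURBO3_GROUP / BLOCK_ELEMS };

int main(void) {
    const int head_dims[] = { 128, 192, 576 };
    for (int i = 0; i < 3; ++i) {
        const int nk0      = head_dims[i] / BLOCK_ELEMS;                        // total blocks per row
        const int floor_g  = nk0 / BLOCKS_PER_GROUP;                           // old: tail blocks dropped
        const int n_groups = (nk0 + BLOCKS_PER_GROUP - 1) / BLOCKS_PER_GROUP;  // new: ceiling division
        const int tail_blk = nk0 - (n_groups - 1) * BLOCKS_PER_GROUP;          // blocks in the last group
        printf("head_dim=%3d: blocks=%2d floor=%d ceil=%d tail_blocks=%d (grp_elems=%d)\n",
               head_dims[i], nk0, floor_g, n_groups, tail_blk, tail_blk * BLOCK_ELEMS);
    }
    return 0;
}
```

For head_dim=576 this prints 5 groups with a 2-block (64-element) tail, and for head_dim=128 it degenerates to one full group, matching the zero-behavior-change claim.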
Status check against current HEAD:
1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack, then memcpy'd it into the static turbo_rotation array. This segfaults on llama.cpp worker threads with reduced stack sizes (512KB on macOS, 64KB on some Linux systems). Fix: generate the Gaussian matrix directly into turbo_rotation, eliminating both the stack allocation and the memcpy.

2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always match (both represent the rotation group size). Add a static_assert to catch silent divergence between the CPU reference and the GPU kernels.

Fixes: TheTom#29 (remaining items from PR TheTom#18 review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
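A sketch of what the coupling guard looks like, assuming C11 `static_assert`. The values and intent are as described in the commit; the definitions normally come from ggml-common.h but are repeated here so the block stands alone.

```c
#include <assert.h>  /* static_assert (C11) */

/* Normally defined in ggml-common.h; repeated here for a self-contained sketch. */
#define QK_TURBO3_GROUP 128              /* rotation group size, shared CPU/GPU */
#define QK_TURBO4       128              /* turbo4 block size                   */
#define TURBO_ROT_DIM   QK_TURBO3_GROUP  /* replaces the standalone TURBO_D 128 */

/* turbo4 assumes one rotation group per block; fail the build if that drifts. */
static_assert(QK_TURBO4 == QK_TURBO3_GROUP,
              "turbo4 assumes block size == rotation group size");
```

A build-time assert is the right tool here: the CPU reference and the GPU kernels compile from different files, so a silent mismatch would only surface as wrong output at runtime.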
…)

* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (TheTom#8)
* Fix crash with Qwen-30B-A3B Q4_0. Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (TheTom#9). KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (TheTom#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce (TheTom#11). There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.
* Enable the previous allreduce implementation. It is better in both perf and stability (TheTom#12)
* delay AllReduce for MoE for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (TheTom#16)
  * cont : fix devs size check
  * cont : fix log type
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (TheTom#17)
* meta : formatting, naming, indentation (TheTom#18)
  * formatting : llama-model.cpp
  * formatting : ggml-ext.h
  * formatting : ggml-backend-meta.cpp
  * meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…eTom#18)

The turbo KV cache code uses CUDA __device__ constant memory for runtime-initialized values (InnerQ channel scales, calibration accumulators, Q² statistics, extraction buffers). These symbols are per-device in CUDA — each GPU gets its own copy.

All initialization functions (turbo_innerq_init, turbo_innerq_init_fattn, turbo_innerq_start_calibration, turbo_innerq_finalize_calibration, turbo_innerq_update_fattn_scales, turbo_q_calibrate_init, turbo_q_calibrate_finalize, turbo_extract_init, turbo_extract_check_done) used cudaMemcpyToSymbol without specifying which device to target, so they only wrote to whichever GPU happened to be active. On multi-GPU setups with --tensor-split, the other GPU(s) would read uninitialized constant memory, causing garbage output or illegal memory access crashes. Standard cache types like q8_0 are unaffected because they use either compile-time constants or read directly from global memory.

Fix: save/restore the current device and loop over all devices for every cudaMemcpyToSymbol call in the turbo initialization and calibration paths.

Fixes TheTom#17
Relates to TheTom#12

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
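The shape of the fix, sketched against the CUDA runtime API. `turbo_example_scales` and `copy_to_symbol_all_devices` are illustrative names, not the PR's identifiers; only the save/loop/restore pattern is taken from the commit message.

```c
#include <cuda_runtime.h>

// Illustrative constant-memory symbol; the PR's real symbols are the InnerQ
// scales, calibration accumulators, etc. Constant memory is per-device, so a
// single cudaMemcpyToSymbol only reaches the currently active GPU.
__constant__ float turbo_example_scales[128];

// Broadcast a host buffer to the symbol on every device, then restore the
// caller's active device.
static cudaError_t copy_to_symbol_all_devices(const float * host, size_t bytes) {
    int prev_dev = 0, n_dev = 0;
    cudaError_t err = cudaGetDevice(&prev_dev);
    if (err != cudaSuccess) return err;
    err = cudaGetDeviceCount(&n_dev);
    if (err != cudaSuccess) return err;
    for (int dev = 0; dev < n_dev; ++dev) {
        err = cudaSetDevice(dev);      // constant memory writes target the active device
        if (err != cudaSuccess) break;
        err = cudaMemcpyToSymbol(turbo_example_scales, host, bytes);
        if (err != cudaSuccess) break;
    }
    cudaError_t restore = cudaSetDevice(prev_dev);  // restore the caller's device
    return err != cudaSuccess ? err : restore;
}
```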
Resubmission of the non-breaking fixes from PR #4, rebased on current HEAD (turbo4 4-bit PolarQuant).
Dropped the turbo3/turbo4 kernel split that caused the regression. The turbo4 fixes are obsolete now that turbo4 has been redesigned. What's left:
Changes
- **Stack overflow in `turbo_init_rotation()`** — the 64KB stack-local `float G[]` segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate directly into `turbo_rotation`, eliminating both the stack allocation and the memcpy.
- **Non-128-aligned head dims** (fixes #13: CUDA `SET_ROWS` turbo3 `GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0)` fails when row width is 576, e.g. GLM-4.7 Flash / deepseek2 K heads) — `n_groups = args.nk0 / blocks_per_group` silently drops tail blocks for models like GLM-4.7 Flash (head_dim=576, 18 blocks → 4.5 groups) and deepseek2 K heads (head_dim=192, 6 blocks → 1.5 groups). Fix: ceiling division + tail handling in the shared `kernel_set_rows_turbo` template. All loop bounds remain at the compile-time `QK_TURBO3_GROUP` (128) with `(j < grp_elems)` conditionals for zero-padding — avoids the dynamic loop bounds that may have caused the v1 regression. See the sketch after this list.
- **Constant coupling** — the standalone `TURBO_D 128` is replaced with `TURBO_ROT_DIM = QK_TURBO3_GROUP` plus a `static_assert(QK_TURBO4 == QK_TURBO3_GROUP)` guard.
- **WHT dispatch assert** — `GGML_ASSERT(n_elements % 128 == 0)` in `ggml_metal_op_turbo_wht` for early failure with a clear error message.
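A scalar C sketch of the zero-padding pattern described above. The real implementation is the `kernel_set_rows_turbo` Metal template; `load_group_padded` is a hypothetical helper that only illustrates the compile-time-bound loop with a `(j < grp_elems)` predicate.

```c
#define QK_TURBO3_GROUP 128

/* Load one (possibly partial) group into a zero-padded buffer. grp_elems is
 * 128 for full groups and smaller only for the tail group of non-128-aligned
 * head dims; the loop bound itself never changes. */
static void load_group_padded(const float * src, float padded[QK_TURBO3_GROUP],
                              int grp_elems) {
    for (int j = 0; j < QK_TURBO3_GROUP; ++j) {
        padded[j] = (j < grp_elems) ? src[j] : 0.0f;
    }
}
```

Keeping the bound at `QK_TURBO3_GROUP` means the compiler sees the same trip count for every model; only the predicate operand varies, which is the design note's rationale for avoiding variable-bound loops.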
What's NOT included (vs PR #4)

Testing notes
For 128-aligned models (the common case), `grp_elems == 128` and `grp_blocks == blocks_per_group` — the conditionals are always true and the break never fires. Zero behavior change for existing models.

🤖 Generated with Claude Code