chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork by localai-bot · Pull Request #10205 · mudler/LocalAI

localai-bot · 2026-06-06T21:30:00Z

Supersedes #10096 (same bump, but against current master with the compilation fixes the bump needs, and pinned to the latest fork HEAD).

What

Bumps TheTom/llama-cpp-turboquant to 7d9715f1 and fixes the turboquant build, which broke once the fork rebased onto modern upstream llama.cpp.

Root cause

The fork caught up to upstream's common_params_speculative refactor (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename, and get_media_marker (#21962). LocalAI's fork-compat shim (patch-grpc-server.sh plus the LOCALAI_LEGACY_LLAMA_CPP_SPEC guards) was still forcing the old fork's flat-field layout, so the now-modern fork failed to compile with errors such as has no member named 'mparams_dft' / 'type' / 'model' / .... Separately, the fork's ggml_cuda_copy2d_across_devices uses CUDA 3D-peer copy APIs that ggml's HIP shim doesn't map, which broke only the hipblas job.

Changes

- Drop the obsolete legacy-spec shim. Remove the dead LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the rebased fork both take the modern path now). The one remaining gap (the fork still lacks common_params::checkpoint_min_step) is narrowed to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro. No behavior change for the stock llama-cpp build (its active code is identical, since it never defined these macros).
- HIP/MUSA-guard the cross-device 3D-peer copy. New patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) guards the peer fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA), matching how the fork already guards the same API elsewhere, so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback.

Verified statically against the 7d9715f1 fork sources; the patch passes git apply --check. CI is the compile proof.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the upstream common_params_speculative refactor (ggml-org/llama.cpp #22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker (#21962). The old fork-compat shim forced now-wrong legacy code paths, breaking the build with errors like 'struct common_params_speculative has no member named mparams_dft / type' and 'server_context_impl has no member named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the modern fork both take the modern path now), and narrow the one remaining gap (the fork still lacks common_params::checkpoint_min_step) to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
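The narrowed guard described in this commit can be pictured with a small, self-contained sketch. It is not the code from grpc-server.cpp: `common_params_sketch` and `apply_checkpoint_min_step` are hypothetical stand-ins, and only the macro name LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP and the field name checkpoint_min_step come from the PR. The idea is that a single macro, defined only for the fork build, removes the one access the fork cannot compile, while the stock build sees unchanged code.

```cpp
// Hypothetical stand-in for llama.cpp's common_params. Only the field name
// checkpoint_min_step and the macro name are taken from the PR text.
struct common_params_sketch {
    int n_ctx = 4096;
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    int checkpoint_min_step = 0;  // present on stock llama.cpp, missing on the fork
#endif
};

// Applies the option when the field exists in this build.
// Returns false when the guard macro compiled the field access out.
bool apply_checkpoint_min_step(common_params_sketch & params, int value) {
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    params.checkpoint_min_step = value;
    return true;
#else
    (void) params;
    (void) value;
    return false;
#endif
}
```

Compiled without the macro, as the stock llama-cpp build would be, the helper writes the field and returns true. Compiled with -DLOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, the field and its only use both disappear and the helper returns false.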

… cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.

CUDA builds are unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
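As a rough illustration of the fast-path guard, the sketch below shows only the control flow, with no CUDA calls. `copy_path`, `choose_copy2d_path` and the `peer_access_enabled` condition are assumptions made for the example and do not mirror the actual body of ggml_cuda_copy2d_across_devices. The preprocessor condition is the one quoted in the commit message.

```cpp
// Hypothetical names for illustration; the real function is
// ggml_cuda_copy2d_across_devices in ggml-cuda.cu.
enum class copy_path { peer_3d, staged };

// Picks the copy strategy. The peer fast path exists only when neither
// GGML_USE_HIP nor GGML_USE_MUSA is defined, so on those backends the
// function always reaches the staging fallback.
copy_path choose_copy2d_path(bool peer_access_enabled) {
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
    if (peer_access_enabled) {
        return copy_path::peer_3d;  // CUDA only: the 3D-peer copy call would go here
    }
#else
    (void) peer_access_enabled;     // fast path compiled out on HIP/MUSA
#endif
    return copy_path::staged;       // every backend: cudaMemcpyAsync staging fallback
}
```

Because the guard wraps only the fast path and leaves the fallback outside the conditional block, a HIP or MUSA build never references the missing API, and a CUDA build behaves as before.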

The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin: beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate), its llama.cpp base fails to compile the flash-attention MMA f16 kernels for head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero / non-constant static asserts in fattn-mma-f16.cuh). That is a deep ggml-on-ROCm kernel issue, not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
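The class of failure named here, a compile-time parameter that evaluates to 0 and is then used as a divisor in constant expressions, can be shown with a toy example. Everything in it is hypothetical: the lookup `cols_per_warp_for`, its threshold of 256, the value 8, and the helper names are invented for illustration and are not taken from fattn-mma-f16.cuh.

```cpp
// Hypothetical lookup: 0 means "no configuration for this head dimension".
// The threshold and the value 8 are made up for this sketch.
constexpr int cols_per_warp_for(int head_dim) {
    return head_dim <= 256 ? 8 : 0;
}

template <int head_dim>
struct mma_config {
    static constexpr int  cols_per_warp = cols_per_warp_for(head_dim);
    static constexpr bool supported     = cols_per_warp > 0;
};

// Dividing by cols_per_warp is only valid for supported configurations.
// The static_assert turns an unsupported instantiation into a readable
// compile error instead of a division by zero in a constant expression.
template <int head_dim>
constexpr int warps_needed(int ncols) {
    static_assert(mma_config<head_dim>::supported,
                  "no MMA configuration for this head dimension");
    return ncols / mma_config<head_dim>::cols_per_warp;
}
```

In this toy setup, `warps_needed<128>(32)` compiles, while instantiating `warps_needed<640>` fails at compile time. It suggests why such a break cannot be fixed by guarding a single call site: every kernel template that instantiates the unsupported head dimension is affected.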

localai-bot · 2026-06-06T22:44:48Z

Dropped the -gpu-rocm-hipblas-turboquant build flavor: the fork is not ROCm-clean at this pin (beyond the CUDA-API gaps already patched, its llama.cpp base fails to compile the head-dim-640 flash-attention MMA f16 kernels under HIP — a deep ggml-on-ROCm issue). turboquant still ships for cpu/cublas/vulkan/sycl; ROCm can be re-added once the fork's HIP path compiles.

mudler added 2 commits June 6, 2026 20:39

chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1

3cdd6a8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

localai-bot mentioned this pull request Jun 6, 2026

chore: ⬆️ Update TheTom/llama-cpp-turboquant to 7d9715f1f071fa07c7b2ad3dbfd320b314139e65 #10096

Closed

mudler force-pushed the fix/turboquant-build branch from 91a3109 to 67ff7de Compare June 6, 2026 22:03

mudler merged commit 7402d1f into master Jun 7, 2026
78 checks passed

mudler deleted the fix/turboquant-build branch June 7, 2026 08:42

localai-bot added the dependencies label Jun 10, 2026

BrewTestBot mentioned this pull request Jun 10, 2026

localai 4.4.0 Homebrew/homebrew-core#287347

Merged

mudler merged 4 commits into master from fix/turboquant-build
