
chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork #10205

Merged
mudler merged 4 commits into master from fix/turboquant-build
Jun 7, 2026

@localai-bot

Collaborator

Supersedes #10096 (same bump, but against current master with the compilation fixes the bump needs, and pinned to the latest fork HEAD).

What

Bumps TheTom/llama-cpp-turboquant to 7d9715f1 and fixes the turboquant build, which broke once the fork rebased onto modern upstream llama.cpp.

Root cause

The fork caught up to upstream's common_params_speculative refactor (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838), and get_media_marker (#21962). LocalAI's fork-compat shim (patch-grpc-server.sh plus the LOCALAI_LEGACY_LLAMA_CPP_SPEC guards) was still forcing the old fork's flat-field layout, so the now-modern fork failed to compile (has no member named 'mparams_dft' / 'type' / 'model' / ...). Separately, the fork's ggml_cuda_copy2d_across_devices uses CUDA 3D-peer copy APIs that ggml's HIP shim doesn't map, which broke only the hipblas job.

Changes

  1. Drop the obsolete legacy-spec shim. Remove the dead LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the rebased fork both take the modern path now). The one remaining gap - the fork still lacks common_params::checkpoint_min_step - is narrowed to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro. No behavior change for the stock llama-cpp build (its active code is identical - it never defined these macros).
  2. HIP/MUSA-guard the cross-device 3D-peer copy. New patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) guards the peer fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) - matching how the fork already guards the same API elsewhere - so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback.
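
As a rough illustration of change 1, the sketch below shows how a single injected macro can compile out the one field the fork lacks while leaving the rest of the code on the modern path. Only the macro name comes from this PR; `common_params_sketch` and `apply_checkpoint_min_step` are hypothetical stand-ins, and the actual guard in grpc-server.cpp may be shaped differently.

```cpp
// Assumption: patch-grpc-server.sh injects this define for the turboquant
// build only. It is defined here by hand so the sketch takes the fork path.
#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1

// Hypothetical stand-in for llama.cpp's common_params. The stock struct has
// checkpoint_min_step; the fork's struct does not.
struct common_params_sketch {
    int n_ctx = 4096;
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    int checkpoint_min_step = 0;
#endif
};

// Hypothetical helper: applies the option when the field exists in this
// build and reports whether it was applied.
bool apply_checkpoint_min_step(common_params_sketch & params, int value) {
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    params.checkpoint_min_step = value;
    return true;
#else
    (void) params;  // the field does not exist in the fork build
    (void) value;
    return false;
#endif
}
```

With the macro defined, `apply_checkpoint_min_step` is a no-op that returns `false`; a build that never defines the macro compiles the assignment unchanged, which is consistent with the stock llama-cpp build seeing no behavior change.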

Verified statically against the 7d9715f1 fork sources; the patch passes git apply --check. CI is the compile proof.

mudler added 2 commits June 6, 2026 20:39
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker
(#21962). The old fork-compat shim forced now-wrong legacy code paths,
breaking the build with errors like 'struct common_params_speculative has
no member named mparams_dft / type' and 'server_context_impl has no member
named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared
grpc-server.cpp (stock llama-cpp and the modern fork both take the modern
path now), and narrow the one remaining gap (the fork still lacks
common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
… cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that
ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant
build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
  #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through
  to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks
  cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
  HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the
  un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.

CUDA builds are unaffected.
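
A minimal host-only sketch of the guard pattern described above, assuming nothing beyond the preprocessor condition quoted in this commit message. `copy_path` and `copy2d_sketch` are hypothetical names, and a plain `memcpy` stands in for both the 3D-peer copy and the cudaMemcpyAsync staging fallback, so no GPU API is called.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical labels for the two code paths, for illustration only.
enum class copy_path { peer_fast_path, staged_fallback };

// Stand-in for a pitched 2D copy between devices. It reports which path the
// preprocessor selected, then copies row by row on the host.
copy_path copy2d_sketch(void * dst, std::size_t dpitch,
                        const void * src, std::size_t spitch,
                        std::size_t width, std::size_t height) {
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
    // CUDA build: the real code would issue the 3D-peer copy here.
    const copy_path used = copy_path::peer_fast_path;
#else
    // HIP/MUSA build: the fast path is compiled out, so the staging
    // fallback is the only code left.
    const copy_path used = copy_path::staged_fallback;
#endif
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(static_cast<char *>(dst) + row * dpitch,
                    static_cast<const char *>(src) + row * spitch,
                    width);
    }
    return used;
}
```

Because the selection happens at preprocessing time, a build without `GGML_USE_HIP` or `GGML_USE_MUSA` keeps the fast path exactly as before.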

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler mudler force-pushed the fix/turboquant-build branch from 91a3109 to 67ff7de Compare June 6, 2026 22:03
The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin:
beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate),
its llama.cpp base fails to compile the flash-attention MMA f16 kernels for
head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero /
non-constant static asserts in fattn-mma-f16.cuh). That is a deep
ggml-on-ROCm kernel issue, not something a small fork patch can paper over.
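
To illustrate why a zero `cols_per_warp` surfaces as compile errors rather than a runtime fault: in C++, dividing by zero inside a constant expression makes the expression non-constant, so any `static_assert` built on it is rejected. The sketch below is not the fattn-mma-f16.cuh logic; the function names, the 512 threshold, and the value 16 are invented for illustration.

```cpp
// Hypothetical mapping from head dimension to columns per warp. The
// threshold and values are made up and are not taken from ggml.
constexpr int cols_per_warp_sketch(int head_dim) {
    return head_dim <= 512 ? 16 : 0;
}

// Number of warp tiles across ncols, or 0 for an unsupported head dimension.
// Without the zero check, ncols / 0 would not be a constant expression and
// the static_asserts below would fail to compile.
constexpr int warp_tiles_sketch(int head_dim, int ncols) {
    return cols_per_warp_sketch(head_dim) == 0
        ? 0
        : ncols / cols_per_warp_sketch(head_dim);
}

static_assert(warp_tiles_sketch(128, 64) == 4, "supported head dimension");
static_assert(warp_tiles_sketch(640, 64) == 0, "unsupported head dimension is rejected, not divided by zero");
```

In the real kernels the value is used across many template instantiations, which is why the commit treats this as a kernel-level issue rather than something a small fork patch can address.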

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still
ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path
compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@localai-bot

Collaborator Author

Dropped the -gpu-rocm-hipblas-turboquant build flavor: the fork is not ROCm-clean at this pin (beyond the CUDA-API gaps already patched, its llama.cpp base fails to compile the head-dim-640 flash-attention MMA f16 kernels under HIP — a deep ggml-on-ROCm issue). turboquant still ships for cpu/cublas/vulkan/sycl; ROCm can be re-added once the fork's HIP path compiles.

@mudler mudler merged commit 7402d1f into master Jun 7, 2026
78 checks passed
@mudler mudler deleted the fix/turboquant-build branch June 7, 2026 08:42
