chore(turboquant): bump to 7d9715f1 + fix compilation against rebased fork #10205
Merged
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the upstream common_params_speculative refactor (ggml-org/llama.cpp #22397/#22838/#22964), the model_tgt rename (#22838) and get_media_marker (#21962). The old fork-compat shim forced now-wrong legacy code paths, breaking the build with errors like 'struct common_params_speculative has no member named mparams_dft / type' and 'server_context_impl has no member named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the modern fork both take the modern path now), and narrow the one remaining gap (the fork still lacks common_params::checkpoint_min_step) to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
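A minimal sketch of how a single-field compatibility guard of this kind can look. Only the macro name LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP and the field name checkpoint_min_step come from the commit message; params_sketch and set_checkpoint_min_step are hypothetical stand-ins, not the real llama.cpp or LocalAI code.

```cpp
// Hypothetical stand-in for common_params. Per the commit message, the fork's
// struct lacks checkpoint_min_step, which is modelled here by compiling the
// field out when the guard macro is defined.
struct params_sketch {
    int n_ctx = 4096;
#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    int checkpoint_min_step = 0;
#endif
};

// Hypothetical helper: stores the option only where the field exists.
// Returns true if the value was stored, false if the build has no such field.
inline bool set_checkpoint_min_step(params_sketch & p, int value) {
#ifdef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
    (void) p;      // field is absent in this build; nothing to store
    (void) value;
    return false;
#else
    p.checkpoint_min_step = value;
    return true;
#endif
}
```

Built without the macro, the helper stores the value; built with -DLOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, the same source compiles against a struct that has no such member. Keeping the guard scoped to this one field is what lets every other code path stay shared between the stock build and the fork.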
… cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs that ggml's HIP/MUSA shim does not provide, breaking the -gpu-rocm-hipblas-turboquant build. patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback (HIP genuinely lacks cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming) instead of the un-aliased plain cudaEventCreate, matching this file's own usage elsewhere.

CUDA builds are unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
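The guard on the peer fast path can be sketched as follows. This is an illustration only: GGML_USE_HIP and GGML_USE_MUSA are the macros named in the commit message, while copy_path, choose_copy2d_path and the peer_access_enabled condition are hypothetical, and no real CUDA call is made so the sketch compiles without a GPU toolchain.

```cpp
// Hypothetical labels for the two routes described in the commit message.
enum class copy_path {
    peer_3d,       // stands in for the cudaMemcpy3DPeerAsync fast path
    staged_async   // stands in for the cudaMemcpyAsync staging fallback
};

// Hypothetical selector. Assumes the fast path is only taken when peer access
// is available; the real condition in ggml-cuda.cu may differ.
inline copy_path choose_copy2d_path(bool peer_access_enabled) {
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
    // CUDA-only: HIP/MUSA builds never compile this branch, so the missing
    // 3D-peer API is never referenced there.
    if (peer_access_enabled) {
        return copy_path::peer_3d;
    }
#else
    (void) peer_access_enabled; // unused when the fast path is compiled out
#endif
    return copy_path::staged_async;
}
```

Because the fallback return sits outside the guard, a HIP or MUSA build needs no extra branch of its own: it simply reaches the staging path that already exists for the non-peer case.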
The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current pin: beyond the CUDA-API gaps already patched (3D-peer copy, cudaEventCreate), its llama.cpp base fails to compile the flash-attention MMA f16 kernels for head-dim 640 under HIP (cols_per_warp evaluates to 0 -> division-by-zero / non-constant static asserts in fattn-mma-f16.cuh). That is a deep ggml-on-ROCm kernel issue, not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant still ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's HIP path compiles (or upstream ggml fixes the large-head-dim MMA kernels for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
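To illustrate why a compile-time value of 0 produces this class of error, here is a toy sketch. It is not taken from fattn-mma-f16.cuh: tiles_per_block and the numbers 64 and 16 are hypothetical, and only the name cols_per_warp comes from the commit message.

```cpp
// Hypothetical helper: divides only when the tile width is non-zero, so the
// call stays a valid constant expression even for an unsupported configuration.
constexpr int tiles_per_block(int ncols, int cols_per_warp) {
    return cols_per_warp > 0 ? ncols / cols_per_warp : 0;
}

// Valid configuration: the division is evaluated at compile time.
static_assert(tiles_per_block(64, 16) == 4, "64 columns split into tiles of 16");

// Degenerate configuration: the guarded helper yields 0 instead of dividing.
static_assert(tiles_per_block(64, 0) == 0, "zero-width tile is reported as 0");

// By contrast, an unguarded expression such as
//     constexpr int zero = 0;
//     static_assert(64 / zero >= 0, "");
// is rejected by the compiler, because division by zero is not a constant
// expression and the static_assert condition therefore is not constant.
```

In the real kernels the value feeds template parameters and tile arithmetic, so a local guard like this one would not be enough, which is consistent with the commit's decision to drop the ROCm job rather than patch around the problem.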
Dropped the |
Supersedes #10096 (same bump, but against current master with the compilation fixes the bump needs, and pinned to the latest fork HEAD).
What
Bumps TheTom/llama-cpp-turboquant to 7d9715f1 and fixes the turboquant build, which broke once the fork rebased onto modern upstream llama.cpp.

Root cause

The fork caught up to upstream's common_params_speculative refactor (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename and get_media_marker (#21962). LocalAI's fork-compat shim (patch-grpc-server.sh + the LOCALAI_LEGACY_LLAMA_CPP_SPEC guards) was still forcing the old fork's flat-field layout, so the now-modern fork failed to compile (has no member named 'mparams_dft' / 'type' / 'model' / ...). Separately, the fork's ggml_cuda_copy2d_across_devices uses CUDA 3D-peer copy APIs that ggml's HIP shim doesn't map, breaking only the hipblas job.

Changes

- Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the shared grpc-server.cpp (stock llama-cpp and the rebased fork both take the modern path now). The one remaining gap - the fork still lacks common_params::checkpoint_min_step - is narrowed to a dedicated LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by patch-grpc-server.sh. The patch script now only adds the turbo2/3/4 KV-cache types and injects that one macro. No behavior change for the stock llama-cpp build (its active code is identical - it never defined these macros).
- patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by apply-patches.sh) guards the peer fast path with #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) - matching how the fork already guards the same API elsewhere - so HIP/MUSA fall through to the existing cudaMemcpyAsync staging fallback.

Verified statically against the 7d9715f1 fork sources; the patch passes git apply --check. CI is the compile proof.