llama: consistent ctx <-> buf order for KV cache by JohannesGaessler · Pull Request #16746 · ggml-org/llama.cpp

JohannesGaessler · 2025-10-23T19:32:22Z

See #16581. The KV cache has the same issue, which can lead to incorrect results for #16653.

ggml/include/ggml-cpp.h

slaren · 2025-10-27T15:15:57Z

src/llama-kv-cache.cpp

+    // define a comparator for the buft -> ctx map to ensure that the order is well-defined:
+    struct ggml_backend_buft_comparator {
+        bool operator()(const ggml_backend_buffer_type_t & lhs, const ggml_backend_buffer_type_t & rhs) const {
+            return ggml_backend_buft_name(lhs) < ggml_backend_buft_name(rhs);


Is the intention here to compare the pointers? It should be more reliable to compare the strings regardless.

The intention here is to sort alphabetically and I forgot that these are C strings.
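
For illustration, a minimal sketch of the distinction raised in this exchange: applying `<` to two `const char *` names orders them by pointer address, whereas `strcmp` orders them by string content. The names `fake_buft`, `fake_buft_name`, `fake_buft_comparator` and `ordered_names` are hypothetical stand-ins so that the example compiles without ggml; this is not the code that was merged.

```cpp
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_backend_buffer_type_t (an opaque handle in ggml).
struct fake_buft {
    const char * name;
};
using fake_buft_t = fake_buft *;

// Hypothetical stand-in for ggml_backend_buft_name(): returns a C string.
static const char * fake_buft_name(fake_buft_t buft) {
    return buft->name;
}

// Orders handles by the content of their names rather than by pointer value.
struct fake_buft_comparator {
    bool operator()(const fake_buft_t & lhs, const fake_buft_t & rhs) const {
        return std::strcmp(fake_buft_name(lhs), fake_buft_name(rhs)) < 0;
    }
};

// Inserts the handles into a map keyed with the comparator above and
// returns their names in the map's iteration order.
static std::vector<std::string> ordered_names(const std::vector<fake_buft_t> & bufts) {
    std::map<fake_buft_t, int, fake_buft_comparator> by_name;
    for (fake_buft_t buft : bufts) {
        by_name.emplace(buft, 0);
    }
    std::vector<std::string> names;
    for (const auto & kv : by_name) {
        names.emplace_back(fake_buft_name(kv.first));
    }
    return names;
}
```

With a comparator like this, the iteration order depends only on the names and not on where the handles happen to be allocated. One consequence of the sketch is that two distinct handles with identical names would be treated as the same key.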

slaren · 2025-10-27T15:17:33Z

src/llama-memory-recurrent.cpp

+    struct ggml_backend_buft_comparator {
+        bool operator()(const ggml_backend_buffer_type_t & lhs, const ggml_backend_buffer_type_t & rhs) const {
+            return ggml_backend_buft_name(lhs) < ggml_backend_buft_name(rhs);
+        }
+    };


This can be moved to a common header (but not a public one) to avoid duplication, such as llama-impl.h.

The current situation is that the model has a 1-to-many mapping of contexts to buffers while the KV cache has a 1-to-1 mapping, so two different types are needed, where one is used once and the other one is used twice. (Previously the same type with a 1-to-1 mapping was used, but that was a bug.) I think that in the current version copying the code to two locations is preferable since it keeps locality. (But if you disagree I can change it.)
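
As a rough sketch of the two mapping shapes described in this reply, assuming plain structs in place of the real context and buffer handles: the thread does not show the actual container types, so `fake_ctx`, `fake_buf`, `ctx_to_bufs`, `ctx_to_buf` and `count_buffers` are all hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a ggml context and a backend buffer.
struct fake_ctx {
    std::string name;
};
struct fake_buf {
    std::string name;
};

// 1-to-many: each context is paired with a list of buffers (the model case).
using ctx_to_bufs = std::vector<std::pair<fake_ctx, std::vector<fake_buf>>>;

// 1-to-1: each context is paired with exactly one buffer (the KV cache case).
using ctx_to_buf = std::vector<std::pair<fake_ctx, fake_buf>>;

// Total number of buffers in a 1-to-many mapping.
static std::size_t count_buffers(const ctx_to_bufs & mapping) {
    std::size_t n = 0;
    for (const auto & entry : mapping) {
        n += entry.second.size();
    }
    return n;
}

// In a 1-to-1 mapping there is one buffer per entry.
static std::size_t count_buffers(const ctx_to_buf & mapping) {
    return mapping.size();
}
```

Because the two shapes are distinct types, code written for one does not apply to the other unchanged, which is one way to read the argument for keeping the helper next to each use.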


JohannesGaessler requested review from CISC, ggerganov and slaren as code owners October 23, 2025 19:32

JohannesGaessler mentioned this pull request Oct 23, 2025

llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization #16653

Merged

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 23, 2025

slaren reviewed Oct 23, 2025


ggml/include/ggml-cpp.h (outdated, resolved)

JohannesGaessler force-pushed the llama-kv-cache-fix-ctx-order branch 3 times, most recently from 41deb58 to 89cdb8f Compare October 27, 2025 13:49

slaren reviewed Oct 27, 2025


JohannesGaessler force-pushed the llama-kv-cache-fix-ctx-order branch from 89cdb8f to e70fbe3 Compare October 27, 2025 15:45

llama: consistent ctx <-> buf order for KV cache

98c7edd

JohannesGaessler force-pushed the llama-kv-cache-fix-ctx-order branch from e70fbe3 to 98c7edd Compare October 27, 2025 21:53

slaren approved these changes Oct 28, 2025


JohannesGaessler merged commit 7a0e900 into ggml-org:master Oct 28, 2025
72 checks passed

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

llama: consistent ctx <-> buf order for KV cache (ggml-org#16746)

196f410

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

llama: consistent ctx <-> buf order for KV cache (#16746)

dfc3b85
