llama: consistent ctx <-> buf order for KV cache#16746
JohannesGaessler merged 1 commit into ggml-org:master
src/llama-kv-cache.cpp (outdated)
    // define a comparator for the buft -> ctx map to ensure that the order is well-defined:
    struct ggml_backend_buft_comparator {
        bool operator()(const ggml_backend_buffer_type_t & lhs, const ggml_backend_buffer_type_t & rhs) const {
            return ggml_backend_buft_name(lhs) < ggml_backend_buft_name(rhs);
Is the intention here to compare the pointers? It should be more reliable to compare the strings regardless.
The intention here is to sort alphabetically and I forgot that these are C strings.
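For illustration, here is a minimal sketch of a comparator that orders by the contents of the name strings rather than by their addresses, using `std::strcmp`. It is not the code that was merged: `fake_buft`, `fake_buft_t`, `fake_buft_name`, `fake_buft_comparator`, and `ordered_names` are hypothetical stand-ins for `ggml_backend_buffer_type_t` and `ggml_backend_buft_name` so that the snippet compiles without the ggml headers.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_backend_buffer_type_t: an opaque handle with a C-string name.
struct fake_buft { const char * name; };
typedef fake_buft * fake_buft_t;

// Hypothetical stand-in for ggml_backend_buft_name(): returns a C string.
static const char * fake_buft_name(fake_buft_t buft) {
    assert(buft != nullptr);
    return buft->name;
}

// Orders handles alphabetically by the characters of their names.
// Using operator< on the two const char * values would compare addresses instead.
struct fake_buft_comparator {
    bool operator()(const fake_buft_t & lhs, const fake_buft_t & rhs) const {
        return std::strcmp(fake_buft_name(lhs), fake_buft_name(rhs)) < 0;
    }
};

// Collects the names in the iteration order of a map keyed by the comparator above.
static std::vector<std::string> ordered_names(const std::map<fake_buft_t, int, fake_buft_comparator> & m) {
    std::vector<std::string> out;
    for (const auto & kv : m) {
        out.push_back(fake_buft_name(kv.first));
    }
    return out;
}
```

With this comparator the iteration order of the map depends only on the names, so it is the same regardless of where the buffer type objects happen to live in memory. Note that two handles with identical names would be treated as the same key.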
    struct ggml_backend_buft_comparator {
        bool operator()(const ggml_backend_buffer_type_t & lhs, const ggml_backend_buffer_type_t & rhs) const {
            return ggml_backend_buft_name(lhs) < ggml_backend_buft_name(rhs);
        }
    };
This can be moved to a common header (but not a public one) to avoid duplication, such as llama-impl.h.
The current situation is that the model has a 1-to-many mapping of contexts to buffers while the KV cache has a 1-to-1 mapping, so 2 different types are needed, where one is used once and the other one is used twice (previously the same type with a 1-to-1 mapping was used, but that was a bug). I think in the current version copying the code to 2 locations is preferable since it keeps locality. (But if you disagree I can change it.)
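To make the distinction between the two mappings concrete, here is a rough sketch of what a 1-to-1 and a 1-to-many container of contexts and buffers could look like. All names (`fake_ctx`, `fake_buf`, `ctx_buf_pairs`, `ctx_bufs_pairs`, `buffer_at`, `count_buffers`) are hypothetical and chosen for illustration only; the sketch does not reproduce the actual types used in llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a ggml context and a backend buffer.
struct fake_ctx { std::string name; };
struct fake_buf { std::string name; };

// 1-to-1: each context is stored next to exactly one buffer (the KV cache case).
using ctx_buf_pairs = std::vector<std::pair<fake_ctx, fake_buf>>;

// 1-to-many: each context is stored next to a list of buffers (the model case).
using ctx_bufs_pairs = std::vector<std::pair<fake_ctx, std::vector<fake_buf>>>;

// Returns the buffer stored alongside the i-th context in the 1-to-1 case.
static const fake_buf & buffer_at(const ctx_buf_pairs & v, std::size_t i) {
    assert(i < v.size());
    return v[i].second;
}

// Counts buffers in the 1-to-1 case: one per entry.
static std::size_t count_buffers(const ctx_buf_pairs & v) {
    return v.size();
}

// Counts buffers in the 1-to-many case: sum over the per-context lists.
static std::size_t count_buffers(const ctx_bufs_pairs & v) {
    std::size_t n = 0;
    for (const auto & entry : v) {
        n += entry.second.size();
    }
    return n;
}
```

Because each context sits in the same entry as its buffer or buffers, iterating the container visits contexts and buffers in one consistent order. The two element types differ, which is why a single shared type does not cover both cases.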
See #16581. The KV cache has the same issue, which can lead to incorrect results for #16653.