Releases: LMCache/LMCache
Nightly 2026-05-09 · CUDA 13.0
Nightly CUDA 13.0 wheels built from dev on 2026-05-09.
uv pip install lmcache --pre \
--extra-index-url https://download.pytorch.org/whl/cu130 \
--find-links https://github.com/LMCache/LMCache/releases/expanded_assets/nightly-cu13 \
--index-strategy unsafe-best-match
Nightly 2026-05-09 · CUDA 12.9
Nightly CUDA 12.9 wheels built from dev on 2026-05-09.
uv pip install lmcache --pre \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--find-links https://github.com/LMCache/LMCache/releases/expanded_assets/nightly \
--index-strategy unsafe-best-match
Release v0.4.4 · CUDA 13.0
CUDA 13.0 wheel for LMCache v0.4.4.
uv pip install lmcache==v0.4.4 \
--extra-index-url https://download.pytorch.org/whl/cu130 \
--find-links https://github.com/LMCache/LMCache/releases/expanded_assets/v0.4.4-cu13 \
--index-strategy unsafe-best-match
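After running any of the install commands above, it can help to confirm which build actually landed in the environment, since nightly and release wheels share the same package name. The following is a minimal sketch using only the Python standard library. It assumes the distribution is registered as `lmcache` (the name used in the install commands) and makes no assumption about the format of nightly version strings. The helper name `installed_version` is hypothetical and not part of LMCache.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "lmcache") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `dist_name` defaults to "lmcache", the name used in the install commands;
    this helper is illustrative and not part of LMCache itself.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("lmcache is not installed in this environment")
    else:
        print(f"lmcache {version}")
```

Reading the version from package metadata avoids importing the package itself, so the check also works in environments where the compiled extensions or a GPU are unavailable.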
v0.4.4
What's Changed
- Refactor remote plugin to accept multiply connector by @maobaolong in #2666
- [MP]feat: support different kv cache shape and dtype across layers by @liuyumoye in #2926
- [Chore][CI]: K3 base CI image 12.9 CUDA by @sammshen in #2975
- fix: use pin=False in _allocate_and_put to prevent pd_buffer leak by @ningziwen in #2847
- feat(disk): support multi-path local disk backend for multi-device I/O by @glimchb in #2801
- [Chore][CI] Upgrade CI base image to CUDA 13.0 by @sammshen in #2981
- [doc] document long-doc-permutator workload in cli bench by @deng451e in #2963
- [MP][Bugfix] Fix deadlock caused by cuda launch host func by @ApostaC in #2952
- [BugFix]: Fix typo bug by @princepride in #2980
- [CI] Pin cu128 nightly wheel for blend ci test by @deng451e in #2987
- [MP][optimize] optimize save when mla enabled by @chunxiaozheng in #2935
- [hotfix] fix prometheus version for UT failure by @ApostaC in #3000
- Update LMCache Office Hours to Wednesday by @nijaba in #2990
- [fix] Limit proxy in-flight requests to prevent PD buffer deadlock by @deng451e in #2957
- [MP] Lazy start heartbeat thread when first req coming by @maobaolong in #2943
- [Operator] Add L2 RESP (Redis/Valkey) adapter support by @royyhuang in #2967
- [Feat][RawBlock] Add TP>1 support and compact batched retrieval path by @DongDongJu in #2948
- [MP] Introduce a simple way to register_gauge metrics. by @maobaolong in #2906
- [Build] Add lmcache-cli lightweight wheel by @deng451e in #2959
- Copy a snapshot of lmcache_mp_connector.py for vllm 0.18.0 by @maobaolong in #2887
- [MP] Add a new argument to specify whether retain_in_l1 by @maobaolong in #2813
- [Chore][CI] Skip k3 builds when only docs/trivial files changed by @sammshen in #2993
- [ops][refactor] Add full list of Python fallbacks to run without compiled CUDA extensions by @hlin99 in #2591
- [Feat] L0 Subscriber by @Oasis-Git in #2974
- refactor: extract PathSharder module for shared multi-path selection by @glimchb in #2982
- refactor(mp): replace job_id with request_id in query_prefetch_status by @yoo-kumaneko in #2996
- [MP] Support lazy import built-in l2 adapter by @maobaolong in #2905
- [MP][Optimize] Skip locked keys during LRU eviction to improve eviction efficiency by @chunxiaozheng in #2978
- fix: add controller config validation and clear error messages (#2907) by @ianliuy in #3003
- feat: add chunk hashes logger to MP server for offline data analysis by @yoo-kumaneko in #2928
- [Chore][CI]: K3 MP output token quantity tolerance by @sammshen in #3030
- feat(tools): add LRU cache simulator for lookup-hash JSONL logs by @yoo-kumaneko in #3021
- [Feat] L1 Subscriber by @Oasis-Git in #2986
- [Feat] Add cache_salt parameter to MP adapter interfaces by @royyhuang in #3029
- [Feat] Add is_user_level property and cache_salt param to EvictionPolicy by @royyhuang in #3032
- [Feat][DAX] Optimize staged batched restore path and document modification by @DongDongJu in #2904
- [Chore] Remove v0 code by @sammshen in #2968
- [Chore] add coding standard and PR review instructions by @ApostaC in #3039
- [Observability] Per-request root OTel span and SpanRegistry for MP server tracing by @deng451e in #3033
- feat(pd_backend): add pd_skip_proxy_notification to skip ZMQ proxy notification by @ningziwen in #2874
- [Bugfix] fix some memory leak in cache_engine and eic connector by @liubj77 in #2544
- [Hotfix][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain by @sammshen in #3055
- [Hotfix][CI] Pin vLLM nightly to cu130 index to match CUDA 13 base image by @ApostaC in #3061
- [Docs] Mirror lmcache/ layout in docs/design/ for discoverability by @ApostaC in #3040
- Add scheduler instance_id and model_name to L0 KV lifecycle tracking by @Oasis-Git in #3043
- chore: expose package version via __init__.py by @hlin99 in #3034
- Fix: Safely handle layerwise cache shape dimensions in remote backend by @hlin99 in #2751
- [Core] Add persistence interfaces and nixl persistence by @YaoJiayi in #2938
- [Misc] Reduce the logs generated by lazy memory allocator by @ApostaC in #3068
- [MP][Feat] Add cache_salt to ObjectKey for cache isolation by @royyhuang in #3042
- [ROCm] Make bare-host ROCm install self-sufficient by @Shaoting-Feng in #3070
- [MP] Add tracing functionality for storage manager by @ApostaC in #3063
- [MP][optimize] unified touch all keys in end session request by @chunxiaozheng in #3020
- [step3] remove unnecessary code in mp adapter by @chunxiaozheng in #2994
- fix(mp): correct store cached requests in lmcache_mp_connector by @maobaolong in #3012
- [refactor]: Replace use_cufile with use_gds/gds_backend config flags by @glimchb in #2858
- [CI] Add cu13.0 wheel + container builds and nightly wheel releases by @deng451e in #3069
- [CI] Run the same test set on AMD as on NVIDIA by @Shaoting-Feng in #3071
- [ROCm][MP] Fix HIP invalid-argument on lazy host buffer past 2 GB by @Shaoting-Feng in #3079
- [CLI] Refactor query command by @deng451e in #2995
- [CI] add missing egress endpoints to nightly Docker build by @deng451e in #3087
- [Hotfix][CI] Fail-fast when vLLM CLI import chain is broken post-install by @sammshen in #3093
- [CLI][fix] lazy torch import in __init__.py to unblock CLI-only installs by @deng451e in #3086
- [CLI] Introduce lmcache trace CLI by @ApostaC in #3075
- [Chore][Docs]: daily drift check — multi-process mode by @ApostaC in #3076
- [Fix][CI] fix nightly wheel versioning and build reliability by @deng451e in #3097
- [Hotfix][CI] Replace vllm main.py patch with sitecustomize.py by @sammshen in #3100
- [CI] fix blend-server venv by @deng451e in #3099
- [MP] Introduce MP runtime plugin framework by @maobaolong in #2956
New Contributors
Full Changelog: v0.4.3...v0.4.4
v0.4.3
What's Changed
- [MP] fix: add thread safety to Session for concurrent TP worker access by @maobaolong in #2807
- [CLI] Implement initial framework of LMCache CLI by @KuntaiDu in #2775
- [MP][Observability][1/3] EventBus core infrastructure + OpenTelemetry dependency by @royyhuang in #2792
- [MP]: Support delay start heartbeat thread to avoid unhealthy while start vllm for a huge module warmup. by @maobaolong in #2798
- fix: add None check before stream synchronization by @hlin99 in #2810
- [Core] Add VRAM_SEG support for NIXL OBJ plugin by @jgoldsch12 in #2640
- [CI]: create fallback for flaky nightly index by @sammshen in #2809
- [CI]: add full tag selectively by @sammshen in #2820
- fix: replace global lock with per-device transfer_lock to prevent deadlock by @maobaolong in #2816
- Refactor KV cache shape/dtype extraction for robustness by @hlin99 in #2537
- Support non-contiguous alloc in MemoryAllocator by @chunxiaozheng in #2767
- [MP][Observability][2/3] Migrate L1 + SM to EventBus + OTel, remove old Prometheus pipeline by @royyhuang in #2794
- [MP][Bugfix] fixing race condition for zmq output notifier by @ApostaC in #2808
- [ci]: agent reviewer prompt engineering by @sammshen in #2800
- [refactor]: clean up the messy LMCacheManager by @sammshen in #2683
- [Platform]: Add Intel Gaudi (HPU) Support by @hlin99 in #2822
- [CLI] Implement lmcache describe kvcache subcommand by @royyhuang in #2825
- [MP][Feat] Query lookup-phase status for MP mode by @ApostaC in #2818
- Add Device-DAX (/dev/dax) storage backend for KV cache (follow-up to #2714) by @jayhpark530 in #2788
- [Temp CI Patch]: torch version for UT by @sammshen in #2856
- [CI] Add GitHub Action to auto-sync torch version with vLLM by @deng451e in #2796
- [MP][Feat] support worker-affinity in the MQ thread pool by @ApostaC in #2842
- Introduce native fs connector by @maobaolong in #2779
- [CLI] Implement lmcache ping subcommand by @Oasis-Git in #2859
- [MP] Fault Tolerance CI by @Oasis-Git in #2764
- feat: improve ValkeyConnector with cluster mode, TLS, and GLIDE optimizations by @omerrubi-amzn in #2790
- fix: auto-generate lmcache_instance_id when value is None by @can-sun in #2732
- [CI]: use job-level path filtering so skipped tests pass required checks by @royyhuang in #2855
- [MP] Print inference request id to help identify which vllm request the current log belongs to by @maobaolong in #2812
- [HW: XPU] Enable Layerwise XPU Connector by @slokesha in #2611
- [CLI] lmcache query engine subcommand by @deng451e in #2846
- [CLI]: Server command by @sammshen in #2836
- [LMCache CLI] Design and implementation of lmcache kvcache by @KuntaiDu in #2827
- [Bugfix]: Fix pin count balancing in PD Disaggregation mode by @lisiG9 in #2786
- [Core] [GDS] Improve GDS backend error handling and retry logic by @oferki in #2675
- [CLI][Doc] Edit the doc for LMCache CLI by @KuntaiDu in #2870
- Add hipFile support for AIS (AMD Infinity Storage) storage by @glimchb in #2799
- [CI]: Fix the LMCache random throughput being higher than native vllm by @sammshen in #2864
- [3/N][Feat]Persist metadata on device and fix raw-device benchmark setup by @DongDongJu in #2614
- [Core]: Support HND KV Format by @sammshen in #2826
- [Chore][Docs] Fix mp docs for store policy: skip_l1 by @ApostaC in #2869
- [MP][Core] Block id based kernel for MP mode by @ApostaC in #2838
- [CLI] update cli lmcache query engine by @deng451e in #2871
- [MP] Improve the stability for controllers and improve log clarity by @ApostaC in #2883
- [Chore][Docs] Stale MP CLI and Flags by @sammshen in #2882
- [Fix][Operator] Add privileged mode and nvidia runtime for GPU visibility by @royyhuang in #2749
- [Chore][CI]: chmod +x scripts in k3 test entrypoints by @sammshen in #2886
- feat(gds): add multipath KV-cache offloading support by @glimchb in #2817
- fix: add missing lock protection for LRU cache policy by @SYaoJun in #2860
- [MP][Observability][3/3] Migrate MP server telemetry to EventBus, unify config by @royyhuang in #2806
- [doc] update installation compatibility doc by @deng451e in #2868
- [Build] add SM120 for wheel build by @deng451e in #2873
- [1/2] L2 CI: End to End Performance by @Oasis-Git in #2884
- [fix] add missing request type in blend server by @deng451e in #2894
- type: Add missing return type annotations to storage backend methods by @SYaoJun in #2829
- [CLI] Implementation of lmcache bench engine by @ApostaC in #2889
- feat(gds): enable parallel I/O thread pool for all cuFile filesystems by @glimchb in #2802
- [DSA] support DSA in Mooncake connector by @chunxiaozheng in #2897
- [Core] Add L2 eviction in mp mode by @YaoJiayi in #2824
- [Bugfix] fix the invalid image path by @SYaoJun in #2899
- [Chore][CI] Split k3 multiprocess tests into parallel pipeline steps by @sammshen in #2914
- Support l2 adapter check and improve basic_check tool by @maobaolong in #2895
- [Chore][CI/Docs]: Switch all the documentation and CI over to lmache cli by @sammshen in #2917
- [CI] Add CI test for CB by @deng451e in #2900
- [2/2] L2 CI: Telemetry Test by @Oasis-Git in #2913
- [Core] Add eviction for CB by @YaoJiayi in #2893
- Refactor: Generalize utils.py for all devices by lifting the CUDA limitation by @hlin99 in #2848
- Add argument --prefetch-max-in-flight to fix hardcode by @maobaolong in #2789
- [MP] Refactor l2 plugin framework to support dynamic load third-party native l2 connector by @maobaolong in #2851
- fix: relax worker port count assertion by @can-sun in #2867
- [Bugfix]: patch save_decode_cache by @sammshen in #2929
- vllm block event by @Oasis-Git in #2930
- [Feat]: Add eviction to L2 Native Backend by @sammshen in #2939
- [Connector] Maru: zero-copy KV cache sharing via CXL shared memory by @jooho-XCENA in #2705
- [MP] Fix UT after merge #2851 by @maobaolong in #2931
- [Bugfix]: fix get_num_heads for MLA format by @sammshen in #2941
- [MP] Introduce l2 mooncake adapter by @maobaolong in #2911
- [CLI]Add long-doc-permutator CLI bench workload by @deng451e in #2937
- feat(gds): add gds_path_sharding config for multi-path strategy by @glimchb in #2922
- [Security][Remote Connector]: Add env var auth config for RESP by @sammshen in #2949
- Refactor: Align pd_buffer_size to chunk size in PD backend by @hlin99 in #2694
- [Chore] Add CODEOWNERS for automated PR review assignments by @sammshen in #2950
- [Chore][CI]: Change dst for K3 nightly comprehensive results by @sammshen in #2958
New Contributors
- @jgoldsch12 made their first contribution in #2640
- @jayhpark530 made their first contribution in #2788
- @omerrubi-amzn made their...
v0.4.2
What's Changed
- fix(l1_manager): propagate extra_count through prefetch path to prevent premature eviction by @liuyumoye in #2725
- [vllm adapter] num_lmcache_cached_tokens by @aeon-x in #2670
- [ci]: add gpu monitoring by @sammshen in #2718
- [CI][Hotfix][Chore] remove the repetitive definition of report_status by @ApostaC in #2745
- [Perf] [GDS] Performance improvements to GDS backend by @oferki in #2637
- Fault Tolerance Check by @Oasis-Git in #2692
- [Misc] Remove Hash from IPCCacheEngineKey by @Oasis-Git in #2700
- [MP][optimize] optimize evict in lru policy by @chunxiaozheng in #2740
- [RFC] Design of LMCache CLI by @KuntaiDu in #2748
- [MP][Bugfix] introducing new l1 listener to prevent re-storing prefetched object by @ApostaC in #2744
- Add filesystem-backed L2 adapter with auto-discovery plugin mechanism by @maobaolong in #2704
- fix(server): guard finish_read_prefetched behind retrieve_succeeded flag by @maobaolong in #2736
- Fix[config]: replace store_true with BooleanOptionalAction for --l1-use-lazy by @liuyumoye in #2761
- [Correctness]: Fix the overlapping race condition for non-MP as well by @sammshen in #2706
- [Southbound]: Create a Native Protocol for MP and non-MP by @sammshen in #2642
- [ci]: fix k3 comprehensive test nightly baseline retrieval by @sammshen in #2753
- [Perf] Add stream priority in gpu context by @YaoJiayi in #2728
- [Doc] Add doc for LMCache MP mode operator by @royyhuang in #2731
- [Docs][Operator] Fix observability metric descriptions by @royyhuang in #2746
- [MP][Feat] Support dedicated thread pool for MP callbacks by @ApostaC in #2763
- [MP][UX][L2] Support configuring L2 store/prefetch policy via command line by @ApostaC in #2773
- Fix regression: restore config validate() call in config.py by @hlin99 in #2690
- [MP] Support buffer only mode for MP mode by @maobaolong in #2760
- Plugin L2 Adapter Framework for MP Mode by @maobaolong in #2715
- [MP][Bugfix] fix free error when memory_objs is empty by @chunxiaozheng in #2768
- update torch version aligned with vllm by @deng451e in #2782
- Support database option at Valkey connector by @bluayer in #2307
- feat(kv_cache): enable asymmetric store/retrieve storages in PD backend by @hlin99 in #2509
Full Changelog: v0.4.1...v0.4.2
v0.4.1
v0.4.0
Major Milestones
v0.4.0 marks the maturation and shift in LMCache towards the new Multiprocess mode.
What's Changed
- [feat] add free_locks api to MP mode by @royyhuang in #2656
- [Add] L2 Prefetch Controller and StorageManager integration by @ApostaC in #2667
- K3 CI Refactor by @sammshen in #2663
- Tell agent to write documentations by @KuntaiDu in #2655
- Refactor new_block_ids handling for robustness by @hlin99 in #2536
- [CI] Fix mypy errors by @hickeyma in #2672
- fix(lmcache): fix KV cache hash inconsistency due to None in extra_keys by @JianDan0212 in #1897
- Augmenting contributing.md by @KuntaiDu in #2654
- Bump actions/download-artifact from 6.0.0 to 7.0.0 by @dependabot[bot] in #2397
- [MP][UX] Unified config + argparse for multiprocess mode by @ApostaC in #2695
- [CI]: 5 day maximum for Comprehensive Test flexibility by @sammshen in #2676
- [Correctness]: Avoid overwriting APC overlap by @sammshen in #2671
- [MP][Observability] Add telemetry subsystem for multiprocess mode by @ApostaC in #2696
- [1/N] Support NIXL-based L2 storage in MP mode by @YaoJiayi in #2664
- [Feat] LMCache MP mode k8s operator by @royyhuang in #2701
- [MP][Telemetry] Hot-fix to enable the telemetry logging for store by @ApostaC in #2707
- [Bugfix] Fix memory leak in asynchronous mode by @deng451e in #2559
- [Misc] Improve nixl perf in lmcache mp by @YaoJiayi in #2711
- [MP][Core] Update the workflow for lookup to avoid busy loop by @ApostaC in #2710
- Fix to support mla multiple tp failed to read issue by @maobaolong in #2697
- Refactor lookup client/server and abstract rpc layer. by @maobaolong in #2609
- [Core] Add blend_server_v2 by @YaoJiayi in #2677
- [Bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine by @liubj77 in #2516
- [Chore][Docs] Update docs for MP mode by @ApostaC in #2708
- [Misc] Fix failing unit test in blend server by @YaoJiayi in #2717
- [MP][Debuggability] Introduce status report subsystem for MP-mode by @ApostaC in #2699
- [MP][Hotfix] add default implementation for report_status by @ApostaC in #2723
- [MP] Support MP Server restart by @maobaolong in #2713
- Revert "[MP] Support MP Server restart (#2713)" by @ApostaC in #2729
- [MP][UX][Docs] Enhance http server and its docs for MP mode by @ApostaC in #2722
- [MP] Update the MP docs and pass telemetry config into http_server by @ApostaC in #2730
New Contributors
- @JianDan0212 made their first contribution in #1897
- @liubj77 made their first contribution in #2516
Full Changelog: v0.3.15...v0.4.0
Operator Nightly Latest (nightly-20260510-d945fbb)
Automated nightly operator build from dev branch.
Image: lmcache/lmcache-operator:nightly-20260510-d945fbb
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-nightly-latest/install.yaml
v0.3.15
What's Changed
- Introduce reset metrics api by @maobaolong in #2602
- Add req id to store/store_layer/retrieve/retrieve_layer log by @maobaolong in #2604
- Add an override inner field to support override extra config by @maobaolong in #2605
- Add a ut for basic check by @maobaolong in #2612
- [DOC] Introduce LMCache frontend document by @maobaolong in #2618
- [DOC] Complete the internal_api_server api document by @maobaolong in #2617
- Check failed put task count and record metrics by @maobaolong in #2439
- [UT] Add UT for utils.py by @maobaolong in #2615
- [Core] Add enum for EngineType by @hickeyma in #2555
- Add hot cache switch internal api by @maobaolong in #2620
- Add bundle of bypass backend internal apis by @maobaolong in #2619
- [Observability]: Fix vllm cached and prompt tokens by @sammshen in #2576
- [Bugfix] Fix layerwise wait_for_save concurrency crash with request-scoped storers by @DongDongJu in #2613
- Add lookup api to support dynamic recreate lookup client/server by @maobaolong in #2625
- refactor: read config values dynamically instead of caching in instance variables by @maobaolong in #2610
- Using shm to reduce memory copy while using remote connector by @maobaolong in #2601
- Add a backend api to support dynamic close&create backends by @maobaolong in #2622
- Support customize the bucket of histogram metrics by @maobaolong in #2627
- [Remote Connector]: cpp multi-threaded RESP by @sammshen in #2541
- [2/N][Feat] Add zero-copy aligned buffer odirect by @DongDongJu in #2573
- [1/4] Bitmap for L2 storage in MP mode by @ApostaC in #2563
- [2/4] L2Adapter interface and implementation of MockL2Adapter by @ApostaC in #2569
- [Misc] Adpot the new token matching solution by @ApostaC in #2599
- [MP] Protocol with Single Key by @Oasis-Git in #2584
- [feat] add observability stack to MP mode by @royyhuang in #2638
- [Chore][Admin] Create initial AGENTS.md by @ApostaC in #2649
- [MP] Health Check by @Oasis-Git in #2645
- [Observability] Relocate MP observability to lmcache/v1/mp_observability by @ApostaC in #2657
- [CI][Temp fix] make threshold a soft fail for multiprocessing test by @ApostaC in #2661
- Refactor PrometheusController into global singleton with self-registration by @ApostaC in #2659
- [3/4][MP] L2 Store controller for MP mode by @ApostaC in #2646
- [4/4][MP] L2 Prefetch controller foundation by @ApostaC in #2658
- [MP] Enable layout desc in MP lookup and prefetch by @ApostaC in #2662
- Add Pythonhashseed in quickstart example by @jmkuebler in #2597
- Fixes #2556: Assertion when remote backend is enabled without local CPU backend by @hlin99 in #2557
New Contributors
- @jmkuebler made their first contribution in #2597
Full Changelog: v0.3.14...v0.3.15