Releases · LMCache/LMCache

Nightly CUDA 13.0 wheels built from dev on 2026-05-09.

uv pip install lmcache --pre \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --find-links https://github.com/LMCache/LMCache/releases/expanded_assets/nightly-cu13 \
  --index-strategy unsafe-best-match

Nightly CUDA 12.9 wheels built from dev on 2026-05-09.

uv pip install lmcache --pre \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --find-links https://github.com/LMCache/LMCache/releases/expanded_assets/nightly \
  --index-strategy unsafe-best-match

CUDA 13.0 wheel for LMCache v0.4.4.

uv pip install lmcache==v0.4.4 \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --find-links https://github.com/LMCache/LMCache/releases/expanded_assets/v0.4.4-cu13 \
  --index-strategy unsafe-best-match
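
After running one of the install commands above, it can help to confirm which build actually landed in the environment. The following is a minimal sketch that uses only the standard library's importlib.metadata. The helper name installed_version is hypothetical and not part of LMCache.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "lmcache") -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("lmcache is not installed in this environment")
    else:
        # Print whatever version string the installed wheel reports.
        print(f"lmcache {version}")
```

Run it with the same interpreter that uv installed into, otherwise the lookup inspects a different environment.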

@maobaolong

What's Changed

Refactor remote plugin to accept multiple connectors by @maobaolong in #2666
[MP]feat: support different kv cache shape and dtype across layers by @liuyumoye in #2926
[Chore][CI]: K3 base CI image 12.9 CUDA by @sammshen in #2975
fix: use pin=False in _allocate_and_put to prevent pd_buffer leak by @ningziwen in #2847
feat(disk): support multi-path local disk backend for multi-device I/O by @glimchb in #2801
[Chore][CI] Upgrade CI base image to CUDA 13.0 by @sammshen in #2981
[doc] document long-doc-permutator workload in cli bench by @deng451e in #2963
[MP][Bugfix] Fix deadlock caused by cuda launch host func by @ApostaC in #2952
[BugFix]: Fix typo bug by @princepride in #2980
[CI] Pin cu128 nightly wheel for blend ci test by @deng451e in #2987
[MP][optimize] optimize save when mla enabled by @chunxiaozheng in #2935
[hotfix] fix prometheus version for UT failure by @ApostaC in #3000
Update LMCache Office Hours to Wednesday by @nijaba in #2990
[fix] Limit proxy in-flight requests to prevent PD buffer deadlock by @deng451e in #2957
[MP] Lazy start heartbeat thread when first req coming by @maobaolong in #2943
[Operator] Add L2 RESP (Redis/Valkey) adapter support by @royyhuang in #2967
[Feat][RawBlock] Add TP>1 support and compact batched retrieval path by @DongDongJu in #2948
[MP] Introduce a simple way to register_gauge metrics. by @maobaolong in #2906
[Build] Add lmcache-cli lightweight wheel by @deng451e in #2959
Copy a snapshot of lmcache_mp_connector.py for vllm 0.18.0 by @maobaolong in #2887
[MP] Add a new argument to specify whether retain_in_l1 by @maobaolong in #2813
[Chore][CI] Skip k3 builds when only docs/trivial files changed by @sammshen in #2993
[ops][refactor] Add full list of Python fallbacks to run without compiled CUDA extensions by @hlin99 in #2591
[Feat] L0 Subscriber by @Oasis-Git in #2974
refactor: extract PathSharder module for shared multi-path selection by @glimchb in #2982
refactor(mp): replace job_id with request_id in query_prefetch_status by @yoo-kumaneko in #2996
[MP] Support lazy import built-in l2 adapter by @maobaolong in #2905
[MP][Optimize] Skip locked keys during LRU eviction to improve eviction efficiency by @chunxiaozheng in #2978
fix: add controller config validation and clear error messages (#2907) by @ianliuy in #3003
feat: add chunk hashes logger to MP server for offline data analysis by @yoo-kumaneko in #2928
[Chore][CI]: K3 MP output token quantity tolerance by @sammshen in #3030
feat(tools): add LRU cache simulator for lookup-hash JSONL logs by @yoo-kumaneko in #3021
[Feat] L1 Subscriber by @Oasis-Git in #2986
[Feat] Add cache_salt parameter to MP adapter interfaces by @royyhuang in #3029
[Feat] Add is_user_level property and cache_salt param to EvictionPolicy by @royyhuang in #3032
[Feat][DAX] Optimize staged batched restore path and document modification by @DongDongJu in #2904
[Chore] Remove v0 code by @sammshen in #2968
[Chore] add coding standard and PR review instructions by @ApostaC in #3039
[Observability] Per-request root OTel span and SpanRegistry for MP server tracing by @deng451e in #3033
feat(pd_backend): add pd_skip_proxy_notification to skip ZMQ proxy notification by @ningziwen in #2874
[Bugfix] fix some memory leak in cache_engine and eic connector by @liubj77 in #2544
[Hotfix][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain by @sammshen in #3055
[Hotfix][CI] Pin vLLM nightly to cu130 index to match CUDA 13 base image by @ApostaC in #3061
[Docs] Mirror lmcache/ layout in docs/design/ for discoverability by @ApostaC in #3040
Add scheduler instance_id and model_name to L0 KV lifecycle tracking by @Oasis-Git in #3043
chore: expose package version via __init__.py by @hlin99 in #3034
Fix: Safely handle layerwise cache shape dimensions in remote backend by @hlin99 in #2751
[Core] Add persistence interfaces and nixl persistence by @YaoJiayi in #2938
[Misc] Reduce the logs generated by lazy memory allocator by @ApostaC in #3068
[MP][Feat] Add cache_salt to ObjectKey for cache isolation by @royyhuang in #3042
[ROCm] Make bare-host ROCm install self-sufficient by @Shaoting-Feng in #3070
[MP] Add tracing functionality for storage manager by @ApostaC in #3063
[MP][optimize] unified touch all keys in end session request by @chunxiaozheng in #3020
[step3] remove unnecessary code in mp adapter by @chunxiaozheng in #2994
fix(mp): correct store cached requests in lmcache_mp_connector by @maobaolong in #3012
[refactor]: Replace use_cufile with use_gds/gds_backend config flags by @glimchb in #2858
[CI] Add cu13.0 wheel + container builds and nightly wheel releases by @deng451e in #3069
[CI] Run the same test set on AMD as on NVIDIA by @Shaoting-Feng in #3071
[ROCm][MP] Fix HIP invalid-argument on lazy host buffer past 2 GB by @Shaoting-Feng in #3079
[CLI] Refactor query command by @deng451e in #2995
[CI] add missing egress endpoints to nightly Docker build by @deng451e in #3087
[Hotfix][CI] Fail-fast when vLLM CLI import chain is broken post-install by @sammshen in #3093
[CLI][fix] lazy torch import in __init__.py to unblock CLI-only installs by @deng451e in #3086
[CLI] Introduce lmcache trace CLI by @ApostaC in #3075
[Chore][Docs]: daily drift check — multi-process mode by @ApostaC in #3076
[Fix][CI] fix nightly wheel versioning and build reliability by @deng451e in #3097
[Hotfix][CI] Replace vllm main.py patch with sitecustomize.py by @sammshen in #3100
[CI] fix blend-server venv by @deng451e in #3099
[MP] Introduce MP runtime plugin framework by @maobaolong in #2956
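
One entry in the list above, #2978, skips locked keys during LRU eviction. The release notes do not describe how this is implemented, so the following is only an illustrative sketch of the general idea. The class LRUWithLocks and all of its methods are hypothetical and unrelated to LMCache's actual code.

```python
from collections import OrderedDict
from typing import Hashable, List, Set


class LRUWithLocks:
    """Toy LRU index whose eviction pass steps over locked keys."""

    def __init__(self) -> None:
        # Insertion order doubles as recency order: oldest key first.
        self._order: "OrderedDict[Hashable, None]" = OrderedDict()
        self._locked: Set[Hashable] = set()

    def touch(self, key: Hashable) -> None:
        """Insert the key or mark it as most recently used."""
        self._order[key] = None
        self._order.move_to_end(key)

    def lock(self, key: Hashable) -> None:
        self._locked.add(key)

    def unlock(self, key: Hashable) -> None:
        self._locked.discard(key)

    def evict(self, count: int) -> List[Hashable]:
        """Evict up to `count` unlocked keys, least recently used first."""
        victims: List[Hashable] = []
        for key in self._order:
            if len(victims) == count:
                break
            if key in self._locked:
                continue  # skip the locked key instead of stopping the scan
            victims.append(key)
        for key in victims:
            del self._order[key]
        return victims

    def keys(self) -> List[Hashable]:
        return list(self._order)
```

In this sketch a locked key at the cold end of the list no longer blocks the pass: eviction continues with the next unlocked candidate.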

New Contributors

@ianliuy made their first contribution in #3003

Full Changelog: v0.4.3...v0.4.4

@maobaolong

What's Changed

[MP] fix: add thread safety to Session for concurrent TP worker access by @maobaolong in #2807
[CLI] Implement initial framework of LMCache CLI by @KuntaiDu in #2775
[MP][Observability][1/3] EventBus core infrastructure + OpenTelemetry dependency by @royyhuang in #2792
[MP]: Support delay start heartbeat thread to avoid unhealthy while start vllm for a huge module warmup. by @maobaolong in #2798
fix: add None check before stream synchronization by @hlin99 in #2810
[Core] Add VRAM_SEG support for NIXL OBJ plugin by @jgoldsch12 in #2640
[CI]: create fallback for flaky nightly index by @sammshen in #2809
[CI]: add full tag selectively by @sammshen in #2820
fix: replace global lock with per-device transfer_lock to prevent deadlock by @maobaolong in #2816
Refactor KV cache shape/dtype extraction for robustness by @hlin99 in #2537
Support non-contiguous alloc in MemoryAllocator by @chunxiaozheng in #2767
[MP][Observability][2/3] Migrate L1 + SM to EventBus + OTel, remove old Prometheus pipeline by @royyhuang in #2794
[MP][Bugfix] fixing race condition for zmq output notifier by @ApostaC in #2808
[ci]: agent reviewer prompt engineering by @sammshen in #2800
[refactor]: clean up the messy LMCacheManager by @sammshen in #2683
[Platform]: Add Intel Gaudi (HPU) Support by @hlin99 in #2822
[CLI] Implement lmcache describe kvcache subcommand by @royyhuang in #2825
[MP][Feat] Query lookup-phase status for MP mode by @ApostaC in #2818
Add Device-DAX (/dev/dax) storage backend for KV cache (follow-up to #2714) by @jayhpark530 in #2788
[Temp CI Patch]: torch version for UT by @sammshen in #2856
[CI] Add GitHub Action to auto-sync torch version with vLLM by @deng451e in #2796
[MP][Feat] support worker-affinity in the MQ thread pool by @ApostaC in #2842
Introduce native fs connector by @maobaolong in #2779
[CLI] Implement lmcache ping subcommand by @Oasis-Git in #2859
[MP] Fault Tolerance CI by @Oasis-Git in #2764
feat: improve ValkeyConnector with cluster mode, TLS, and GLIDE optimizations by @omerrubi-amzn in #2790
fix: auto-generate lmcache_instance_id when value is None by @can-sun in #2732
[CI]: use job-level path filtering so skipped tests pass required checks by @royyhuang in #2855
[MP] Print inference request id to help identify which vllm request the current log belongs to by @maobaolong in #2812
[HW: XPU] Enable Layerwise XPU Connector by @slokesha in #2611
[CLI] lmcache query engine subcommand by @deng451e in #2846
[CLI]: Server command by @sammshen in #2836
[LMCache CLI] Design and implementation of lmcache kvcache by @KuntaiDu in #2827
[Bugfix]: Fix pin count balancing in PD Disaggregation mode by @lisiG9 in #2786
[Core] [GDS] Improve GDS backend error handling and retry logic by @oferki in #2675
[CLI][Doc] Edit the doc for LMCache CLI by @KuntaiDu in #2870
Add hipFile support for AIS (AMD Infinity Storage) storage by @glimchb in #2799
[CI]: Fix the LMCache random throughput being higher than native vllm by @sammshen in #2864
[3/N][Feat]Persist metadata on device and fix raw-device benchmark setup by @DongDongJu in #2614
[Core]: Support HND KV Format by @sammshen in #2826
[Chore][Docs] Fix mp docs for store policy: skip_l1 by @ApostaC in #2869
[MP][Core] Block id based kernel for MP mode by @ApostaC in #2838
[CLI] update cli lmcache query engine by @deng451e in #2871
[MP] Improve the stability for controllers and improve log clarity by @ApostaC in #2883
[Chore][Docs] Stale MP CLI and Flags by @sammshen in #2882
[Fix][Operator] Add privileged mode and nvidia runtime for GPU visibility by @royyhuang in #2749
[Chore][CI]: chmod +x scripts in k3 test entrypoints by @sammshen in #2886
feat(gds): add multipath KV-cache offloading support by @glimchb in #2817
fix: add missing lock protection for LRU cache policy by @SYaoJun in #2860
[MP][Observability][3/3] Migrate MP server telemetry to EventBus, unify config by @royyhuang in #2806
[doc] update installation compatibility doc by @deng451e in #2868
[Build] add SM120 for wheel build by @deng451e in #2873
[1/2] L2 CI: End to End Performance by @Oasis-Git in #2884
[fix] add missing request type in blend server by @deng451e in #2894
type: Add missing return type annotations to storage backend methods by @SYaoJun in #2829
[CLI] Implementation of lmcache bench engine by @ApostaC in #2889
feat(gds): enable parallel I/O thread pool for all cuFile filesystems by @glimchb in #2802
[DSA] support DSA in Mooncake connector by @chunxiaozheng in #2897
[Core] Add L2 eviction in mp mode by @YaoJiayi in #2824
[Bugfix] fix the invalid image path by @SYaoJun in #2899
[Chore][CI] Split k3 multiprocess tests into parallel pipeline steps by @sammshen in #2914
Support l2 adapter check and improve basic_check tool by @maobaolong in #2895
[Chore][CI/Docs]: Switch all the documentation and CI over to lmcache cli by @sammshen in #2917
[CI] Add CI test for CB by @deng451e in #2900
[2/2] L2 CI: Telemetry Test by @Oasis-Git in #2913
[Core] Add eviction for CB by @YaoJiayi in #2893
Refactor: Generalize utils.py for all devices by lifting the CUDA limitation by @hlin99 in #2848
Add argument --prefetch-max-in-flight to fix hardcode by @maobaolong in #2789
[MP] Refactor l2 plugin framework to support dynamic load third-party native l2 connector by @maobaolong in #2851
fix: relax worker port count assertion by @can-sun in #2867
[Bugfix]: patch save_decode_cache by @sammshen in #2929
vllm block event by @Oasis-Git in #2930
[Feat]: Add eviction to L2 Native Backend by @sammshen in #2939
[Connector] Maru: zero-copy KV cache sharing via CXL shared memory by @jooho-XCENA in #2705
[MP] Fix UT after merge #2851 by @maobaolong in #2931
[Bugfix]: fix get_num_heads for MLA format by @sammshen in #2941
[MP] Introduce l2 mooncake adapter by @maobaolong in #2911
[CLI]Add long-doc-permutator CLI bench workload by @deng451e in #2937
feat(gds): add gds_path_sharding config for multi-path strategy by @glimchb in #2922
[Security][Remote Connector]: Add env var auth config for RESP by @sammshen in #2949
Refactor: Align pd_buffer_size to chunk size in PD backend by @hlin99 in #2694
[Chore] Add CODEOWNERS for automated PR review assignments by @sammshen in #2950
[Chore][CI]: Change dst for K3 nightly comprehensive results by @sammshen in #2958

New Contributors

@jgoldsch12 made their first contribution in #2640
@jayhpark530 made their first contribution in #2788
@omerrubi-amzn made their...

@liuyumoye

What's Changed

fix(l1_manager): propagate extra_count through prefetch path to prevent premature eviction by @liuyumoye in #2725
[vllm adapter] num_lmcache_cached_tokens by @aeon-x in #2670
[ci]: add gpu monitoring by @sammshen in #2718
[CI][Hotfix][Chore] remove the repetitive definition of report_status by @ApostaC in #2745
[Perf] [GDS] Performance improvements to GDS backend by @oferki in #2637
Fault Tolerance Check by @Oasis-Git in #2692
[Misc] Remove Hash from IPCCacheEngineKey by @Oasis-Git in #2700
[MP][optimize] optimize evict in lru policy by @chunxiaozheng in #2740
[RFC] Design of LMCache CLI by @KuntaiDu in #2748
[MP][Bugfix] introducing new l1 listener to prevent re-storing prefetched object by @ApostaC in #2744
Add filesystem-backed L2 adapter with auto-discovery plugin mechanism by @maobaolong in #2704
fix(server): guard finish_read_prefetched behind retrieve_succeeded flag by @maobaolong in #2736
Fix[config]: replace store_true with BooleanOptionalAction for --l1-use-lazy by @liuyumoye in #2761
[Correctness]: Fix the overlapping race condition for non-MP as well by @sammshen in #2706
[Southbound]: Create a Native Protocol for MP and non-MP by @sammshen in #2642
[ci]: fix k3 comprehensive test nightly baseline retrieval by @sammshen in #2753
[Perf] Add stream priority in gpu context by @YaoJiayi in #2728
[Doc] Add doc for LMCache MP mode operator by @royyhuang in #2731
[Docs][Operator] Fix observability metric descriptions by @royyhuang in #2746
[MP][Feat] Support dedicated thread pool for MP callbacks by @ApostaC in #2763
[MP][UX][L2] Support configuring L2 store/prefetch policy via command line by @ApostaC in #2773
Fix regression: restore config validate() call in config.py by @hlin99 in #2690
[MP] Support buffer only mode for MP mode by @maobaolong in #2760
Plugin L2 Adapter Framework for MP Mode by @maobaolong in #2715
[MP][Bugfix] fix free error when memory_objs is empty by @chunxiaozheng in #2768
update torch version aligned with vllm by @deng451e in #2782
Support database option at Valkey connector by @bluayer in #2307
feat(kv_cache): enable asymmetric store/retrieve storages in PD backend by @hlin99 in #2509
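
Entry #2761 above replaces store_true with BooleanOptionalAction for --l1-use-lazy. A minimal sketch of what that argparse action provides is shown here (Python 3.9 or newer). The default of True and the parser name are assumptions made for illustration, not values taken from LMCache.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # BooleanOptionalAction registers both --l1-use-lazy and --no-l1-use-lazy,
    # so a flag that defaults to True can still be switched off explicitly.
    parser.add_argument(
        "--l1-use-lazy",
        action=argparse.BooleanOptionalAction,
        default=True,  # assumed default, for illustration only
        help="toggle lazy allocation (illustrative flag)",
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    print(parser.parse_args([]).l1_use_lazy)
    print(parser.parse_args(["--no-l1-use-lazy"]).l1_use_lazy)
```

With store_true, a flag whose default is already True has no command-line way to become False, which is the usual reason for this kind of change.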

Full Changelog: v0.4.1...v0.4.2

@ApostaC

What's Changed

[MP][Bugfix] fix vllm-side lookup logical issue and cuda stream deadlock problem by @ApostaC in #2733

Full Changelog: v0.4.0...v0.4.1

@royyhuang

Major Milestones

v0.4.0 marks LMCache's maturation and its shift toward the new Multiprocess (MP) mode.

What's Changed

[feat] add free_locks api to MP mode by @royyhuang in #2656
[Add] L2 Prefetch Controller and StorageManager integration by @ApostaC in #2667
K3 CI Refactor by @sammshen in #2663
Tell agent to write documentations by @KuntaiDu in #2655
Refactor new_block_ids handling for robustness by @hlin99 in #2536
[CI] Fix mypy errors by @hickeyma in #2672
fix(lmcache): fix KV cache hash inconsistency due to None in extra_keys by @JianDan0212 in #1897
Augmenting contributing.md by @KuntaiDu in #2654
Bump actions/download-artifact from 6.0.0 to 7.0.0 by @dependabot[bot] in #2397
[MP][UX] Unified config + argparse for multiprocess mode by @ApostaC in #2695
[CI]: 5 day maximum for Comprehensive Test flexibility by @sammshen in #2676
[Correctness]: Avoid overwriting APC overlap by @sammshen in #2671
[MP][Observability] Add telemetry subsystem for multiprocess mode by @ApostaC in #2696
[1/N] Support NIXL-based L2 storage in MP mode by @YaoJiayi in #2664
[Feat] LMCache MP mode k8s operator by @royyhuang in #2701
[MP][Telemetry] Hot-fix to enable the telemetry logging for store by @ApostaC in #2707
[Bugfix] Fix memory leak in asynchronous mode by @deng451e in #2559
[Misc] Improve nixl perf in lmcache mp by @YaoJiayi in #2711
[MP][Core] Update the workflow for lookup to avoid busy loop by @ApostaC in #2710
Fix to support mla multiple tp failed to read issue by @maobaolong in #2697
Refactor lookup client/server and abstract rpc layer. by @maobaolong in #2609
[Core] Add blend_server_v2 by @YaoJiayi in #2677
[Bugfix] fix crash in wait_for_save when retrieve fail from lmcache_engine by @liubj77 in #2516
[Chore][Docs] Update docs for MP mode by @ApostaC in #2708
[Misc] Fix failing unit test in blend server by @YaoJiayi in #2717
[MP][Debuggability] Introduce status report subsystem for MP-mode by @ApostaC in #2699
[MP][Hotfix] add default implementation for report_status by @ApostaC in #2723
[MP] Support MP Server restart by @maobaolong in #2713
Revert "[MP] Support MP Server restart (#2713)" by @ApostaC in #2729
[MP][UX][Docs] Enhance http server and its docs for MP mode by @ApostaC in #2722
[MP] Update the MP docs and pass telemetry config into http_server by @ApostaC in #2730

New Contributors

@JianDan0212 made their first contribution in #1897
@liubj77 made their first contribution in #2516

Full Changelog: v0.3.15...v0.4.0

Automated nightly operator build from dev branch.

Image: lmcache/lmcache-operator:nightly-20260510-d945fbb

kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-nightly-latest/install.yaml

@maobaolong

What's Changed

Introduce reset metrics api by @maobaolong in #2602
Add req id to store/store_layer/retrieve/retrieve_layer log by @maobaolong in #2604
Add an override inner field to support override extra config by @maobaolong in #2605
Add a ut for basic check by @maobaolong in #2612
[DOC] Introduce LMCache frontend document by @maobaolong in #2618
[DOC] Complete the internal_api_server api document by @maobaolong in #2617
Check failed put task count and record metrics by @maobaolong in #2439
[UT] Add UT for utils.py by @maobaolong in #2615
[Core] Add enum for EngineType by @hickeyma in #2555
Add hot cache switch internal api by @maobaolong in #2620
Add bundle of bypass backend internal apis by @maobaolong in #2619
[Observability]: Fix vllm cached and prompt tokens by @sammshen in #2576
[Bugfix] Fix layerwise wait_for_save concurrency crash with request-scoped storers by @DongDongJu in #2613
Add lookup api to support dynamic recreate lookup client/server by @maobaolong in #2625
refactor: read config values dynamically instead of caching in instance variables by @maobaolong in #2610
Using shm to reduce memory copy while using remote connector by @maobaolong in #2601
Add a backend api to support dynamic close&create backends by @maobaolong in #2622
Support customize the bucket of histogram metrics by @maobaolong in #2627
[Remote Connector]: cpp multi-threaded RESP by @sammshen in #2541
[2/N][Feat] Add zero-copy aligned buffer odirect by @DongDongJu in #2573
[1/4] Bitmap for L2 storage in MP mode by @ApostaC in #2563
[2/4] L2Adapter interface and implementation of MockL2Adapter by @ApostaC in #2569
[Misc] Adopt the new token matching solution by @ApostaC in #2599
[MP] Protocol with Single Key by @Oasis-Git in #2584
[feat] add observability stack to MP mode by @royyhuang in #2638
[Chore][Admin] Create initial AGENTS.md by @ApostaC in #2649
[MP] Health Check by @Oasis-Git in #2645
[Observability] Relocate MP observability to lmcache/v1/mp_observability by @ApostaC in #2657
[CI][Temp fix] make threshold a soft fail for multiprocessing test by @ApostaC in #2661
Refactor PrometheusController into global singleton with self-registration by @ApostaC in #2659
[3/4][MP] L2 Store controller for MP mode by @ApostaC in #2646
[4/4][MP] L2 Prefetch controller foundation by @ApostaC in #2658
[MP] Enable layout desc in MP lookup and prefetch by @ApostaC in #2662
Add Pythonhashseed in quickstart example by @jmkuebler in #2597
Fixes #2556: Assertion when remote backend is enabled without local CPU backend by @hlin99 in #2557
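
Entry #2597 above adds PYTHONHASHSEED to the quickstart example. The notes do not state the motivation. What the variable itself does is well defined: Python randomizes str hashes per process unless PYTHONHASHSEED is fixed. The sketch below demonstrates that with a hypothetical helper, string_hash_in_fresh_process, which is not LMCache code.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    first = string_hash_in_fresh_process("prefix-chunk", "0")
    second = string_hash_in_fresh_process("prefix-chunk", "0")
    # With a fixed seed, separate processes agree on the hash value.
    print(first == second)
```

Passing "random" as the seed restores the default per-process randomization, so two runs would normally disagree.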

New Contributors

@jmkuebler made their first contribution in #2597

Full Changelog: v0.3.14...v0.3.15
