This document provides a high-level introduction to LMCache, explaining its role in the LLM inference stack, core architectural components, and operational principles. It is intended to orient new users and developers to the system before diving into specific subsystems.
For detailed information on specific topics, see the dedicated subsystem pages.
Sources: README.md 33-51, docs/source/index.rst 36-43
LMCache is an LLM serving engine extension designed to reduce TTFT (Time-to-First-Token) and increase throughput, particularly in long-context scenarios. It achieves this by storing and reusing the Key-Value (KV) caches of reusable text segments across the datacenter, spanning GPU, CPU, disk, and remote storage such as S3 or Redis (README.md 33-35).
Unlike traditional prefix caching, LMCache can reuse the KV caches of any repeated text segment (not necessarily a prefix) in any serving engine instance (README.md 35-37). This saves precious GPU cycles that would otherwise be spent on redundant prefill computation.
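To see why segment-level reuse can beat prefix-only caching, consider a toy sketch (the helper names and the fixed-size matching scheme below are illustrative assumptions, not the LMCache API or its actual matching algorithm): with prefix caching, a single changed token at the start invalidates everything after it, while chunk-level lookup still recovers matching segments later in the sequence.

```python
def prefix_hit_len(cached: list[int], new: list[int]) -> int:
    """Tokens reusable under prefix-only caching: stop at the first mismatch."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

def chunk_hits(cached: list[int], new: list[int], chunk: int = 4) -> int:
    """Tokens reusable under chunk-level caching: any matching chunk counts."""
    cached_chunks = {tuple(cached[i:i + chunk]) for i in range(0, len(cached), chunk)}
    hits = 0
    for i in range(0, len(new) - chunk + 1, chunk):
        if tuple(new[i:i + chunk]) in cached_chunks:
            hits += chunk
    return hits

cached = list(range(16))           # a previously served token sequence
new = [99] + list(range(1, 16))    # same text, but the very first token differs
print(prefix_hit_len(cached, new))  # → 0: prefix caching reuses nothing
print(chunk_hits(cached, new))      # → 12: three of four chunks still match
```

The larger the shared-but-not-prefix content (system prompts, retrieved documents, few-shot examples), the bigger this gap becomes in practice.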
Key Features:
- P2P KV cache sharing across instances (P2PBackend, lmcache/v1/config.py 126-141)
- Prefill-decode disaggregation (PDBackend, lmcache/v1/config.py 189-213)
- Multi-tier storage management (StorageManager, lmcache/v1/cache_engine.py 166-167)

Sources: README.md 33-40, README.md 55-63, docs/source/index.rst 38-43, lmcache/v1/config.py 126-213, lmcache/integration/vllm/vllm_v1_adapter.py 188-202
The following diagram illustrates how LMCache components bridge the gap between high-level serving engines and low-level storage/GPU operations.
LMCache Component Architecture
Sources: lmcache/integration/vllm/vllm_v1_adapter.py 12-45, lmcache/v1/cache_engine.py 78-103, lmcache/v1/token_database.py 38-47, docs/source/getting_started/installation.rst 109-123
The LMCacheEngine is the primary orchestrator (lmcache/v1/cache_engine.py 78-93). It manages the lifecycle of KV caches by coordinating between token processing, storage management, and GPU memory transfers, and it handles the conversion of GPU KV caches into MemoryObj instances for storage (lmcache/v1/cache_engine.py 81-84).
The TokenDatabase abstract class and its implementations (ChunkedTokenDatabase, SegmentTokenDatabase) manage the relationship between token sequences and cache keys (lmcache/v1/token_database.py 38-47).
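The core idea behind ChunkedTokenDatabase can be sketched as deriving one cache key per fixed-size chunk of tokens. The hashing scheme below is an assumption for illustration only; the real logic lives in lmcache/v1/token_database.py.

```python
import hashlib

CHUNK_SIZE = 256  # mirrors LMCache's default chunk_size

def chunk_keys(tokens: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Derive one cache key per fixed-size chunk (illustrative sketch).

    Each key chains the previous chunk's hash with the current chunk's
    tokens, so a key identifies a chunk *in context*: two sequences that
    share a prefix of whole chunks share those chunks' keys.
    """
    keys = []
    prefix_hash = ""
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        payload = (prefix_hash + ",".join(map(str, chunk))).encode()
        prefix_hash = hashlib.sha256(payload).hexdigest()
        keys.append(prefix_hash)
    return keys

keys = chunk_keys(list(range(600)))
print(len(keys))  # → 3: chunks of 256 + 256 + 88 tokens
```

Because keys are computed per chunk, a lookup can hit on some chunks of a sequence and miss on others, which is what makes partial, modular reuse possible.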
Token sequences are split into fixed-size chunks (default chunk_size=256) to enable modular reuse (lmcache/v1/config.py 64).

The StorageManager implements a multi-tier storage strategy (lmcache/v1/cache_engine.py 166-167). It searches for KV caches across a hierarchy defined in LMCacheEngineConfig (lmcache/v1/config.py 65-83):
1. CPU memory (enabled via local_cpu).
2. Local disk (enabled via local_disk).
3. Remote storage (configured via remote_url).

Sources: lmcache/v1/cache_engine.py 78-93, lmcache/v1/token_database.py 38-47, lmcache/v1/config.py 64-83
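The tier-ordered search can be sketched as follows. This is not the actual StorageManager; the tier names simply mirror the config fields, and the hit-promotion step is a common caching pattern assumed here for illustration.

```python
from collections import OrderedDict

class TieredLookup:
    """Minimal sketch of a fast-to-slow, multi-tier KV cache lookup."""

    def __init__(self):
        # Ordered fast to slow: CPU RAM, local disk, remote (e.g. Redis/S3).
        self.tiers = OrderedDict(local_cpu={}, local_disk={}, remote={})

    def get(self, key):
        """Search tiers in order; copy hits up into the CPU tier."""
        for name, store in self.tiers.items():
            if key in store:
                value = store[key]
                self.tiers["local_cpu"][key] = value  # keep hot entries fast
                return name, value
        return None, None

lookup = TieredLookup()
lookup.tiers["local_disk"]["chunk-42"] = b"kv-bytes"
tier, _ = lookup.get("chunk-42")
print(tier)  # → local_disk (first access comes from the slower tier)
tier, _ = lookup.get("chunk-42")
print(tier)  # → local_cpu (subsequent accesses hit the fast tier)
```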
LMCache is designed to integrate seamlessly with major LLM serving engines:
LMCache is integrated with vLLM v1 using the KVConnectorBase_V1 interface (lmcache/integration/vllm/vllm_v1_adapter.py 12-16). It tracks requests using a RequestTracker to calculate a LoadSpec and SaveSpec (lmcache/integration/vllm/vllm_v1_adapter.py 61-106), which allows vLLM to automatically offload and retrieve KV caches during the request lifecycle.
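The load/save bookkeeping can be pictured with a simplified sketch. The dataclasses and the plan_request helper below are hypothetical stand-ins; the real RequestTracker, LoadSpec, and SaveSpec are defined in lmcache/integration/vllm/vllm_v1_adapter.py.

```python
from dataclasses import dataclass

@dataclass
class LoadSpec:
    vllm_cached_tokens: int      # tokens vLLM already holds in its own KV cache
    lmcache_cached_tokens: int   # tokens LMCache can supply externally

@dataclass
class SaveSpec:
    skip_leading_tokens: int     # already stored; no need to save again
    num_tokens_to_save: int      # freshly prefilled tokens to offload

def plan_request(prompt_len: int, vllm_hit: int, lmcache_hit: int):
    """Decide what to load from LMCache and what to save back after prefill."""
    load = LoadSpec(vllm_cached_tokens=vllm_hit,
                    lmcache_cached_tokens=max(lmcache_hit, vllm_hit))
    # Everything beyond what LMCache already holds must be saved afterwards.
    save = SaveSpec(skip_leading_tokens=load.lmcache_cached_tokens,
                    num_tokens_to_save=prompt_len - load.lmcache_cached_tokens)
    return load, save

load, save = plan_request(prompt_len=1024, vllm_hit=256, lmcache_hit=768)
print(load.lmcache_cached_tokens, save.num_tokens_to_save)  # → 768 256
```

The key invariant is that nothing already present in either cache is recomputed or re-stored: only the gap between the prompt length and the best available hit is prefetched or saved.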
For details, see Getting Started.
Integration with SGLang is enabled via the --enable-lmcache flag (README.md 59). It supports KV cache offloading and uses the configuration parameters defined in LMCacheEngineConfig (lmcache/v1/config.py 62-220).
Serving Engine Request Lifecycle
Sources: lmcache/integration/vllm/vllm_v1_adapter.py 61-202, lmcache/v1/cache_engine.py 78-93, docs/source/getting_started/installation.rst 109-123
Users can get started with LMCache in several ways:
- pip install lmcache for stable releases (README.md 68-72).
- The lmcache/vllm-openai Docker image for easy deployment (docs/source/getting_started/installation.rst 170-176).
- Building from source with uv and setup.py for development or custom torch versions (docs/source/getting_started/installation.rst 80-101).

For a step-by-step guide on installation and running your first cached request, see the Getting Started child page. For details on the lmcache command-line tool, see the CLI Reference.
Sources: README.md 66-80, docs/source/getting_started/installation.rst 21-33, docs/source/getting_started/installation.rst 161-177, docs/source/index.rst 135-136