This document provides a high-level introduction to LMCache, explaining its role in the LLM inference stack, core architectural components, and operational principles. It is intended to orient new users and developers to the system before diving into specific subsystems.
For detailed information on specific topics, see the dedicated subsystem pages.
Sources: README.md 33-51, docs/source/index.rst 36-43
LMCache is an LLM serving engine extension designed to reduce TTFT (Time-to-First-Token) and increase throughput, particularly in long-context scenarios. It achieves this by storing and reusing the Key-Value (KV) caches of reusable text segments across the datacenter, spanning GPU, CPU, disk, and remote storage such as S3 or Redis (README.md 33-35).
Unlike traditional prefix caching, LMCache can reuse the KV caches of any repeated text segment (not necessarily a prefix) in any serving engine instance (README.md 35-37). This saves precious GPU cycles that would otherwise be spent on redundant prefill computation.
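To see why segment-level reuse can beat prefix-only caching, consider a toy sketch (the helper names and the fixed-size matching scheme below are illustrative assumptions, not the LMCache API or its actual matching algorithm): with prefix caching, a single changed token at the start invalidates everything after it, while chunk-level lookup still recovers matching segments later in the sequence.

```python
def prefix_hit_len(cached: list[int], new: list[int]) -> int:
    """Tokens reusable under prefix-only caching: stop at the first mismatch."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

def chunk_hits(cached: list[int], new: list[int], chunk: int = 4) -> int:
    """Tokens reusable under chunk-level caching: any matching chunk counts."""
    cached_chunks = {tuple(cached[i:i + chunk]) for i in range(0, len(cached), chunk)}
    hits = 0
    for i in range(0, len(new) - chunk + 1, chunk):
        if tuple(new[i:i + chunk]) in cached_chunks:
            hits += chunk
    return hits

cached = list(range(16))           # a previously served token sequence
new = [99] + list(range(1, 16))    # same text, but the very first token differs
print(prefix_hit_len(cached, new))  # → 0: prefix caching reuses nothing
print(chunk_hits(cached, new))      # → 12: three of four chunks still match
```

The larger the shared-but-not-prefix content (system prompts, retrieved documents, few-shot examples), the bigger this gap becomes in practice.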
Key Features:
- P2P KV cache sharing across instances (P2PBackend, lmcache/v1/config.py 126-141)
- Prefill-decode disaggregation (PDBackend, lmcache/v1/config.py 189-213)
- Multi-tier storage management (StorageManager, lmcache/v1/cache_engine.py 166-167)

Sources: README.md 33-40, README.md 55-63, docs/source/index.rst 38-43, lmcache/v1/config.py 126-213, lmcache/integration/vllm/vllm_v1_adapter.py 188-202
The following diagram illustrates how LMCache components bridge the gap between high-level serving engines and low-level storage/GPU operations.
LMCache Component Architecture
Sources: lmcache/integration/vllm/vllm_v1_adapter.py 12-45, lmcache/v1/cache_engine.py 78-103, lmcache/v1/token_database.py 38-47, docs/source/getting_started/installation.rst 109-123
The LMCacheEngine is the primary orchestrator (lmcache/v1/cache_engine.py 78-93). It manages the lifecycle of KV caches by coordinating between token processing, storage management, and GPU memory transfers, and it handles the conversion of GPU KV caches into MemoryObj instances for storage (lmcache/v1/cache_engine.py 81-84).
The TokenDatabase abstract class and its implementations (ChunkedTokenDatabase, SegmentTokenDatabase) manage the relationship between token sequences and cache keys (lmcache/v1/token_database.py 38-47).
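The core idea behind ChunkedTokenDatabase can be sketched as deriving one cache key per fixed-size chunk of tokens. The hashing scheme below is an assumption for illustration only; the real logic lives in lmcache/v1/token_database.py.

```python
import hashlib

CHUNK_SIZE = 256  # mirrors LMCache's default chunk_size

def chunk_keys(tokens: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Derive one cache key per fixed-size chunk (illustrative sketch).

    Each key chains the previous chunk's hash with the current chunk's
    tokens, so a key identifies a chunk *in context*: two sequences that
    share a prefix of whole chunks share those chunks' keys.
    """
    keys = []
    prefix_hash = ""
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        payload = (prefix_hash + ",".join(map(str, chunk))).encode()
        prefix_hash = hashlib.sha256(payload).hexdigest()
        keys.append(prefix_hash)
    return keys

keys = chunk_keys(list(range(600)))
print(len(keys))  # → 3: chunks of 256 + 256 + 88 tokens
```

Because keys are computed per chunk, a lookup can hit on some chunks of a sequence and miss on others, which is what makes partial, modular reuse possible.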
Token sequences are split into fixed-size chunks (default chunk_size=256) to enable modular reuse (lmcache/v1/config.py 64).

The StorageManager implements a multi-tier storage strategy (lmcache/v1/cache_engine.py 166-167). It searches for KV caches across a hierarchy defined in LMCacheEngineConfig (lmcache/v1/config.py 65-83):
1. CPU memory (enabled via local_cpu).
2. Local disk (enabled via local_disk).
3. Remote storage (configured via remote_url).

Sources: lmcache/v1/cache_engine.py 78-93, lmcache/v1/token_database.py 38-47, lmcache/v1/config.py 64-83
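The tier-ordered search can be sketched as follows. This is not the actual StorageManager; the tier names simply mirror the config fields, and the hit-promotion step is a common caching pattern assumed here for illustration.

```python
from collections import OrderedDict

class TieredLookup:
    """Minimal sketch of a fast-to-slow, multi-tier KV cache lookup."""

    def __init__(self):
        # Ordered fast to slow: CPU RAM, local disk, remote (e.g. Redis/S3).
        self.tiers = OrderedDict(local_cpu={}, local_disk={}, remote={})

    def get(self, key):
        """Search tiers in order; copy hits up into the CPU tier."""
        for name, store in self.tiers.items():
            if key in store:
                value = store[key]
                self.tiers["local_cpu"][key] = value  # keep hot entries fast
                return name, value
        return None, None

lookup = TieredLookup()
lookup.tiers["local_disk"]["chunk-42"] = b"kv-bytes"
tier, _ = lookup.get("chunk-42")
print(tier)  # → local_disk (first access comes from the slower tier)
tier, _ = lookup.get("chunk-42")
print(tier)  # → local_cpu (subsequent accesses hit the fast tier)
```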
LMCache is designed to integrate seamlessly with major LLM serving engines:
LMCache is integrated with vLLM v1 using the KVConnectorBase_V1 interface (lmcache/integration/vllm/vllm_v1_adapter.py 12-16). It tracks requests using a RequestTracker to calculate a LoadSpec and SaveSpec (lmcache/integration/vllm/vllm_v1_adapter.py 61-106), which allows vLLM to automatically offload and retrieve KV caches during the request lifecycle.
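The load/save bookkeeping can be pictured with a simplified sketch. The dataclasses and the plan_request helper below are hypothetical stand-ins; the real RequestTracker, LoadSpec, and SaveSpec are defined in lmcache/integration/vllm/vllm_v1_adapter.py.

```python
from dataclasses import dataclass

@dataclass
class LoadSpec:
    vllm_cached_tokens: int      # tokens vLLM already holds in its own KV cache
    lmcache_cached_tokens: int   # tokens LMCache can supply externally

@dataclass
class SaveSpec:
    skip_leading_tokens: int     # already stored; no need to save again
    num_tokens_to_save: int      # freshly prefilled tokens to offload

def plan_request(prompt_len: int, vllm_hit: int, lmcache_hit: int):
    """Decide what to load from LMCache and what to save back after prefill."""
    load = LoadSpec(vllm_cached_tokens=vllm_hit,
                    lmcache_cached_tokens=max(lmcache_hit, vllm_hit))
    # Everything beyond what LMCache already holds must be saved afterwards.
    save = SaveSpec(skip_leading_tokens=load.lmcache_cached_tokens,
                    num_tokens_to_save=prompt_len - load.lmcache_cached_tokens)
    return load, save

load, save = plan_request(prompt_len=1024, vllm_hit=256, lmcache_hit=768)
print(load.lmcache_cached_tokens, save.num_tokens_to_save)  # → 768 256
```

The key invariant is that nothing already present in either cache is recomputed or re-stored: only the gap between the prompt length and the best available hit is prefetched or saved.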
For details, see Getting Started.
Integration with SGLang is enabled via the --enable-lmcache flag (README.md 59). It supports KV cache offloading and uses the configuration parameters defined in LMCacheEngineConfig (lmcache/v1/config.py 62-220).
Serving Engine Request Lifecycle
Sources: lmcache/integration/vllm/vllm_v1_adapter.py 61-202, lmcache/v1/cache_engine.py 78-93, docs/source/getting_started/installation.rst 109-123
Users can get started with LMCache in several ways:
- pip install lmcache for stable releases (README.md 68-72).
- The lmcache/vllm-openai Docker image for easy deployment (docs/source/getting_started/installation.rst 170-176).
- Building from source with uv and setup.py for development or custom torch versions (docs/source/getting_started/installation.rst 80-101).

For a step-by-step guide on installation and running your first cached request, see the Getting Started child page. For details on the lmcache command-line tool, see the CLI Reference.
Sources: README.md 66-80, docs/source/getting_started/installation.rst 21-33, docs/source/getting_started/installation.rst 161-177, docs/source/index.rst 135-136