PegaFlow

KV cache on the wings of Pegasus.

PegaFlow is a high-performance KV cache storage engine for LLM inference. Offload KV cache from GPU to host memory or SSD, and share it across nodes via RDMA.

Decoupled from inference lifecycle — runs as an independent sidecar; KV cache survives engine restarts, scales independently, and is shared across instances
Topology-aware, PCIe-saturating transfers — NUMA-aware pinned memory + layer-wise DMA to maximize hardware bandwidth
GIL-free Rust core — zero Python overhead on the hot path; your inference engine keeps its threads
Production-ready observability — built-in Prometheus metrics and OTLP export, not an afterthought
Pluggable — works with vLLM and SGLang as a drop-in KV connector

Framework Integration

Framework	Status	Link
vLLM	✅ Ready	Quick Start
SGLang	🚧 Under Review	PR #17221

Quick Start

1. Install

uv pip install pegaflow-llm        # CUDA 12
uv pip install pegaflow-llm-cu13   # CUDA 13

2. Start PegaFlow Server

pegaflow-server

3. Launch your inference engine

vLLM (recommended):

vllm serve Qwen/Qwen3-0.6B \
  --kv-transfer-config '{"kv_connector": "PegaKVConnector", "kv_role": "kv_both", "kv_connector_module_path": "pegaflow.connector"}'
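
The JSON passed to --kv-transfer-config is easy to mis-quote in a shell. As a minimal sketch, the same command can be assembled from Python so that the JSON is serialized and quoted programmatically. The helper name build_vllm_command is hypothetical and not part of PegaFlow or vLLM; the connector name, role, and module path are the values from the command above.

```python
import json
import shlex

# Values copied from the vLLM command above.
KV_TRANSFER_CONFIG = {
    "kv_connector": "PegaKVConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "pegaflow.connector",
}


def build_vllm_command(model: str) -> list[str]:
    """Return the argv list for `vllm serve` with the PegaFlow connector.

    Hypothetical helper for illustration; not part of PegaFlow or vLLM.
    """
    return [
        "vllm",
        "serve",
        model,
        "--kv-transfer-config",
        json.dumps(KV_TRANSFER_CONFIG),
    ]


if __name__ == "__main__":
    argv = build_vllm_command("Qwen/Qwen3-0.6B")
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

The returned list can also be passed straight to subprocess.run(argv), which avoids shell quoting altogether.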

SGLang:

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-0.6B \
  --enable-pegaflow

For full server options, multi-node setup, and advanced configuration, see Server Configuration.

Development

Build from source

export PYO3_PYTHON=$(which python)
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

cargo run -r                    # start server
cd python && maturin develop -r # build Python bindings

We use Conventional Commits — run cz c for an interactive commit prompt.

Benchmarks

KV Cache Benchmark

H800 reference numbers with Llama-3.1-8B (8 prompts, 10K-token prefill, 1-token decode, 4.0 req/s):

Configuration	TTFT mean (ms)	TTFT p99 (ms)
PegaFlow (Cold)	572.5	1113.7
PegaFlow (Warm)	61.5	77.0

With a warm cache, mean TTFT drops from 572.5 ms to 61.5 ms (about 9.3x lower) and p99 TTFT drops from 1113.7 ms to 77.0 ms (about 14.5x lower), which shows that cached KV blocks are reused effectively across requests.
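
The speedup follows directly from the table. As a quick check, the short script below recomputes the cold-to-warm ratios from the reported numbers; the variable and function names are illustrative only.

```python
# TTFT values in milliseconds, copied from the benchmark table above.
ttft_ms = {
    "cold": {"mean": 572.5, "p99": 1113.7},
    "warm": {"mean": 61.5, "p99": 77.0},
}


def speedup(metric: str) -> float:
    """Cold-start TTFT divided by warm-start TTFT for the given metric."""
    return ttft_ms["cold"][metric] / ttft_ms["warm"][metric]


if __name__ == "__main__":
    for metric in ("mean", "p99"):
        print(f"{metric}: {speedup(metric):.1f}x")
```

Running it gives roughly 9.3x for the mean and 14.5x for p99, so the tail latency improves more than the mean in this run.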

Documentation

Server Configuration — full CLI options, SSD cache, multi-node setup
P2P KV Cache Sharing — cross-node RDMA setup, tuning, and troubleshooting
P/D Router — prefill/decode disaggregation
vLLM I/O Patch — optional patch for better transfer throughput
Metrics — Prometheus and OTLP metrics reference
Goals & Non-Goals — project scope and design philosophy
