[Feat]New radix cache backend: pegaflow by jimmy-evo · Pull Request #17221 · sgl-project/sglang

jimmy-evo · 2026-01-16T11:18:42Z

Hi from novita.ai team 👋

This PR adds a new KV cache backend: PegaFlow.

PegaFlow centralizes machine-level KV cache management into a standalone Rust process. Inference engines map their KV cache to PegaFlow via CUDA IPC, enabling D2H/H2D transfers to occur in a separate process while communicating through gRPC.

Key Benefits

Memory pooling across instances: Multiple model instances on a single machine can share a unified CPU KV cache pool. For example, 8 GPUs each serving a 3B model can share the same pinned memory pool.
Single-copy storage for TP-sharded MLA: When deploying MLA architectures with tensor parallelism, only one copy of the KV cache needs to be stored since all TP ranks connect to the same PegaFlow server.
GIL-free offload/prefetch logic: Remote storage interactions (offload, prefetch) run in Rust threads within the PegaFlow server process, avoiding Python GIL contention entirely.
Layer-first storage from day one: PegaFlow uses layer-first layout as its native storage format, with optimizations specifically designed to maximize overlap benefits from layer-wise transfers.
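To illustrate why a layer-first layout suits layer-wise transfers, here is a minimal sketch. The sizes and function names are hypothetical and not taken from PegaFlow; it only shows that, when the buffer is indexed [layer][page], all pages of one layer form a single contiguous span, whereas a page-first layout yields one strided chunk per page.

```python
# Hypothetical sizes for illustration only; not taken from PegaFlow's source.
NUM_LAYERS = 4
NUM_PAGES = 8
PAGE_BYTES = 1024  # bytes of KV data per page, per layer


def layer_first_offset(layer: int, page: int) -> int:
    """Byte offset when the buffer is indexed [layer][page]."""
    return (layer * NUM_PAGES + page) * PAGE_BYTES


def page_first_offset(layer: int, page: int) -> int:
    """Byte offset when the buffer is indexed [page][layer]."""
    return (page * NUM_LAYERS + layer) * PAGE_BYTES


def is_contiguous(offsets: list[int]) -> bool:
    """True if the offsets form one gap-free run of PAGE_BYTES-sized chunks."""
    return all(b - a == PAGE_BYTES for a, b in zip(offsets, offsets[1:]))


layer = 2
layer_first = [layer_first_offset(layer, p) for p in range(NUM_PAGES)]
page_first = [page_first_offset(layer, p) for p in range(NUM_PAGES)]

# Layer-first: the whole layer can move in one copy.
print("layer-first contiguous:", is_contiguous(layer_first))
# Page-first: the same layer is scattered across NUM_PAGES chunks.
print("page-first contiguous:", is_contiguous(page_first))
```

With one contiguous span per layer, the transfer for layer N can in principle be issued as a single copy and overlapped with compute on other layers.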

Modifications

Enable with the argument:

--enable-pegaflow

Accuracy Tests

MMLU result from sglang.test.run_eval:
Device: H200, TP8, DeepSeek-V3.2

python3 -m sglang.test.run_eval --port 8031 --eval-name mmlu --num-examples 100 --num-threads 16
{'other': np.float64(0.95), 'other:std': np.float64(0.21794494717703372), 'score:std': np.float64(0.2861817604250837), 'stem': np.float64(0.9047619047619048), 'stem:std': np.float64(0.29354352395090366), 'humanities': np.float64(0.8571428571428571), 'humanities:std': np.float64(0.3499271061118826), 'social_sciences': np.float64(0.9583333333333334), 'social_sciences:std': np.float64(0.19982631347136331), 'score': np.float64(0.91)}
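As a rough guide to how much weight a 100-example run can carry, the sketch below derives a standard error from the reported 'score' and 'score:std'. It assumes each question is scored 0 or 1 and that 'score:std' is the population standard deviation of those per-question scores; the check against sqrt(p * (1 - p)) is consistent with that assumption.

```python
import math

score = 0.91                     # 'score' from the result above
score_std = 0.2861817604250837   # 'score:std' from the result above
num_examples = 100               # --num-examples from the command above

# For 0/1 per-question scores, the population std is sqrt(p * (1 - p)).
bernoulli_std = math.sqrt(score * (1 - score))

# Standard error of the mean over num_examples questions.
std_err = score_std / math.sqrt(num_examples)

print(f"std check: {bernoulli_std:.4f} vs reported {score_std:.4f}")
print(f"score = {score:.2f} +/- {std_err:.3f} (one standard error)")
```

At this sample size, differences of a few points between runs fall within one standard error.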

Benchmark

H20-3e TP8

No radix cache

python3 -m sglang.launch_server  --model-path /data/models/GLM-4.7-FP8/ --served-model-name glm47 --trust-remote-code --page-size "128" --reasoning-parser glm45 --tool-call-parser glm47 --enable-metrics --collect-tokens-histogram  --enable-cache-report --host "0.0.0.0" --port 8000 --kv-cache-dtype fp8_e4m3 --mem-fraction-static "0.83" --max-running-requests "64" --max-prefill-tokens "24576" --chunked-prefill-size "32768" --tp-size "8" --disable-radix-cache

Benchmark script (random input length 4096, output length 128):

python -m sglang.bench_serving --seed 42 --backend sglang-oai-chat --model /data/models/GLM-4.7-FP8 --port 8000 --dataset-name random --random-input-len 4096 --random-output-len 128 --num-prompts 500

result:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  117.27
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    9
Request throughput (req/s):              4.26
Input token throughput (tok/s):          8945.05
Output token throughput (tok/s):         275.73
Peak output token throughput (tok/s):    499.00
Peak concurrent requests:                500
Total token throughput (tok/s):          9220.78
Concurrency:                             294.40
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   69047.03
Median E2E Latency (ms):                 70669.69
P90 E2E Latency (ms):                    112434.58
P99 E2E Latency (ms):                    115244.69
---------------Time to First Token----------------
Mean TTFT (ms):                          225.07
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2361.77
Median TPOT (ms):                        1011.14
P99 TPOT (ms):                           22333.73
---------------Inter-Token Latency----------------
Mean ITL (ms):                           42.63
Median ITL (ms):                         42.64
P95 ITL (ms):                            42.88
P99 ITL (ms):                            42.91
Max ITL (ms):                            42.92
==================================================

With PegaFlow (500 GB memory)

--enable-pegaflow

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  122.74
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    0
Request throughput (req/s):              4.07
Input token throughput (tok/s):          8545.95
Output token throughput (tok/s):         263.43
Peak output token throughput (tok/s):    500.00
Peak concurrent requests:                500
Total token throughput (tok/s):          8809.38
Concurrency:                             292.18
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   71726.88
Median E2E Latency (ms):                 72986.69
P90 E2E Latency (ms):                    117841.47
P99 E2E Latency (ms):                    120848.20
---------------Time to First Token----------------
Mean TTFT (ms):                          0.00
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2447.14
Median TPOT (ms):                        1053.13
P99 TPOT (ms):                           23152.92
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

After flushing the L1 cache:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  74.49
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    0
Request throughput (req/s):              6.71
Input token throughput (tok/s):          14082.11
Output token throughput (tok/s):         434.08
Peak output token throughput (tok/s):    500.00
Peak concurrent requests:                500
Total token throughput (tok/s):          14516.19
Concurrency:                             274.42
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40881.69
Median E2E Latency (ms):                 41759.84
P90 E2E Latency (ms):                    69137.59
P99 E2E Latency (ms):                    72521.94
---------------Time to First Token----------------
Mean TTFT (ms):                          0.00
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1379.12
Median TPOT (ms):                        605.74
P99 TPOT (ms):                           13373.51
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================
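The three result blocks above can be compared with a few lines of arithmetic. The sketch below copies the reported benchmark duration and input token throughput and expresses each run relative to the no-radix-cache baseline; the run labels are shorthand chosen here, not names from the PR.

```python
# Figures copied from the three result blocks above.
runs = {
    "no radix cache": {"duration_s": 117.27, "input_tok_s": 8945.05},
    "pegaflow, first run": {"duration_s": 122.74, "input_tok_s": 8545.95},
    "pegaflow, after L1 flush": {"duration_s": 74.49, "input_tok_s": 14082.11},
}

baseline = runs["no radix cache"]

# For each run: (throughput ratio, duration ratio) relative to the baseline.
ratios = {
    name: (
        r["input_tok_s"] / baseline["input_tok_s"],
        r["duration_s"] / baseline["duration_s"],
    )
    for name, r in runs.items()
}

for name, (tput, dur) in ratios.items():
    print(f"{name:26s} throughput x{tput:.2f}  duration x{dur:.2f}")
```

On these numbers the first PegaFlow run is roughly 4 to 5 percent slower than the baseline, while the rerun after the L1 flush reaches about 1.57 times the baseline input throughput.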

HiCache, L2 only

Requires at least 640 GB of memory.

python3 -m sglang.launch_server --model-path /data/models/GLM-4.7-FP8/ --served-model-name glm47 --trust-remote-code --page-size "128" --reasoning-parser glm45 --tool-call-parser glm47 --enable-metrics --collect-tokens-histogram  --enable-cache-report --host "0.0.0.0" --port 8000 --kv-cache-dtype fp8_e4m3 --mem-fraction-static "0.83" --max-running-requests "64" --max-prefill-tokens "24576" --chunked-prefill-size "32768" --tp-size "8" --enable-hierarchical-cache --hicache-size 80 --hicache-write-policy write_through --hicache-io-backend direct  --hicache-mem-layout layer_first --max-total-token 280000

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  122.38
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    8
Request throughput (req/s):              4.09
Input token throughput (tok/s):          8570.96
Output token throughput (tok/s):         264.20
Peak output token throughput (tok/s):    499.00
Peak concurrent requests:                500
Total token throughput (tok/s):          8835.16
Concurrency:                             292.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   71647.15
Median E2E Latency (ms):                 73111.48
P90 E2E Latency (ms):                    117442.14
P99 E2E Latency (ms):                    120417.76
---------------Time to First Token----------------
Mean TTFT (ms):                          235.33
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2444.20
Median TPOT (ms):                        1051.54
P99 TPOT (ms):                           23117.02
---------------Inter-Token Latency----------------
Mean ITL (ms):                           44.09
Median ITL (ms):                         43.97
P95 ITL (ms):                            44.72
P99 ITL (ms):                            44.83
Max ITL (ms):                            44.86
==================================================

After the cache is populated:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  74.45
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    9
Request throughput (req/s):              6.72
Input token throughput (tok/s):          14089.16
Output token throughput (tok/s):         434.30
Peak output token throughput (tok/s):    499.00
Peak concurrent requests:                500
Total token throughput (tok/s):          14523.46
Concurrency:                             274.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40911.48
Median E2E Latency (ms):                 41672.92
P90 E2E Latency (ms):                    69527.26
P99 E2E Latency (ms):                    72407.62
---------------Time to First Token----------------
Mean TTFT (ms):                          139.29
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1380.84
Median TPOT (ms):                        608.65
P99 TPOT (ms):                           13474.01
---------------Inter-Token Latency----------------
Mean ITL (ms):                           43.79
Median ITL (ms):                         43.70
P95 ITL (ms):                            44.52
P99 ITL (ms):                            44.61
Max ITL (ms):                            44.63
==================================================
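A small sketch of the memory comparison implied by these runs, using only figures stated in this description: 500 GB for the PegaFlow run, at least 640 GB for the HiCache run, and the warm-cache input throughput of each. It assumes both memory figures refer to host memory for the CPU KV cache, which the description does not spell out.

```python
# Memory figures as stated in the section labels above (assumed host memory).
pegaflow_mem_gb = 500
hicache_mem_gb = 640

# Warm-cache input token throughput (tok/s) from the result blocks above.
pegaflow_tok_s = 14082.11   # PegaFlow, after the L1 flush
hicache_tok_s = 14089.16    # HiCache, populated

saved_gb = hicache_mem_gb - pegaflow_mem_gb
saved_frac = saved_gb / hicache_mem_gb
throughput_gap = (hicache_tok_s - pegaflow_tok_s) / hicache_tok_s

print(f"memory saved: {saved_gb} GB ({saved_frac:.1%})")
print(f"throughput gap: {throughput_gap:.3%}")
```

Under that assumption, the two backends reach nearly identical warm-cache throughput while PegaFlow uses about 22 percent less memory.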

MMLU accuracy

First run

Total latency: 162.088 s
Score: 0.753
[METRIC] mmlu_score=0.7533333333333333 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
[METRIC] mmlu_latency=162.08847578521818 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
Writing report to /tmp/mmlu__data_models_GLM-4.7-FP8_.html
{'other': np.float64(0.8405797101449275), 'other:std': np.float64(0.36606756348739383), 'score:std': np.float64(0.4310710176087256), 'stem': np.float64(0.8769230769230769), 'stem:std': np.float64(0.32852548467788645), 'humanities': np.float64(0.5656565656565656), 'humanities:std': np.float64(0.49567047056102215), 'social_sciences': np.float64(0.8208955223880597), 'social_sciences:std': np.float64(0.3834397784676158), 'score': np.float64(0.7533333333333333)}

After flush_cache

Total latency: 156.795 s
Score: 0.760
[METRIC] mmlu_score=0.76 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
[METRIC] mmlu_latency=156.79523031320423 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
Writing report to /tmp/mmlu__data_models_GLM-4.7-FP8_.html
{'other': np.float64(0.855072463768116), 'other:std': np.float64(0.35202776236206146), 'score:std': np.float64(0.4270831300812525), 'stem': np.float64(0.8923076923076924), 'stem:std': np.float64(0.3099914104555367), 'humanities': np.float64(0.5656565656565656), 'humanities:std': np.float64(0.49567047056102215), 'social_sciences': np.float64(0.8208955223880597), 'social_sciences:std': np.float64(0.3834397784676158), 'score': np.float64(0.76)}
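For comparing the two accuracy runs programmatically, the [METRIC] lines above can be parsed with a small helper. This is a sketch based only on the line shape visible in this log (a name, a numeric value, and a JSON labels object); parse_metric is a name chosen here, not an SGLang function.

```python
import json
import re

# Matches lines shaped like: [METRIC] name=value labels={...json...}
METRIC_RE = re.compile(r"^\[METRIC\] (\w+)=([0-9.eE+-]+) labels=(\{.*\})$")


def parse_metric(line: str):
    """Return (name, value, labels) for a [METRIC] line, or None otherwise."""
    match = METRIC_RE.match(line.strip())
    if match is None:
        return None
    name, value, labels = match.groups()
    return name, float(value), json.loads(labels)


# The two mmlu_score lines from the runs above.
log = [
    '[METRIC] mmlu_score=0.7533333333333333 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}',
    '[METRIC] mmlu_score=0.76 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}',
]

first_run, after_flush = (parse_metric(line) for line in log)
delta = after_flush[1] - first_run[1]
print(f"mmlu_score change after flush_cache: {delta:+.4f}")
```

The two scores differ by less than one point, so cache reuse shows no visible accuracy impact in this test.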

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-16T11:19:04Z

Summary of Changes

Hello @jimmy-evo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Pegaflow, a new and advanced KV cache backend, into the system, significantly enhancing memory management and performance for large language models. Alongside this major feature, it refines the tracing mechanism to ensure better observability by propagating external trace contexts more effectively. Additionally, a crucial check for local model file integrity has been implemented to prevent runtime errors caused by incomplete Hugging Face model downloads.

Highlights

New KV Cache Backend: Introduced Pegaflow as an alternative radix cache backend, offering advanced features for KV cache management.
Enhanced Cache Capabilities: Pegaflow supports standalone KV cache offloading and sharing, layerwise transfer, asynchronous device-to-CPU copying, and single-copy storage for TP-sharded MLA (Multi-head Latent Attention) models.
Trace Context Propagation: Improved the propagation of external trace headers across Engine methods (generate, async_generate, encode, async_encode) and request input structures (GenerateReqInput, EmbeddingReqInput).
Model File Integrity Check: Added a utility to verify that all files referenced in .safetensors.index.json for Hugging Face models exist locally, preventing issues with incomplete model downloads.
Versioning Configuration: Updated pyproject.toml files to include a git_describe_command for setuptools_scm, which helps in more robust version determination from git tags.
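The model file integrity check described in the highlights can be sketched as follows. This is not the code added by the PR; missing_shards is a hypothetical helper that assumes the usual Hugging Face sharded-checkpoint layout, where model.safetensors.index.json contains a "weight_map" from tensor names to shard file names.

```python
import json
from pathlib import Path


def missing_shards(
    model_dir: str, index_name: str = "model.safetensors.index.json"
) -> list[str]:
    """List shard files named in the index that are absent from model_dir.

    Hypothetical helper for illustration; not the function added in this PR.
    """
    root = Path(model_dir)
    with (root / index_name).open() as f:
        index = json.load(f)

    # "weight_map" maps each tensor name to the shard file that stores it.
    shard_files = set(index["weight_map"].values())
    return sorted(name for name in shard_files if not (root / name).is_file())
```

A caller could run this before loading and re-download the snapshot when the returned list is non-empty, rather than failing later during weight loading.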


Ignored Files

Ignored by pattern: .github/workflows/** (2)
- .github/workflows/release-docker-npu.yml
- .github/workflows/release-docker-xeon.yml


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces pegaflow as a new radix cache backend, which can be enabled with the --enable-pegaflow flag. The implementation looks solid. Additionally, the PR includes several other improvements: it refactors trace context propagation for better clarity by using external_trace_header, and enhances the model loading process by adding checks for incomplete local snapshots, which improves robustness. The configuration for setuptools_scm is also updated.

I have one minor suggestion regarding a potential typo in the pegaflow import to improve naming consistency. Overall, these are great additions to the project.

I am having trouble creating individual review comments. Click here to see my feedback.

python/sglang/srt/managers/scheduler.py (693-695)

There seems to be a typo in the import path and class name. The feature is named "pegaflow", but here it's written as "peagflow". For consistency, I suggest renaming peagflow_radix_cache to pegaflow_radix_cache and PeagflowRadixCache to PegaflowRadixCache in the pegaflow library, and updating the import here accordingly.

                from pegaflow.sglang.pegaflow_radix_cache import PegaflowRadixCache

                self.tree_cache = PegaflowRadixCache(

hzh0425 · 2026-01-16T15:42:48Z

Could you please paste the performance benchmark comparison results?

jimmy-evo · 2026-01-26T07:09:21Z

/gemini summary

gemini-code-assist · 2026-01-26T07:09:24Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

jimmy-evo · 2026-01-26T10:06:07Z

Could you please paste the performance benchmark comparison results?

@hzh0425
I have updated the PR with benchmark results. PegaFlow can save a lot of memory for caching.

jimmy-evo requested review from CatherineSue, Fridge003, JustinTong0323, Kangyan-Zhou, Ying1123, hnyls2002, ispobock, merrymercy, slin1237 and xiezhq-hermann as code owners January 16, 2026 11:18

github-actions Bot added the dependencies (Pull requests that update a dependency file) and npu labels Jan 16, 2026

support pegaflow

cd4941b

jimmy-evo force-pushed the feat/pegaflow_adapt branch from b79bf7f to cd4941b Compare January 16, 2026 11:20

gemini-code-assist Bot reviewed Jan 16, 2026


feat(pega): support scheduler graceful exit, to unregister pegaflow

70c115a

feat: better signal handler

ead80af

refactor: rename unregister -> shutdown

bf09afc
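The commits above mention a graceful scheduler exit that shuts down the PegaFlow connection from a signal handler. The PR's actual code is not shown on this page, so the following is only a generic sketch of that pattern; CacheClient, make_shutdown_handler, and install_handler are hypothetical names and do not reflect PegaFlow's or SGLang's API.

```python
import signal


class CacheClient:
    """Stand-in for a cache backend handle; hypothetical, not PegaFlow's API."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        # Idempotent: safe to call from both a signal handler and normal exit.
        if not self.closed:
            self.closed = True


def make_shutdown_handler(client: CacheClient):
    """Build a signal handler that releases the client, then exits."""

    def handler(signum, frame):
        client.shutdown()
        # Conventional exit status for termination by a signal.
        raise SystemExit(128 + signum)

    return handler


def install_handler(handler, signums=(signal.SIGTERM, signal.SIGINT)) -> None:
    """Register the handler; must be called from the main thread."""
    for signum in signums:
        signal.signal(signum, handler)


# Typical wiring in a process entry point (not executed here):
#     client = CacheClient()
#     install_handler(make_shutdown_handler(client))
```

Raising SystemExit from the handler lets finally blocks and atexit hooks run, which is one common way to make sure a remote registration is released before the process ends.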

jimmy-evo wants to merge 4 commits into sgl-project:main from novitalabs:feat/pegaflow_adapt


2 participants
