
[Feat] New radix cache backend: pegaflow #17221

Open
jimmy-evo wants to merge 4 commits into sgl-project:main from novitalabs:feat/pegaflow_adapt


jimmy-evo commented Jan 16, 2026
Contributor

Hi from the novita.ai team 👋

This PR adapts a new KV cache backend: PegaFlow.

PegaFlow centralizes machine-level KV cache management into a standalone Rust process. Inference engines map their KV cache to PegaFlow via CUDA IPC, enabling D2H/H2D transfers to occur in a separate process while communicating through gRPC.

Key Benefits

  • Memory pooling across instances: Multiple model instances on a single machine can share a unified CPU KV cache pool. For example, 8 GPUs each serving a 3B model can share the same pinned memory pool.
  • Single-copy storage for TP-sharded MLA: When deploying MLA architectures with tensor parallelism, only one copy of the KV cache needs to be stored since all TP ranks connect to the same PegaFlow server.
  • GIL-free offload/prefetch logic: Remote storage interactions (offload, prefetch) run in Rust threads within the PegaFlow server process, avoiding Python GIL contention entirely.
  • Layer-first storage from day one: PegaFlow uses layer-first layout as its native storage format, with optimizations specifically designed to maximize overlap benefits from layer-wise transfers.
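To make the layer-first bullet concrete, here is a minimal sketch of how a layer-first layout differs from a page-first one when computing byte offsets into a flat host buffer. The function names and sizes are illustrative assumptions, not PegaFlow's actual data structures.

```python
# Toy model of two host-buffer layouts. Names and sizes are hypothetical
# and are not taken from PegaFlow.

def layer_first_offset(layer: int, page: int, num_pages: int, page_bytes: int) -> int:
    """All pages of one layer are contiguous: buffer[layer][page]."""
    return (layer * num_pages + page) * page_bytes


def page_first_offset(layer: int, page: int, num_layers: int, page_bytes: int) -> int:
    """All layers of one page are contiguous: buffer[page][layer]."""
    return (page * num_layers + layer) * page_bytes


NUM_LAYERS, NUM_PAGES, PAGE_BYTES = 4, 8, 1024

# Offsets touched when loading pages 2..4 of layer 1.
lf = [layer_first_offset(1, p, NUM_PAGES, PAGE_BYTES) for p in range(2, 5)]
pf = [page_first_offset(1, p, NUM_LAYERS, PAGE_BYTES) for p in range(2, 5)]

print(lf)  # [10240, 11264, 12288] -> one contiguous range
print(pf)  # [9216, 13312, 17408]  -> strided by NUM_LAYERS * PAGE_BYTES
```

In the layer-first case, consecutive pages of one layer form a single contiguous range, so a per-layer transfer can in principle be issued as one copy and overlapped with compute on the previous layer.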

Modifications

Enable with the argument:

--enable-pegaflow
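The flag is appended to an ordinary launch command. A minimal sketch, assuming a subset of the base flags from the benchmark commands in this description; the helper name `launch_command` is hypothetical, and starting the PegaFlow server itself is not covered because the description does not document it.

```python
import shlex

# Base flags copied from the benchmark commands in this PR description.
BASE_ARGS = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "/data/models/GLM-4.7-FP8/",
    "--tp-size", "8",
    "--page-size", "128",
]


def launch_command(enable_pegaflow: bool) -> str:
    """Build a shell-safe launch command, optionally enabling PegaFlow."""
    args = list(BASE_ARGS)
    if enable_pegaflow:
        args.append("--enable-pegaflow")
    return shlex.join(args)


print(launch_command(enable_pegaflow=True))
```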

Accuracy Tests

MMLU result from sglang's run_eval:
Device: H200, TP8, Deepseek-V3.2

python3 -m sglang.test.run_eval --port 8031 --eval-name mmlu --num-examples 100 --num-threads 16
{'other': np.float64(0.95), 'other:std': np.float64(0.21794494717703372), 'score:std': np.float64(0.2861817604250837), 'stem': np.float64(0.9047619047619048), 'stem:std': np.float64(0.29354352395090366), 'humanities': np.float64(0.8571428571428571), 'humanities:std': np.float64(0.3499271061118826), 'social_sciences': np.float64(0.9583333333333334), 'social_sciences:std': np.float64(0.19982631347136331), 'score': np.float64(0.91)}
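The `:std` fields are easier to read once you notice they match the standard deviation of a 0/1 score. A minimal sketch, assuming each question is scored as either 0 or 1, that reproduces `score:std` from `score` and derives the standard error for 100 examples:

```python
import math

# Values copied from the MMLU result above (Deepseek-V3.2).
score = 0.91
reported_std = 0.2861817604250837
num_examples = 100  # from --num-examples 100

# For 0/1 per-question scores, the population std is sqrt(p * (1 - p)).
bernoulli_std = math.sqrt(score * (1.0 - score))

# Standard error of the mean accuracy over num_examples questions.
std_error = bernoulli_std / math.sqrt(num_examples)

print(f"std from score: {bernoulli_std:.4f} (reported {reported_std:.4f})")
print(f"standard error: {std_error:.4f}")
```

With only 100 examples the standard error is close to 3 percentage points, so this run is a sanity check rather than a precise accuracy measurement.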

Benchmark

H20-3e TP8

No radix cache

python3 -m sglang.launch_server  --model-path /data/models/GLM-4.7-FP8/ --served-model-name glm47 --trust-remote-code --page-size "128" --reasoning-parser glm45 --tool-call-parser glm47 --enable-metrics --collect-tokens-histogram  --enable-cache-report --host "0.0.0.0" --port 8000 --kv-cache-dtype fp8_e4m3 --mem-fraction-static "0.83" --max-running-requests "64" --max-prefill-tokens "24576" --chunked-prefill-size "32768" --tp-size "8" --disable-radix-cache

Benchmark script (input length 4096, output length 128):

python -m sglang.bench_serving --seed 42 --backend sglang-oai-chat --model /data/models/GLM-4.7-FP8 --port 8000 --dataset-name random --random-input-len 4096 --random-output-len 128 --num-prompts 500

result:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  117.27
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    9
Request throughput (req/s):              4.26
Input token throughput (tok/s):          8945.05
Output token throughput (tok/s):         275.73
Peak output token throughput (tok/s):    499.00
Peak concurrent requests:                500
Total token throughput (tok/s):          9220.78
Concurrency:                             294.40
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   69047.03
Median E2E Latency (ms):                 70669.69
P90 E2E Latency (ms):                    112434.58
P99 E2E Latency (ms):                    115244.69
---------------Time to First Token----------------
Mean TTFT (ms):                          225.07
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2361.77
Median TPOT (ms):                        1011.14
P99 TPOT (ms):                           22333.73
---------------Inter-Token Latency----------------
Mean ITL (ms):                           42.63
Median ITL (ms):                         42.64
P95 ITL (ms):                            42.88
P99 ITL (ms):                            42.91
Max ITL (ms):                            42.92
==================================================
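As a reading aid, the headline throughput figures follow from the totals and the benchmark duration. A small sketch that recomputes them from the numbers in the block above; small differences are expected because the printed duration is rounded to two decimals.

```python
# Figures copied from the "No radix cache" result block above.
duration_s = 117.27
num_requests = 500
input_tokens = 1_048_956
output_tokens = 32_334

request_tput = num_requests / duration_s
input_tput = input_tokens / duration_s
output_tput = output_tokens / duration_s
total_tput = input_tput + output_tput

print(f"request throughput: {request_tput:.2f} req/s")
print(f"input throughput:   {input_tput:.2f} tok/s")
print(f"output throughput:  {output_tput:.2f} tok/s")
print(f"total throughput:   {total_tput:.2f} tok/s")
```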

With PegaFlow (500 GB of memory)

--enable-pegaflow
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  122.74
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    0
Request throughput (req/s):              4.07
Input token throughput (tok/s):          8545.95
Output token throughput (tok/s):         263.43
Peak output token throughput (tok/s):    500.00
Peak concurrent requests:                500
Total token throughput (tok/s):          8809.38
Concurrency:                             292.18
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   71726.88
Median E2E Latency (ms):                 72986.69
P90 E2E Latency (ms):                    117841.47
P99 E2E Latency (ms):                    120848.20
---------------Time to First Token----------------
Mean TTFT (ms):                          0.00
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2447.14
Median TPOT (ms):                        1053.13
P99 TPOT (ms):                           23152.92
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

After flushing the L1 cache

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  74.49
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    0
Request throughput (req/s):              6.71
Input token throughput (tok/s):          14082.11
Output token throughput (tok/s):         434.08
Peak output token throughput (tok/s):    500.00
Peak concurrent requests:                500
Total token throughput (tok/s):          14516.19
Concurrency:                             274.42
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40881.69
Median E2E Latency (ms):                 41759.84
P90 E2E Latency (ms):                    69137.59
P99 E2E Latency (ms):                    72521.94
---------------Time to First Token----------------
Mean TTFT (ms):                          0.00
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1379.12
Median TPOT (ms):                        605.74
P99 TPOT (ms):                           13373.51
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

HiCache, L2 only

Needs at least 640 GB of memory

python3 -m sglang.launch_server --model-path /data/models/GLM-4.7-FP8/ --served-model-name glm47 --trust-remote-code --page-size "128" --reasoning-parser glm45 --tool-call-parser glm47 --enable-metrics --collect-tokens-histogram  --enable-cache-report --host "0.0.0.0" --port 8000 --kv-cache-dtype fp8_e4m3 --mem-fraction-static "0.83" --max-running-requests "64" --max-prefill-tokens "24576" --chunked-prefill-size "32768" --tp-size "8" --enable-hierarchical-cache --hicache-size 80 --hicache-write-policy write_through --hicache-io-backend direct  --hicache-mem-layout layer_first --max-total-token 280000
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  122.38
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    8
Request throughput (req/s):              4.09
Input token throughput (tok/s):          8570.96
Output token throughput (tok/s):         264.20
Peak output token throughput (tok/s):    499.00
Peak concurrent requests:                500
Total token throughput (tok/s):          8835.16
Concurrency:                             292.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   71647.15
Median E2E Latency (ms):                 73111.48
P90 E2E Latency (ms):                    117442.14
P99 E2E Latency (ms):                    120417.76
---------------Time to First Token----------------
Mean TTFT (ms):                          235.33
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2444.20
Median TPOT (ms):                        1051.54
P99 TPOT (ms):                           23117.02
---------------Inter-Token Latency----------------
Mean ITL (ms):                           44.09
Median ITL (ms):                         43.97
P95 ITL (ms):                            44.72
P99 ITL (ms):                            44.83
Max ITL (ms):                            44.86
==================================================

populated:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     500
Benchmark duration (s):                  74.45
Total input tokens:                      1048956
Total input text tokens:                 1048956
Total generated tokens:                  32334
Total generated tokens (retokenized):    9
Request throughput (req/s):              6.72
Input token throughput (tok/s):          14089.16
Output token throughput (tok/s):         434.30
Peak output token throughput (tok/s):    499.00
Peak concurrent requests:                500
Total token throughput (tok/s):          14523.46
Concurrency:                             274.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40911.48
Median E2E Latency (ms):                 41672.92
P90 E2E Latency (ms):                    69527.26
P99 E2E Latency (ms):                    72407.62
---------------Time to First Token----------------
Mean TTFT (ms):                          139.29
Median TTFT (ms):                        0.00
P99 TTFT (ms):                           0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1380.84
Median TPOT (ms):                        608.65
P99 TPOT (ms):                           13474.01
---------------Inter-Token Latency----------------
Mean ITL (ms):                           43.79
Median ITL (ms):                         43.70
P95 ITL (ms):                            44.52
P99 ITL (ms):                            44.61
Max ITL (ms):                            44.63
==================================================
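The five result blocks above can be compared side by side. A minimal sketch that uses only the benchmark duration and input throughput copied from those blocks; the run labels are shorthand introduced here, not names used by the benchmark tool.

```python
# (benchmark duration in s, input throughput in tok/s), copied from the
# result blocks above. The dictionary keys are shorthand labels.
RUNS = {
    "no radix cache": (117.27, 8945.05),
    "pegaflow, first run": (122.74, 8545.95),
    "pegaflow, after L1 flush": (74.49, 14082.11),
    "hicache L2, first run": (122.38, 8570.96),
    "hicache L2, populated": (74.45, 14089.16),
}

base_duration, _ = RUNS["no radix cache"]


def speedup(name: str) -> float:
    """Baseline duration divided by this run's duration."""
    duration, _ = RUNS[name]
    return base_duration / duration


for name, (duration, tput) in RUNS.items():
    print(f"{name:26s} {duration:8.2f} s  {tput:10.2f} tok/s  x{speedup(name):.2f}")
```

On these numbers both warm runs finish roughly 1.57x faster than the no-cache baseline, while the first runs take about 4 to 5% longer than the baseline. PegaFlow and HiCache L2 are within measurement noise of each other in both cases.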

MMLU accuracy

First run

Total latency: 162.088 s
Score: 0.753
[METRIC] mmlu_score=0.7533333333333333 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
[METRIC] mmlu_latency=162.08847578521818 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
Writing report to /tmp/mmlu__data_models_GLM-4.7-FP8_.html
{'other': np.float64(0.8405797101449275), 'other:std': np.float64(0.36606756348739383), 'score:std': np.float64(0.4310710176087256), 'stem': np.float64(0.8769230769230769), 'stem:std': np.float64(0.32852548467788645), 'humanities': np.float64(0.5656565656565656), 'humanities:std': np.float64(0.49567047056102215), 'social_sciences': np.float64(0.8208955223880597), 'social_sciences:std': np.float64(0.3834397784676158), 'score': np.float64(0.7533333333333333)}

after flush_cache

Total latency: 156.795 s
Score: 0.760
[METRIC] mmlu_score=0.76 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
[METRIC] mmlu_latency=156.79523031320423 labels={"model": "/data/models/GLM-4.7-FP8/", "eval": "mmlu"}
Writing report to /tmp/mmlu__data_models_GLM-4.7-FP8_.html
{'other': np.float64(0.855072463768116), 'other:std': np.float64(0.35202776236206146), 'score:std': np.float64(0.4270831300812525), 'stem': np.float64(0.8923076923076924), 'stem:std': np.float64(0.3099914104555367), 'humanities': np.float64(0.5656565656565656), 'humanities:std': np.float64(0.49567047056102215), 'social_sciences': np.float64(0.8208955223880597), 'social_sciences:std': np.float64(0.3834397784676158), 'score': np.float64(0.76)}
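To judge whether the two scores differ meaningfully, it helps to express the gap in questions and compare it with the sampling noise. A rough sketch, assuming 300 questions (inferred from 0.7533... being 226/300; the run output does not state the count) and independent 0/1 scoring:

```python
import math

# Scores copied from the two GLM-4.7-FP8 MMLU runs above.
first_run = 0.7533333333333333
after_flush = 0.76

# Assumption: 300 questions, inferred from 0.7533... == 226 / 300.
NUM_QUESTIONS = 300


def standard_error(p: float, n: int) -> float:
    """Standard error of a mean of n independent 0/1 scores."""
    return math.sqrt(p * (1.0 - p) / n)


delta = after_flush - first_run
questions_changed = round(delta * NUM_QUESTIONS)
se = standard_error(first_run, NUM_QUESTIONS)

print(f"delta = {delta:.4f} ({questions_changed} questions)")
print(f"standard error of one run = {se:.4f}")
```

Under these assumptions the gap is two questions and sits well inside one standard error, which is consistent with the cache path not changing accuracy, though it does not prove it.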

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist

Contributor

Summary of Changes

Hello @jimmy-evo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Pegaflow, a new and advanced KV cache backend, into the system, significantly enhancing memory management and performance for large language models. Alongside this major feature, it refines the tracing mechanism to ensure better observability by propagating external trace contexts more effectively. Additionally, a crucial check for local model file integrity has been implemented to prevent runtime errors caused by incomplete Hugging Face model downloads.

Highlights

  • New KV Cache Backend: Introduced Pegaflow as an alternative radix cache backend, offering advanced features for KV cache management.
  • Enhanced Cache Capabilities: Pegaflow supports standalone KV cache offloading and sharing, layerwise transfer, asynchronous device-to-CPU copying, and single-copy storage for MLA (Multi-head Latent Attention) models under tensor parallelism.
  • Trace Context Propagation: Improved the propagation of external trace headers across Engine methods (generate, async_generate, encode, async_encode) and request input structures (GenerateReqInput, EmbeddingReqInput).
  • Model File Integrity Check: Added a utility to verify that all files referenced in .safetensors.index.json for Hugging Face models exist locally, preventing issues with incomplete model downloads.
  • Versioning Configuration: Updated pyproject.toml files to include a git_describe_command for setuptools_scm, which helps in more robust version determination from git tags.
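The model file integrity check in the highlights can be pictured as follows. This is a minimal sketch and not the PR's implementation: the helper name `missing_shards` is hypothetical, and it assumes the usual Hugging Face sharded-checkpoint index, where `weight_map` maps tensor names to shard file names.

```python
import json
import tempfile
from pathlib import Path


def missing_shards(model_dir: Path) -> list[str]:
    """Return shard files named in *.safetensors.index.json that are absent."""
    missing: set[str] = set()
    for index_path in model_dir.glob("*.safetensors.index.json"):
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        for shard_name in set(weight_map.values()):
            if not (model_dir / shard_name).is_file():
                missing.add(shard_name)
    return sorted(missing)


# Demo on a throwaway directory with one shard present and one missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    index = {
        "weight_map": {
            "a.weight": "model-00001-of-00002.safetensors",
            "b.weight": "model-00002-of-00002.safetensors",
        }
    }
    (root / "model.safetensors.index.json").write_text(json.dumps(index))
    (root / "model-00001-of-00002.safetensors").write_bytes(b"")
    demo_missing = missing_shards(root)

print(demo_missing)  # ['model-00002-of-00002.safetensors']
```

A non-empty result would indicate an incomplete download, which a loader could report up front instead of failing later while reading weights.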


Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/release-docker-npu.yml
    • .github/workflows/release-docker-xeon.yml

@jimmy-evo jimmy-evo force-pushed the feat/pegaflow_adapt branch from b79bf7f to cd4941b Compare January 16, 2026 11:20

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces pegaflow as a new radix cache backend, which can be enabled with the --enable-pegaflow flag. The implementation looks solid. Additionally, the PR includes several other improvements: it refactors trace context propagation for better clarity by using external_trace_header, and enhances the model loading process by adding checks for incomplete local snapshots, which improves robustness. The configuration for setuptools_scm is also updated.

I have one minor suggestion regarding a potential typo in the pegaflow import to improve naming consistency. Overall, these are great additions to the project.

I am having trouble creating individual review comments. Click here to see my feedback.

python/sglang/srt/managers/scheduler.py (693-695)

medium

There seems to be a typo in the import path and class name. The feature is named "pegaflow", but here it's written as "peagflow". For consistency, I suggest renaming peagflow_radix_cache to pegaflow_radix_cache and PeagflowRadixCache to PegaflowRadixCache in the pegaflow library, and updating the import here accordingly.

                from pegaflow.sglang.pegaflow_radix_cache import PegaflowRadixCache

                self.tree_cache = PegaflowRadixCache(
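A misspelled module or class name like the one flagged above only surfaces when the import runs. One way to make that failure easier to diagnose is a guarded import. The helper below is a hypothetical sketch, not code from this PR, and it is demonstrated with a standard-library module so that it runs without pegaflow installed.

```python
import importlib


def load_cache_class(module_path: str, class_name: str):
    """Import an optional backend class, failing with an actionable message."""
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise ImportError(
            f"Optional backend module '{module_path}' is not installed"
        ) from exc
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(
            f"'{module_path}' has no class '{class_name}'; check the spelling"
        ) from exc


# Stand-in demo with a stdlib module. With the names suggested in the review,
# the call would be:
# load_cache_class("pegaflow.sglang.pegaflow_radix_cache", "PegaflowRadixCache")
ordered_dict_cls = load_cache_class("collections", "OrderedDict")
print(ordered_dict_cls.__name__)  # OrderedDict
```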

@hzh0425

hzh0425 commented Jan 16, 2026

Collaborator

Could you please paste the performance benchmark comparison results?

@jimmy-evo

Contributor Author

/gemini summary

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@jimmy-evo

jimmy-evo commented Jan 26, 2026

Contributor Author

Could you please paste the performance benchmark comparison results?

@hzh0425
I have updated the PR with benchmark results. PegaFlow can save a lot of memory for the cache.
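The memory claim can be checked against the figures in the PR description. A back-of-the-envelope sketch, assuming `--hicache-size 80` is a per-rank size in GB (which matches the description's "at least 640gb" at TP8) and that the 500 GB PegaFlow pool is shared by all ranks:

```python
# Figures from the PR description: HiCache ran with --hicache-size 80 and
# --tp-size 8, PegaFlow with a 500 GB pool.
HICACHE_SIZE_GB_PER_RANK = 80  # assumption: --hicache-size is per TP rank
TP_SIZE = 8
PEGAFLOW_POOL_GB = 500

hicache_total_gb = HICACHE_SIZE_GB_PER_RANK * TP_SIZE
saved_gb = hicache_total_gb - PEGAFLOW_POOL_GB
saved_fraction = saved_gb / hicache_total_gb

print(f"hicache host memory: {hicache_total_gb} GB")  # 640 GB
print(f"pegaflow pool:       {PEGAFLOW_POOL_GB} GB")
print(f"difference:          {saved_gb} GB ({saved_fraction:.1%})")
```

This compares only the configured pool sizes of the two runs; it does not show how much of each pool the benchmark actually used.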


Labels: dependencies, npu
