
[NPU][Feature] Support kvcache direct transmission between HBM and distributed kv pool #14056

Open

husf1130 wants to merge 31 commits into sgl-project:main from husf1130:br_l3_memcache_v2

Conversation

@husf1130
Contributor

@husf1130 husf1130 commented Nov 27, 2025

Motivation

In the kvcache pooling scenario, the sglang prefix cache feature supports a 3-level caching strategy (L1: HBM, L2: DRAM, L3: distributed storage). For a new request, this involves 2 prefix-match operations and 2 copy operations.
We explore a 2-level caching strategy (L1 and L3) to simplify the process and reduce the prefix-match and copy overhead: each new request then needs only one prefix-match operation and one copy operation.

Modifications

  1. This PR introduces HiRadixCacheDirect and CacheControllerDirect to support the 2-level caching strategy. KV cache is loaded synchronously from storage into HBM while the prefill batch is being built, and written back asynchronously from HBM to storage when the request finishes (see the sketch below).
  2. A new launch parameter --enable-hierarchical-cache-direct enables the 2-level caching strategy.
  3. For the 2-level caching strategy in disaggregation mode, the decode node supports kvcache offload so the kvcache can be reused in multi-turn scenarios.
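
A minimal sketch of the intended flow, assuming a hypothetical controller interface; the class, method, and store-API names below are illustrative and do not mirror the actual HiRadixCacheDirect / CacheControllerDirect code:

# Hypothetical sketch only: synchronous L3 -> HBM load while building the
# prefill batch, and asynchronous HBM -> L3 write-back when a request finishes.
import queue
import threading


class DirectCacheControllerSketch:
    def __init__(self, l3_store):
        self.l3_store = l3_store              # distributed kv pool (L3)
        self.write_queue = queue.Queue()      # pending HBM -> L3 write-backs
        threading.Thread(target=self._write_worker, daemon=True).start()

    def load_for_prefill(self, prefix_keys, hbm_dst):
        # One prefix match against L3 and one direct copy into HBM,
        # done synchronously while the prefill batch is assembled.
        matched = self.l3_store.prefix_match(prefix_keys)
        if matched:
            self.l3_store.batch_get(matched, hbm_dst)
        return len(matched)

    def offload_on_finish(self, keys, hbm_src):
        # Request finished: queue an asynchronous HBM -> L3 write-back.
        self.write_queue.put((keys, hbm_src))

    def _write_worker(self):
        while True:
            keys, hbm_src = self.write_queue.get()
            self.l3_store.batch_set(keys, hbm_src)

In this sketch the scheduler would call load_for_prefill once per new request and offload_on_finish once when it finishes, so the L1 HBM radix tree remains the only on-device index.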

Upgraded Hierarchical Cache Design in SGLang v1

Accuracy Tests

Prefix cache testing:
Since standard GSM8K requests are too short, they are not well suited for benchmarking prefix cache performance. We designed a simple script to evaluate the accuracy of the prefix cache: it runs a 3.5k-token dataset of 100 samples twice, against 2 sglang instances that share one memcache pool (output length 4 for both runs), saves the generated tokens from both runs, and finally compares them.
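
A minimal sketch of such a comparison script, assuming the sglang native /generate HTTP endpoint; the dataset file name, JSON field names, and ports are illustrative:

import json
import time

import requests


def run(endpoint, prompts):
    # Send each prompt with greedy sampling and a fixed output length of 4.
    ids, start = [], time.time()
    for p in prompts:
        r = requests.post(
            f"{endpoint}/generate",
            json={"text": p, "sampling_params": {"max_new_tokens": 4, "temperature": 0}},
        )
        ids.extend(r.json()["output_ids"])  # response field name is an assumption
    return ids, time.time() - start


prompts = [json.loads(line)["prompt"] for line in open("dataset-3.5k-100.jsonl")][:100]
first_ids, t1 = run("http://127.0.0.1:18001", prompts)    # instance 1
second_ids, t2 = run("http://127.0.0.1:28002", prompts)   # instance 2, shared pool
equal = sum(a == b for a, b in zip(first_ids, second_ids))
print(f"total_tokens={len(first_ids)} equal={equal}, cost {t1:.0f}s / {t2:.0f}s")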

Below are our test results. The outputs from both runs were identical, and the execution time clearly indicates that the prefix cache was effective.

# the first instance results: 
first_ids=[323, 293, 2050, 54304, 1558, 432, 1896, 30, 2585, 1753, 11372, 1521, 220, 1260, 8473, 220, 2932, 6696, 279, 53523, 42, 3923, 277, 3937, 19, 3039, 438, 1657, 646, 4139, 220, 17, 43366, 9442, 13, 220, 19, 20, 4115, 419, 4843, 2254, 13, 2585, 6128, 892, 2783, 400, 16, 13, 20, 1817, 84250, 702, 220, 20, 18143, 48719, 15254, 30, 7627, 6649, 25, 30717, 279, 1378, 2849, 30, 264, 2003, 438, 264, 64117, 554, 3643, 264, 8849, 13, 2932, 3867, 2326, 2220, 333, 16896, 374, 1431, 220, 18, 389, 7589, 13, 5301, 15, 15, 5851, 30, 315, 279, 2311, 30, 11, 323, 1045, 14697, 279, 17438, 3589, 30, 220, 4636, 220, 21, 279, 8411, 13, 2585, 525, 279, 25236, 30, 1635, 504, 1431, 909, 13, 3555, 374, 862, 1899, 311, 4227, 323, 14961, 18762, 13, 2585, 1091, 4279, 315, 42570, 4420, 11, 566, 12205, 6278, 518, 220, 19, 78, 7289, 1558, 566, 220, 18, 4115, 279, 315, 4628, 3040, 3039, 748, 12167, 30, 93757, 279, 25105, 11, 369, 4429, 323, 34561, 862, 323, 498, 614, 2669, 13, 15, 15, 304, 94035, 311, 8239, 1817, 52001, 1573, 1340, 19383, 19818, 429, 2783, 400, 9666, 1521, 1340, 6851, 220, 23, 8153, 11, 389, 18805, 817, 2003, 16, 15, 8756, 817, 19724, 6467, 11, 1246, 220, 2585, 1753, 803, 17, 13, 2585, 1657, 17, 326, 965, 573, 311, 279, 9508, 30, 323, 279, 4287, 10855, 220, 18095, 1083, 3694, 323, 220, 21, 22, 17, 525, 20282, 323, 311, 633, 432, 52042, 9775, 27681, 387, 30, 315, 279, 1042, 875, 220, 19, 50122, 315, 311, 4845, 323, 15804, 2790, 11, 1246, 1657, 315, 976, 3253, 5356, 57937, 594, 2906, 13, 42214, 1191, 448, 30, 400, 16, 20, 13, 10917, 438, 1657, 11221, 1558, 39039, 4828, 30, 220, 18, 3951, 30, 323, 2148, 263, 3473, 9278, 35108, 686, 614, 311, 5395, 279, 41189, 34089, 42570, 323, 6798, 15, 476, 304, 14185, 4990, 279, 5562, 311, 19, 92866, 315, 3015, 23, 311, 56581, 264, 1091, 220, 19, 3039, 38025, 10779, 220, 18, 13, 2379, 2765, 220, 20161, 287, 2474, 566, 4843, 882, 11, 432, 553, 220, 16, 15, 5227, 4747, 4559, 30, 21, 4780, 13, 22062, 518, 279, 54462, 11, 13, 1416, 69331, 702, 13, 28693, 374, 220, 304, 220, 18, 23, 10856, 1372, 315, 6753, 14, 20, 525, 13007, 41298, 45398, 220, 24, 807, 9052, 4279, 862, 1033, 13214, 220, 18, 220, 16, 281, 15521] 
cost 71s

# the second instance results: 
second_ids=[323, 293, 2050, 54304, 1558, 432, 1896, 30, 2585, 1753, 11372, 1521, 220, 1260, 8473, 220, 2932, 6696, 279, 53523, 42, 3923, 277, 3937, 19, 3039, 438, 1657, 646, 4139, 220, 17, 43366, 9442, 13, 220, 19, 20, 4115, 419, 4843, 2254, 13, 2585, 6128, 892, 2783, 400, 16, 13, 20, 1817, 84250, 702, 220, 20, 18143, 48719, 15254, 30, 7627, 6649, 25, 30717, 279, 1378, 2849, 30, 264, 2003, 438, 264, 64117, 554, 3643, 264, 8849, 13, 2932, 3867, 2326, 2220, 333, 16896, 374, 1431, 220, 18, 389, 7589, 13, 5301, 15, 15, 5851, 30, 315, 279, 2311, 30, 11, 323, 1045, 14697, 279, 17438, 3589, 30, 220, 4636, 220, 21, 279, 8411, 13, 2585, 525, 279, 25236, 30, 1635, 504, 1431, 909, 13, 3555, 374, 862, 1899, 311, 4227, 323, 14961, 18762, 13, 2585, 1091, 4279, 315, 42570, 4420, 11, 566, 12205, 6278, 518, 220, 19, 78, 7289, 1558, 566, 220, 18, 4115, 279, 315, 4628, 3040, 3039, 748, 12167, 30, 93757, 279, 25105, 11, 369, 4429, 323, 34561, 862, 323, 498, 614, 2669, 13, 15, 15, 304, 94035, 311, 8239, 1817, 52001, 1573, 1340, 19383, 19818, 429, 2783, 400, 9666, 1521, 1340, 6851, 220, 23, 8153, 11, 389, 18805, 817, 2003, 16, 15, 8756, 817, 19724, 6467, 11, 1246, 220, 2585, 1753, 803, 17, 13, 2585, 1657, 17, 326, 965, 573, 311, 279, 9508, 30, 323, 279, 4287, 10855, 220, 18095, 1083, 3694, 323, 220, 21, 22, 17, 525, 20282, 323, 311, 633, 432, 52042, 9775, 27681, 387, 30, 315, 279, 1042, 875, 220, 19, 50122, 315, 311, 4845, 323, 15804, 2790, 11, 1246, 1657, 315, 976, 3253, 5356, 57937, 594, 2906, 13, 42214, 1191, 448, 30, 400, 16, 20, 13, 10917, 438, 1657, 11221, 1558, 39039, 4828, 30, 220, 18, 3951, 30, 323, 2148, 263, 3473, 9278, 35108, 686, 614, 311, 5395, 279, 41189, 34089, 42570, 323, 6798, 15, 476, 304, 14185, 4990, 279, 5562, 311, 19, 92866, 315, 3015, 23, 311, 56581, 264, 1091, 220, 19, 3039, 38025, 10779, 220, 18, 13, 2379, 2765, 220, 20161, 287, 2474, 566, 4843, 882, 11, 432, 553, 220, 16, 15, 5227, 4747, 4559, 30, 21, 4780, 13, 22062, 518, 279, 54462, 11, 13, 1416, 69331, 702, 13, 28693, 374, 220, 304, 220, 18, 23, 10856, 1372, 315, 6753, 14, 20, 525, 13007, 41298, 45398, 220, 24, 807, 9052, 4279, 862, 1033, 13214, 220, 18, 220, 16, 281, 15521] 
cost 22s


total_tokens=800 equal*2=800

We also tested GSM8K accuracy:
python test_accuracy_gsm8k.py

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [08:22<00:00,  5.02s/it]
Accuracy: 0.960
Invalid: 0.000
Latency: 507.977 s
Output throughput: 16.877 token/s
metrics={'accuracy': np.float64(0.96), 'invalid': np.float64(0.0), 'latency': 507.97703336001723, 'output_throughput': 16.876747248382152}
metrics['accuracy']=np.float64(0.96)

Benchmarking and Profiling

We mainly tested two models: DeepSeek-R1_w8a8 and Qwen3-32B.

In summary, for our test cases with the 3.5k-input / 1-output-token dataset, compared to the baseline with L1 and L3 caching disabled:
For Qwen3-32B, TTFT was reduced by 93.5% (100% reuse rate) and 46.6% (50% reuse rate) in an A3 two-instance environment.
For DeepSeek-R1_w8a8, TTFT was reduced by ~75% (100% reuse rate) and 39% (50% reuse rate) in an A3 single-instance environment.
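(These percentages follow directly from the mean TTFT figures reported below: for Qwen3-32B at 100% reuse, 1 - 294.94 / 4550.69 ≈ 93.5%; for DeepSeek-R1_w8a8 at 100% reuse, 1 - 718.88 / 2925.44 ≈ 75.4%.)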

Below is the test process for DeepSeek R1:

Test method for DeepSeek R1:

# 1st, launch meta service of memcache
source /usr/local/memfabric_hybrid/set_env.sh
export MMC_META_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-meta.conf
/usr/local/memfabric_hybrid/latest/aarch64-linux/bin/mmc_meta_service &

# 2nd, launch sglang server with memcache
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-local.conf
python -m sglang.launch_server \
    --host 127.0.0.1 \
    --port 18001 \
    --context-length 3800 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --max-prefill-tokens 30400 \
    --chunked-prefill-size 57344 \
    --quantization w8a8_int8 \
    --model-path /data/DeepSeek-R1_w8a8/ \
    --tp-size 16 \
    --base-gpu-id 0 \
    --mem-fraction-static 0.85 \
    --log-level info \
    --enable-hierarchical-cache \
    --hicache-storage-backend memcache &

# 3rd, benchmark command
# run the 100-request benchmark twice: the first run constructs the cache, the second tests 100% reuse
python3 -m sglang.bench_serving \
    --dataset-path /home/xxx/GSM8K-in3584-bs200-100.jsonl \
    --dataset-name gsm8k \
    --backend sglang \
    --host 127.0.0.1 \
    --port 18001 \
    --max-concurrency 8 \
    --random-output-len 1 \
    --random-input-len 3584 \
    --num-prompts 100


python3 -m sglang.bench_serving \
    --dataset-path /home/xxx/GSM8K-in3584-bs200-100.jsonl \
    --dataset-name gsm8k \
    --backend sglang \
    --host 127.0.0.1 \
    --port 18001 \
    --max-concurrency 8 \
    --random-output-len 1 \
    --random-input-len 3584 \
    --num-prompts 100


# run 200 requests to test a 50% reuse rate
python3 -m sglang.bench_serving \
    --dataset-path /home/xxx/GSM8K-in3584-bs200-100.jsonl \
    --dataset-name gsm8k \
    --backend sglang \
    --host 127.0.0.1 \
    --port 18001 \
    --max-concurrency 8 \
    --random-output-len 1 \
    --random-input-len 3584 \
    --num-prompts 200

Test results for DeepSeek R1 50% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  23.49
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              4.26
Input token throughput (tok/s):          65688.91
Output token throughput (tok/s):         4.26
Total token throughput (tok/s):          65693.16
Concurrency:                             7.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1801.03
Median E2E Latency (ms):                 1593.83
---------------Time to First Token----------------
Mean TTFT (ms):                          1801.02
Median TTFT (ms):                        1593.82
P99 TTFT (ms):                           2988.38
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Test results for DeepSeek R1 100% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  9.37
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              10.67
Input token throughput (tok/s):          164691.04
Output token throughput (tok/s):         10.67
Total token throughput (tok/s):          164701.72
Concurrency:                             7.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   718.89
Median E2E Latency (ms):                 733.09
---------------Time to First Token----------------
Mean TTFT (ms):                          718.88
Median TTFT (ms):                        733.08
P99 TTFT (ms):                           848.01
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Baseline data with L1 and L3 disabled for DeepSeek R1:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  73.51
Total input tokens:                      3085580
Total generated tokens:                  200
Total generated tokens (retokenized):    200
Request throughput (req/s):              2.72
Input token throughput (tok/s):          41973.67
Output token throughput (tok/s):         2.72
Total token throughput (tok/s):          41976.40
Concurrency:                             7.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2925.45
Median E2E Latency (ms):                 2905.60
---------------Time to First Token----------------
Mean TTFT (ms):                          2925.44
Median TTFT (ms):                        2905.59
P99 TTFT (ms):                           3686.25
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Below is the test process for Qwen3-32B:

Test method for Qwen3-32B:

# 1st, launch meta service of memcache
source /usr/local/memfabric_hybrid/set_env.sh
export MMC_META_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-meta.conf
/usr/local/memfabric_hybrid/latest/aarch64-linux/bin/mmc_meta_service &

# 2nd, launch sglang-1 with memcache
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PYTHONPATH=/home/xxx/sglang/python/:$PYTHONPATH
export MMC_LOCAL_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-local.conf
python3 -m sglang.launch_server \
    --model-path /data/Qwen3-32B \
    --host 127.0.0.1 \
    --port 18001 \
    --trust-remote-code \
    --tp-size 2 \
    --mem-fraction-static 0.85 \
    --base-gpu-id 12 \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --log-level debug \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --context-length 3800 \
    --chunked-prefill-size 57344 \
    --max-prefill-tokens 30400 \
    --enable-hierarchical-cache \
    --hicache-storage-backend memcache &

# 3rd, launch sglang-2 with memcache
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PYTHONPATH=/home/xxx/sglang/python/:$PYTHONPATH
export MMC_LOCAL_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-local.conf
python3 -m sglang.launch_server \
    --model-path /data/Qwen3-32B \
    --host 127.0.0.1 \
    --port 28002 \
    --trust-remote-code \
    --tp-size 2 \
    --mem-fraction-static 0.85 \
    --base-gpu-id 14 \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --log-level debug \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --context-length 3800 \
    --chunked-prefill-size 57344 \
    --max-prefill-tokens 30400 \
    --enable-hierarchical-cache \
    --hicache-storage-backend memcache &

Test results for Qwen3-32B 50% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  31.75
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              3.15
Input token throughput (tok/s):          48585.79
Output token throughput (tok/s):         3.15
Total token throughput (tok/s):          48588.94
Concurrency:                             7.65
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2430.58
Median E2E Latency (ms):                 2259.21
---------------Time to First Token----------------
Mean TTFT (ms):                          2430.58
Median TTFT (ms):                        2259.20
P99 TTFT (ms):                           4819.60
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Test results for Qwen3-32B 100% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  3.85
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              25.98
Input token throughput (tok/s):          400786.92
Output token throughput (tok/s):         25.98
Total token throughput (tok/s):          400812.90
Concurrency:                             7.66
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   294.95
Median E2E Latency (ms):                 288.79
---------------Time to First Token----------------
Mean TTFT (ms):                          294.94
Median TTFT (ms):                        288.79
P99 TTFT (ms):                           404.74
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Baseline data with L1 and L3 disabled for Qwen3-32B:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  58.22
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              1.72
Input token throughput (tok/s):          26499.73
Output token throughput (tok/s):         1.72
Total token throughput (tok/s):          26501.45
Concurrency:                             7.82
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4550.70
Median E2E Latency (ms):                 4660.88
---------------Time to First Token----------------
Mean TTFT (ms):                          4550.69
Median TTFT (ms):                        4660.87
P99 TTFT (ms):                           4708.71
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Checklist

@github-actions github-actions bot added the documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), and npu labels on Nov 27, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @husf1130, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's KV cache capabilities by introducing a distributed L1+L3 hierarchical caching system optimized for Ascend devices. It integrates Ascend's MemCache as a high-performance L3 storage layer, complete with a dedicated controller and radix cache implementation. The changes enable more efficient memory management and data transfer between device memory and the L3 cache, aiming to improve overall inference performance on Ascend hardware.

Highlights

  • Ascend L1+L3 KVCache Support: Introduced comprehensive support for a distributed L1+L3 KVCache system specifically tailored for Ascend devices, leveraging MemCache as the L3 storage backend.
  • New Ascend-Specific Components: Added AscendHiCacheController and AscendHiRadixCache to manage and integrate the hierarchical caching mechanism for Ascend hardware, extending the existing RadixCache functionality.
  • MemCache Storage Backend: Implemented AscendMemCacheStore as a new storage backend, enabling SGLang to utilize Ascend's MemCache for efficient L3 KV cache operations, including distributed read/write capabilities.
  • Scheduler and Cache Management Integration: Modified the scheduler and cache management policies to incorporate the new hierarchical cache logic, including conditional loading, prefix matching, and buffer handling for Ascend devices.
  • Documentation for MemCache Setup: Included a detailed README.md file outlining the steps to build, install, and configure Ascend MemCache for use with SGLang, covering both MetaService and LocalService setup.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for L1+L3 distributed KVCache on Ascend using memcache, which is a significant new feature. The changes are extensive, touching the cache controllers, scheduling policies, and storage backends, and adding new components like AscendHiCacheController and AscendHiRadixCache for Ascend-specific logic. The overall implementation looks solid. However, I've identified a couple of critical issues that need attention. One is in the eviction logic of AscendHiRadixCache, where backed-up nodes are not being evicted from device memory, potentially leading to memory exhaustion. Another is in AscendMemCacheStore, where the return value of batch_set is inverted, which would cause the caller to misinterpret success as failure. I've also included a minor typing suggestion to improve code quality. Addressing these points will make the PR ready for merging.

Comment thread python/sglang/srt/mem_cache/ascend_radix_cache.py
Comment thread python/sglang/srt/mem_cache/storage/ascend_store/ascend_memcache_store.py Outdated
Comment thread python/sglang/srt/mem_cache/ascend_cache_controller.py Outdated
@husf1130 husf1130 changed the title from "[Feature] Support L1+L3 distributed KVCache(memcache) on Ascend" to "[WIP][Feature] Support L1+L3 distributed KVCache(memcache) on Ascend" on Nov 27, 2025
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 2 times, most recently from 8f0f7c2 to de4e5e6 on November 27, 2025 14:29
@hzh0425
Collaborator

hzh0425 commented Nov 28, 2025

Hi, why do we need to add a new cache controller and a hiradix tree?

@husf1130
Contributor Author

husf1130 commented Nov 28, 2025

Hi, why do we need to add a new cache controller and a hiradix tree?

First of all, thanks for following this PR.

In the kvcache pooling scenario, the sglang prefix cache feature supports a 3-level caching strategy (L1: HBM, L2: DRAM, L3: distributed storage). For a new request, this involves 2 prefix-match operations and 2 copy operations.
We explore a 2-level caching strategy (L1 and L3) to simplify the process and reduce the prefix-match and copy overhead: each new request then needs only one prefix-match operation and one copy operation.

We introduce HiRadixCacheDirect and CacheControllerDirect to support the 2-level caching strategy. KV cache is loaded synchronously from storage into HBM while the prefill batch is being built, and written back to storage from HBM when the request finishes.

Based on the Ascend memcache store, we implemented the new flow and achieved performance improvements.

@husf1130 husf1130 changed the title from "[WIP][Feature] Support L1+L3 distributed KVCache(memcache) on Ascend" to "[WIP][Feature] Support L1+L3 distributed KVCache pool(memcache) on Ascend" on Nov 28, 2025
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 3 times, most recently from 62bc6f3 to 7a58455 on November 29, 2025 02:22
@husf1130 husf1130 changed the title from "[WIP][Feature] Support L1+L3 distributed KVCache pool(memcache) on Ascend" to "[Feature] Support L1+L3 distributed KVCache pool(memcache) on Ascend" on Nov 29, 2025
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 4 times, most recently from 1d84277 to 3b9495c on December 1, 2025 09:48
Comment thread python/sglang/srt/mem_cache/hicache_storage.py Outdated
Comment thread python/sglang/srt/mem_cache/hicache_storage.py Outdated
Comment thread python/sglang/srt/mem_cache/hicache_storage.py Outdated
Comment thread python/sglang/srt/mem_cache/ascend_radix_cache.py Outdated
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 5 times, most recently from f315186 to be359fc on December 2, 2025 05:02
@Cesilina

Hi @husf1130! The changes are encouraging; I had similar ideas in the context of using MoonCake as L3, but they rest on the fact that in the current HiCache implementation two buffers are assembled from 2 * page_size * layers smaller buffers in host memory. I took your code, adapted it to interact with the MoonCake Store, and got InfiniBand link flaps under load due to the large number of micro-operations. If you completely abandon local host memory for the worker, such grouping should probably be done in a separate GPU buffer, sacrificing some of the L1 size.

I tried adding per-page packing and unpacking in a buffer allocated on the GPU. The first version uses plain PyTorch, without a Triton kernel (buffer size ~1 batch, about 3 GB in my setup).
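
For context, a minimal PyTorch-only sketch of the per-page packing/unpacking idea; the shapes, pool layout, and names are illustrative, not the code used in the experiment:

import torch

# Pack scattered KV pages into one contiguous GPU staging buffer so the store
# sees one large transfer instead of many small per-page operations.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_layers, num_pages, page_size, head_dim = 4, 64, 128, 256      # illustrative
kv_pool = torch.randn(num_layers, num_pages, page_size, head_dim, device=device)
page_indices = torch.tensor([3, 17, 42], device=device)           # pages of one request

# pack: gather the request's pages into a contiguous staging buffer
staging = kv_pool.index_select(1, page_indices).contiguous()

# unpack on load: scatter the staging buffer back into the pool slots
kv_pool.index_copy_(1, page_indices, staging)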

I can share my experiments. Experiment setup:

  • 8 workers with Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 with TP1
  • 1 host with 8xH100
  • sglang bench serving with mooncake dataset (num_rounds=1, slowdown factor=1, max input context ~128k tokens)
  • Page size 128
  • In the L1+L2+L3 scenario, L2 is 2x the size of L1, and L3 is sized so that L2+L3 ≈ 1 TB
  • In the L1+L3 scenario, the size of L3 is ~1 TB
[Screenshot 2025-12-14 at 19:13:02] I am glad that we managed to cut the p99 TTFT. I assume the average TTFT degradation can be overcome with little effort. For example, after completing the run I found that the IO operations in the new direct controller appear to run in the same scheduler thread, which could worsen the result; I will retest with the changes.

In general, the best results in this experiment came from combining the approximate kv-cache-aware router (from NVIDIA Dynamo) with the old L1+L2+L3 version:

[Screenshot 2025-12-14 at 19:13:19] My intuition is that fully optimizing the L1+L3 approach, together with kv-cache-aware routing (L1, plus routing across hosts for L3 so the cache inside a host can be reached), would give the best result.

How would you feel about splitting this PR into two, one with the new controller and one with the Ascend-specific part? Then I could contribute the packing part and the MoonCake integration.

Have you tried multi-node setups with these experiments?

@vladnosiv
Contributor

Have you tried multi-node setups with these experiments?

Hi!
I didn't continue the experiments on completely dropping the L2 cache; instead I recently made a PR that uses L2 as a layout-conversion buffer - #20535 - which reduces its size to a few GB per worker.

@Cesilina

Have you tried multi-node setups with these experiments?

Hi! I didn't continue the experiments on completely dropping the L2 cache; instead I recently made a PR that uses L2 as a layout-conversion buffer - #20535 - which reduces its size to a few GB per worker.

Sounds like a solid plan. Good job on the optimization! I’ll go review PR #20535 now.


Labels

documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), npu


9 participants