
[NPU][Feature] Support kvcache direct transmission between HBM and distributed kv pool #14056

Open

husf1130 wants to merge 31 commits into sgl-project:main from husf1130:br_l3_memcache_v2

Conversation

@husf1130
Contributor

@husf1130 husf1130 commented Nov 27, 2025

Motivation

In the kvcache pooling scenario, the sglang prefix cache feature supports a 3-level caching strategy (L1: HBM, L2: DRAM, L3: distributed storage). For a new request, this involves 2 prefix-match operations and 2 copy operations.
We explore a 2-level caching strategy (L1 and L3) to simplify the process and reduce the prefix-match and copy overhead: each new request then needs only one prefix-match operation and one copy operation.

Modifications

  1. This PR introduces HiRadixCacheDirect and CacheControllerDirect to support the 2-level caching strategy. KV cache is loaded synchronously from storage into HBM while the prefill batch is being built, and written back asynchronously from HBM to storage when the request finishes (see the sketch below).
  2. A new launch parameter --enable-hierarchical-cache-direct enables the 2-level caching strategy.
  3. For the 2-level caching strategy in disaggregation mode, the decode node supports kvcache offload so the kvcache can be reused in multi-turn scenarios.
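
A minimal sketch of the intended flow, assuming a hypothetical controller interface; the class, method, and store-API names below are illustrative and do not mirror the actual HiRadixCacheDirect / CacheControllerDirect code:

# Hypothetical sketch only: synchronous L3 -> HBM load while building the
# prefill batch, and asynchronous HBM -> L3 write-back when a request finishes.
import queue
import threading


class DirectCacheControllerSketch:
    def __init__(self, l3_store):
        self.l3_store = l3_store              # distributed kv pool (L3)
        self.write_queue = queue.Queue()      # pending HBM -> L3 write-backs
        threading.Thread(target=self._write_worker, daemon=True).start()

    def load_for_prefill(self, prefix_keys, hbm_dst):
        # One prefix match against L3 and one direct copy into HBM,
        # done synchronously while the prefill batch is assembled.
        matched = self.l3_store.prefix_match(prefix_keys)
        if matched:
            self.l3_store.batch_get(matched, hbm_dst)
        return len(matched)

    def offload_on_finish(self, keys, hbm_src):
        # Request finished: queue an asynchronous HBM -> L3 write-back.
        self.write_queue.put((keys, hbm_src))

    def _write_worker(self):
        while True:
            keys, hbm_src = self.write_queue.get()
            self.l3_store.batch_set(keys, hbm_src)

In this sketch the scheduler would call load_for_prefill once per new request and offload_on_finish once when it finishes, so the L1 HBM radix tree remains the only on-device index.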

Upgraded Hierarchical Cache Design in SGLang v1

Accuracy Tests

Prefix cache testing:
Since standard GSM8K requests are too short, they are not well suited for benchmarking prefix cache performance. We designed a simple script to evaluate the accuracy of the prefix cache: it runs a 3.5k-token dataset of 100 samples twice, against 2 sglang instances that share one memcache pool (output length 4 for both runs), saves the generated tokens from both runs, and finally compares them.
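
A minimal sketch of such a comparison script, assuming the sglang native /generate HTTP endpoint; the dataset file name, JSON field names, and ports are illustrative:

import json
import time

import requests


def run(endpoint, prompts):
    # Send each prompt with greedy sampling and a fixed output length of 4.
    ids, start = [], time.time()
    for p in prompts:
        r = requests.post(
            f"{endpoint}/generate",
            json={"text": p, "sampling_params": {"max_new_tokens": 4, "temperature": 0}},
        )
        ids.extend(r.json()["output_ids"])  # response field name is an assumption
    return ids, time.time() - start


prompts = [json.loads(line)["prompt"] for line in open("dataset-3.5k-100.jsonl")][:100]
first_ids, t1 = run("http://127.0.0.1:18001", prompts)    # instance 1
second_ids, t2 = run("http://127.0.0.1:28002", prompts)   # instance 2, shared pool
equal = sum(a == b for a, b in zip(first_ids, second_ids))
print(f"total_tokens={len(first_ids)} equal={equal}, cost {t1:.0f}s / {t2:.0f}s")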

Below are our test results. The outputs from both runs were identical, and the execution time clearly indicates that the prefix cache was effective.

# the first instance results: 
first_ids=[323, 293, 2050, 54304, 1558, 432, 1896, 30, 2585, 1753, 11372, 1521, 220, 1260, 8473, 220, 2932, 6696, 279, 53523, 42, 3923, 277, 3937, 19, 3039, 438, 1657, 646, 4139, 220, 17, 43366, 9442, 13, 220, 19, 20, 4115, 419, 4843, 2254, 13, 2585, 6128, 892, 2783, 400, 16, 13, 20, 1817, 84250, 702, 220, 20, 18143, 48719, 15254, 30, 7627, 6649, 25, 30717, 279, 1378, 2849, 30, 264, 2003, 438, 264, 64117, 554, 3643, 264, 8849, 13, 2932, 3867, 2326, 2220, 333, 16896, 374, 1431, 220, 18, 389, 7589, 13, 5301, 15, 15, 5851, 30, 315, 279, 2311, 30, 11, 323, 1045, 14697, 279, 17438, 3589, 30, 220, 4636, 220, 21, 279, 8411, 13, 2585, 525, 279, 25236, 30, 1635, 504, 1431, 909, 13, 3555, 374, 862, 1899, 311, 4227, 323, 14961, 18762, 13, 2585, 1091, 4279, 315, 42570, 4420, 11, 566, 12205, 6278, 518, 220, 19, 78, 7289, 1558, 566, 220, 18, 4115, 279, 315, 4628, 3040, 3039, 748, 12167, 30, 93757, 279, 25105, 11, 369, 4429, 323, 34561, 862, 323, 498, 614, 2669, 13, 15, 15, 304, 94035, 311, 8239, 1817, 52001, 1573, 1340, 19383, 19818, 429, 2783, 400, 9666, 1521, 1340, 6851, 220, 23, 8153, 11, 389, 18805, 817, 2003, 16, 15, 8756, 817, 19724, 6467, 11, 1246, 220, 2585, 1753, 803, 17, 13, 2585, 1657, 17, 326, 965, 573, 311, 279, 9508, 30, 323, 279, 4287, 10855, 220, 18095, 1083, 3694, 323, 220, 21, 22, 17, 525, 20282, 323, 311, 633, 432, 52042, 9775, 27681, 387, 30, 315, 279, 1042, 875, 220, 19, 50122, 315, 311, 4845, 323, 15804, 2790, 11, 1246, 1657, 315, 976, 3253, 5356, 57937, 594, 2906, 13, 42214, 1191, 448, 30, 400, 16, 20, 13, 10917, 438, 1657, 11221, 1558, 39039, 4828, 30, 220, 18, 3951, 30, 323, 2148, 263, 3473, 9278, 35108, 686, 614, 311, 5395, 279, 41189, 34089, 42570, 323, 6798, 15, 476, 304, 14185, 4990, 279, 5562, 311, 19, 92866, 315, 3015, 23, 311, 56581, 264, 1091, 220, 19, 3039, 38025, 10779, 220, 18, 13, 2379, 2765, 220, 20161, 287, 2474, 566, 4843, 882, 11, 432, 553, 220, 16, 15, 5227, 4747, 4559, 30, 21, 4780, 13, 22062, 518, 279, 54462, 11, 13, 1416, 69331, 702, 13, 28693, 374, 220, 304, 220, 18, 23, 10856, 1372, 315, 6753, 14, 20, 525, 13007, 41298, 45398, 220, 24, 807, 9052, 4279, 862, 1033, 13214, 220, 18, 220, 16, 281, 15521] 
cost 71s

# the second instance results: 
second_ids=[323, 293, 2050, 54304, 1558, 432, 1896, 30, 2585, 1753, 11372, 1521, 220, 1260, 8473, 220, 2932, 6696, 279, 53523, 42, 3923, 277, 3937, 19, 3039, 438, 1657, 646, 4139, 220, 17, 43366, 9442, 13, 220, 19, 20, 4115, 419, 4843, 2254, 13, 2585, 6128, 892, 2783, 400, 16, 13, 20, 1817, 84250, 702, 220, 20, 18143, 48719, 15254, 30, 7627, 6649, 25, 30717, 279, 1378, 2849, 30, 264, 2003, 438, 264, 64117, 554, 3643, 264, 8849, 13, 2932, 3867, 2326, 2220, 333, 16896, 374, 1431, 220, 18, 389, 7589, 13, 5301, 15, 15, 5851, 30, 315, 279, 2311, 30, 11, 323, 1045, 14697, 279, 17438, 3589, 30, 220, 4636, 220, 21, 279, 8411, 13, 2585, 525, 279, 25236, 30, 1635, 504, 1431, 909, 13, 3555, 374, 862, 1899, 311, 4227, 323, 14961, 18762, 13, 2585, 1091, 4279, 315, 42570, 4420, 11, 566, 12205, 6278, 518, 220, 19, 78, 7289, 1558, 566, 220, 18, 4115, 279, 315, 4628, 3040, 3039, 748, 12167, 30, 93757, 279, 25105, 11, 369, 4429, 323, 34561, 862, 323, 498, 614, 2669, 13, 15, 15, 304, 94035, 311, 8239, 1817, 52001, 1573, 1340, 19383, 19818, 429, 2783, 400, 9666, 1521, 1340, 6851, 220, 23, 8153, 11, 389, 18805, 817, 2003, 16, 15, 8756, 817, 19724, 6467, 11, 1246, 220, 2585, 1753, 803, 17, 13, 2585, 1657, 17, 326, 965, 573, 311, 279, 9508, 30, 323, 279, 4287, 10855, 220, 18095, 1083, 3694, 323, 220, 21, 22, 17, 525, 20282, 323, 311, 633, 432, 52042, 9775, 27681, 387, 30, 315, 279, 1042, 875, 220, 19, 50122, 315, 311, 4845, 323, 15804, 2790, 11, 1246, 1657, 315, 976, 3253, 5356, 57937, 594, 2906, 13, 42214, 1191, 448, 30, 400, 16, 20, 13, 10917, 438, 1657, 11221, 1558, 39039, 4828, 30, 220, 18, 3951, 30, 323, 2148, 263, 3473, 9278, 35108, 686, 614, 311, 5395, 279, 41189, 34089, 42570, 323, 6798, 15, 476, 304, 14185, 4990, 279, 5562, 311, 19, 92866, 315, 3015, 23, 311, 56581, 264, 1091, 220, 19, 3039, 38025, 10779, 220, 18, 13, 2379, 2765, 220, 20161, 287, 2474, 566, 4843, 882, 11, 432, 553, 220, 16, 15, 5227, 4747, 4559, 30, 21, 4780, 13, 22062, 518, 279, 54462, 11, 13, 1416, 69331, 702, 13, 28693, 374, 220, 304, 220, 18, 23, 10856, 1372, 315, 6753, 14, 20, 525, 13007, 41298, 45398, 220, 24, 807, 9052, 4279, 862, 1033, 13214, 220, 18, 220, 16, 281, 15521] 
cost 22s


total_tokens=800 equal*2=800

We also tested GSM8K accuracy:
python test_accuracy_gsm8k.py

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [08:22<00:00,  5.02s/it]
Accuracy: 0.960
Invalid: 0.000
Latency: 507.977 s
Output throughput: 16.877 token/s
metrics={'accuracy': np.float64(0.96), 'invalid': np.float64(0.0), 'latency': 507.97703336001723, 'output_throughput': 16.876747248382152}
metrics['accuracy']=np.float64(0.96)

Benchmarking and Profiling

We mainly tested two models: DeepSeek-R1_w8a8 and Qwen3-32B.

In summary, for our test cases with the 3.5k-input / 1-output-token dataset, compared to the baseline with L1 and L3 caching disabled:
For Qwen3-32B, TTFT was reduced by 93.5% (100% reuse rate) and 46.6% (50% reuse rate) in an A3 two-instance environment.
For DeepSeek-R1_w8a8, TTFT was reduced by ~75% (100% reuse rate) and 39% (50% reuse rate) in an A3 single-instance environment.
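(These percentages follow directly from the mean TTFT figures reported below: for Qwen3-32B at 100% reuse, 1 - 294.94 / 4550.69 ≈ 93.5%; for DeepSeek-R1_w8a8 at 100% reuse, 1 - 718.88 / 2925.44 ≈ 75.4%.)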

Below is the test process for DeepSeek R1:

Test method for DeepSeek R1:

# 1st, launch meta service of memcache
source /usr/local/memfabric_hybrid/set_env.sh
export MMC_META_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-meta.conf
/usr/local/memfabric_hybrid/latest/aarch64-linux/bin/mmc_meta_service &

# 2nd, launch sglang server with memcache
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-local.conf
python -m sglang.launch_server \
    --host 127.0.0.1 \
    --port 18001 \
    --context-length 3800 \
    --trust-remote-code \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --max-prefill-tokens 30400 \
    --chunked-prefill-size 57344 \
    --quantization w8a8_int8 \
    --model-path /data/DeepSeek-R1_w8a8/ \
    --tp-size 16 \
    --base-gpu-id 0 \
    --mem-fraction-static 0.85 \
    --log-level info \
    --enable-hierarchical-cache \
    --hicache-storage-backend memcache &

# 3rd, benchmark command
# run the 100-request benchmark twice: the first run constructs the cache, the second tests 100% reuse
python3 -m sglang.bench_serving \
    --dataset-path /home/xxx/GSM8K-in3584-bs200-100.jsonl \
    --dataset-name gsm8k \
    --backend sglang \
    --host 127.0.0.1 \
    --port 18001 \
    --max-concurrency 8 \
    --random-output-len 1 \
    --random-input-len 3584 \
    --num-prompts 100


python3 -m sglang.bench_serving \
    --dataset-path /home/xxx/GSM8K-in3584-bs200-100.jsonl \
    --dataset-name gsm8k \
    --backend sglang \
    --host 127.0.0.1 \
    --port 18001 \
    --max-concurrency 8 \
    --random-output-len 1 \
    --random-input-len 3584 \
    --num-prompts 100


# run 200 requests to test a 50% reuse rate
python3 -m sglang.bench_serving \
    --dataset-path /home/xxx/GSM8K-in3584-bs200-100.jsonl \
    --dataset-name gsm8k \
    --backend sglang \
    --host 127.0.0.1 \
    --port 18001 \
    --max-concurrency 8 \
    --random-output-len 1 \
    --random-input-len 3584 \
    --num-prompts 200

Test results for DeepSeek R1 50% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  23.49
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              4.26
Input token throughput (tok/s):          65688.91
Output token throughput (tok/s):         4.26
Total token throughput (tok/s):          65693.16
Concurrency:                             7.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1801.03
Median E2E Latency (ms):                 1593.83
---------------Time to First Token----------------
Mean TTFT (ms):                          1801.02
Median TTFT (ms):                        1593.82
P99 TTFT (ms):                           2988.38
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Test results for DeepSeek R1 100% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  9.37
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              10.67
Input token throughput (tok/s):          164691.04
Output token throughput (tok/s):         10.67
Total token throughput (tok/s):          164701.72
Concurrency:                             7.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   718.89
Median E2E Latency (ms):                 733.09
---------------Time to First Token----------------
Mean TTFT (ms):                          718.88
Median TTFT (ms):                        733.08
P99 TTFT (ms):                           848.01
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Baseline data with L1 and L3 disabled for DeepSeek R1:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  73.51
Total input tokens:                      3085580
Total generated tokens:                  200
Total generated tokens (retokenized):    200
Request throughput (req/s):              2.72
Input token throughput (tok/s):          41973.67
Output token throughput (tok/s):         2.72
Total token throughput (tok/s):          41976.40
Concurrency:                             7.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2925.45
Median E2E Latency (ms):                 2905.60
---------------Time to First Token----------------
Mean TTFT (ms):                          2925.44
Median TTFT (ms):                        2905.59
P99 TTFT (ms):                           3686.25
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Below is the test process for Qwen3-32B:

Test method for Qwen3-32B:

# 1st, launch meta service of memcache
source /usr/local/memfabric_hybrid/set_env.sh
export MMC_META_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-meta.conf
/usr/local/memfabric_hybrid/latest/aarch64-linux/bin/mmc_meta_service &

# 2nd, launch sglang-1 with memcache
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PYTHONPATH=/home/xxx/sglang/python/:$PYTHONPATH
export MMC_LOCAL_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-local.conf
python3 -m sglang.launch_server \
    --model-path /data/Qwen3-32B \
    --host 127.0.0.1 \
    --port 18001 \
    --trust-remote-code \
    --tp-size 2 \
    --mem-fraction-static 0.85 \
    --base-gpu-id 12 \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --log-level debug \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --context-length 3800 \
    --chunked-prefill-size 57344 \
    --max-prefill-tokens 30400 \
    --enable-hierarchical-cache \
    --hicache-storage-backend memcache &

# 3rd, launch sglang-2 with memcache
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PYTHONPATH=/home/xxx/sglang/python/:$PYTHONPATH
export MMC_LOCAL_CONFIG_PATH=/usr/local/memfabric_hybrid/latest/config/mmc-local.conf
python3 -m sglang.launch_server \
    --model-path /data/Qwen3-32B \
    --host 127.0.0.1 \
    --port 28002 \
    --trust-remote-code \
    --tp-size 2 \
    --mem-fraction-static 0.85 \
    --base-gpu-id 14 \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --log-level debug \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --context-length 3800 \
    --chunked-prefill-size 57344 \
    --max-prefill-tokens 30400 \
    --enable-hierarchical-cache \
    --hicache-storage-backend memcache &

Test results for Qwen3-32B 50% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  31.75
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              3.15
Input token throughput (tok/s):          48585.79
Output token throughput (tok/s):         3.15
Total token throughput (tok/s):          48588.94
Concurrency:                             7.65
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2430.58
Median E2E Latency (ms):                 2259.21
---------------Time to First Token----------------
Mean TTFT (ms):                          2430.58
Median TTFT (ms):                        2259.20
P99 TTFT (ms):                           4819.60
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Test results for Qwen3-32B 100% Reuse Rate:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  3.85
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              25.98
Input token throughput (tok/s):          400786.92
Output token throughput (tok/s):         25.98
Total token throughput (tok/s):          400812.90
Concurrency:                             7.66
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   294.95
Median E2E Latency (ms):                 288.79
---------------Time to First Token----------------
Mean TTFT (ms):                          294.94
Median TTFT (ms):                        288.79
P99 TTFT (ms):                           404.74
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Baseline data with L1 and L3 disabled for Qwen3-32B:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 8
Successful requests:                     100
Benchmark duration (s):                  58.22
Total input tokens:                      1542795
Total generated tokens:                  100
Total generated tokens (retokenized):    100
Request throughput (req/s):              1.72
Input token throughput (tok/s):          26499.73
Output token throughput (tok/s):         1.72
Total token throughput (tok/s):          26501.45
Concurrency:                             7.82
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4550.70
Median E2E Latency (ms):                 4660.88
---------------Time to First Token----------------
Mean TTFT (ms):                          4550.69
Median TTFT (ms):                        4660.87
P99 TTFT (ms):                           4708.71
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Checklist

@github-actions github-actions bot added the documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), and npu labels on Nov 27, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @husf1130, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's KV cache capabilities by introducing a distributed L1+L3 hierarchical caching system optimized for Ascend devices. It integrates Ascend's MemCache as a high-performance L3 storage layer, complete with a dedicated controller and radix cache implementation. The changes enable more efficient memory management and data transfer between device memory and the L3 cache, aiming to improve overall inference performance on Ascend hardware.

Highlights

  • Ascend L1+L3 KVCache Support: Introduced comprehensive support for a distributed L1+L3 KVCache system specifically tailored for Ascend devices, leveraging MemCache as the L3 storage backend.
  • New Ascend-Specific Components: Added AscendHiCacheController and AscendHiRadixCache to manage and integrate the hierarchical caching mechanism for Ascend hardware, extending the existing RadixCache functionality.
  • MemCache Storage Backend: Implemented AscendMemCacheStore as a new storage backend, enabling SGLang to utilize Ascend's MemCache for efficient L3 KV cache operations, including distributed read/write capabilities.
  • Scheduler and Cache Management Integration: Modified the scheduler and cache management policies to incorporate the new hierarchical cache logic, including conditional loading, prefix matching, and buffer handling for Ascend devices.
  • Documentation for MemCache Setup: Included a detailed README.md file outlining the steps to build, install, and configure Ascend MemCache for use with SGLang, covering both MetaService and LocalService setup.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for L1+L3 distributed KVCache on Ascend using memcache, which is a significant new feature. The changes are extensive, touching the cache controllers, scheduling policies, and storage backends, and adding new components like AscendHiCacheController and AscendHiRadixCache for Ascend-specific logic. The overall implementation looks solid. However, I've identified a couple of critical issues that need attention. One is in the eviction logic of AscendHiRadixCache, where backed-up nodes are not being evicted from device memory, potentially leading to memory exhaustion. Another is in AscendMemCacheStore, where the return value of batch_set is inverted, which would cause the caller to misinterpret success as failure. I've also included a minor typing suggestion to improve code quality. Addressing these points will make the PR ready for merging.

Comment thread python/sglang/srt/mem_cache/ascend_radix_cache.py
Comment thread python/sglang/srt/mem_cache/storage/ascend_store/ascend_memcache_store.py Outdated
Comment thread python/sglang/srt/mem_cache/ascend_cache_controller.py Outdated
@husf1130 husf1130 changed the title from "[Feature] Support L1+L3 distributed KVCache(memcache) on Ascend" to "[WIP][Feature] Support L1+L3 distributed KVCache(memcache) on Ascend" on Nov 27, 2025
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 2 times, most recently from 8f0f7c2 to de4e5e6 on November 27, 2025 14:29
@hzh0425
Collaborator

hzh0425 commented Nov 28, 2025

Hi, why do we need to add a new cache controller and a hiradix tree?

@husf1130
Contributor Author

husf1130 commented Nov 28, 2025

Hi, why do we need to add a new cache controller and a hiradix tree?

First of all, thanks for following this PR.

In the kvcache pooling scenario, the sglang prefix cache feature supports a 3-level caching strategy (L1: HBM, L2: DRAM, L3: distributed storage). For a new request, this involves 2 prefix-match operations and 2 copy operations.
We explore a 2-level caching strategy (L1 and L3) to simplify the process and reduce the prefix-match and copy overhead: each new request then needs only one prefix-match operation and one copy operation.

We introduce HiRadixCacheDirect and CacheControllerDirect to support the 2-level caching strategy. KV cache is loaded synchronously from storage into HBM while the prefill batch is being built, and written back to storage from HBM when the request finishes.

Based on the Ascend memcache store, we implemented the new flow and achieved performance improvements.

@husf1130 husf1130 changed the title from "[WIP][Feature] Support L1+L3 distributed KVCache(memcache) on Ascend" to "[WIP][Feature] Support L1+L3 distributed KVCache pool(memcache) on Ascend" on Nov 28, 2025
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 3 times, most recently from 62bc6f3 to 7a58455 on November 29, 2025 02:22
@husf1130 husf1130 changed the title from "[WIP][Feature] Support L1+L3 distributed KVCache pool(memcache) on Ascend" to "[Feature] Support L1+L3 distributed KVCache pool(memcache) on Ascend" on Nov 29, 2025
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 4 times, most recently from 1d84277 to 3b9495c on December 1, 2025 09:48
Comment thread python/sglang/srt/mem_cache/hicache_storage.py Outdated
Comment thread python/sglang/srt/mem_cache/hicache_storage.py Outdated
Comment thread python/sglang/srt/mem_cache/hicache_storage.py Outdated
Comment thread python/sglang/srt/mem_cache/ascend_radix_cache.py Outdated
@husf1130 husf1130 force-pushed the br_l3_memcache_v2 branch 5 times, most recently from f315186 to be359fc on December 2, 2025 05:02
@Cesilina

Hi @husf1130! The changes are encouraging; I had similar ideas in the context of using MoonCake as L3, but they rest on the fact that in the current HiCache implementation two buffers are assembled from 2 * page_size * layers smaller buffers in host memory. I took your code, adapted it to interact with the MoonCake Store, and got InfiniBand link flaps under load due to the large number of micro-operations. If you completely abandon local host memory for the worker, such grouping should probably be done in a separate GPU buffer, sacrificing some of the L1 size.

I tried adding per-page packing and unpacking in a buffer allocated on the GPU. The first version uses plain PyTorch, without a Triton kernel (buffer size ~1 batch, about 3 GB in my setup).
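
For context, a minimal PyTorch-only sketch of the per-page packing/unpacking idea; the shapes, pool layout, and names are illustrative, not the code used in the experiment:

import torch

# Pack scattered KV pages into one contiguous GPU staging buffer so the store
# sees one large transfer instead of many small per-page operations.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_layers, num_pages, page_size, head_dim = 4, 64, 128, 256      # illustrative
kv_pool = torch.randn(num_layers, num_pages, page_size, head_dim, device=device)
page_indices = torch.tensor([3, 17, 42], device=device)           # pages of one request

# pack: gather the request's pages into a contiguous staging buffer
staging = kv_pool.index_select(1, page_indices).contiguous()

# unpack on load: scatter the staging buffer back into the pool slots
kv_pool.index_copy_(1, page_indices, staging)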

I can share my experiments. Experiment setup:

  • 8 workers with Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 with TP1
  • 1 host with 8xH100
  • sglang bench serving with mooncake dataset (num_rounds=1, slowdown factor=1, max input context ~128k tokens)
  • Page size 128
  • In the L1+L2+L3 scenario, L2 is 2x the size of L1, and L3 is sized so that L2+L3 ≈ 1 TB
  • In the L1+L3 scenario, the size of L3 is ~1 TB
[Screenshot 2025-12-14 at 19:13:02] I am glad that we managed to cut the p99 TTFT. I assume the average TTFT degradation can be overcome with little effort. For example, after completing the run I found that the IO operations in the new direct controller appear to run in the same scheduler thread, which could worsen the result; I will retest with the changes.

In general, the best results in this experiment came from combining the approximate kv-cache-aware router (from NVIDIA Dynamo) with the old L1+L2+L3 version:

[Screenshot 2025-12-14 at 19:13:19] My intuition is that fully optimizing the L1+L3 approach, together with kv-cache-aware routing (L1, plus routing across hosts for L3 so the cache inside a host can be reached), would give the best result.

How would you feel about splitting this PR into two, one with the new controller and one with the Ascend-specific part? Then I could contribute the packing part and the MoonCake integration.

Have you tried multi-node setups with these experiments?

@vladnosiv
Contributor

Have you tried multi-node setups with these experiments?

Hi!
I didn't continue the experiments on completely dropping the L2 cache; instead I recently made a PR that uses L2 as a layout-conversion buffer - #20535 - which reduces its size to a few GB per worker.

@Cesilina

Have you tried multi-node setups with these experiments?

Hi! I didn't continue the experiments on completely dropping the L2 cache; instead I recently made a PR that uses L2 as a layout-conversion buffer - #20535 - which reduces its size to a few GB per worker.

Sounds like a solid plan. Good job on the optimization! I’ll go review PR #20535 now.


Labels

documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), npu


9 participants