[Ascend][feature] support L1 + L2 radixcache on ascend #12214

hnyls2002 merged 8 commits into sgl-project:main from prefixcache_ascend
Conversation
Summary of Changes

Hello @khalil2ji3mp6, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the system's compatibility with Ascend NPUs by integrating specialized memory management and attention mechanisms. The changes enable the use of the L1 + L2 radixcache on Ascend, ensuring efficient KV cache operations and proper backend selection for optimal performance on these devices. The modifications include device-agnostic event handling, NPU-specific host memory pools, and automatic configuration of the attention backend for Ascend.
Code Review
This pull request adds support for L1+L2 radix cache on Ascend NPUs. The changes primarily involve making the cache controller and memory management components device-agnostic, introducing Ascend-specific implementations for host-side KV cache management, and updating server arguments accordingly. The approach is sound, but I have identified a critical issue regarding unimplemented abstract methods that could lead to runtime errors, along with some suggestions to improve code maintainability and style. Addressing these points will enhance the robustness and clarity of the implementation.
hnyls2002 left a comment:
@xiezhq-hermann Could you please take a look?
the latest version looks good to me, final suggestion is to have unit tests for the ascend backend, thanks for the effort on iterating the PR :)
```python
SUPPORT_PIN_MEMORY = not _is_npu
if SUPPORT_PIN_MEMORY:
    logger.warning("Current platform not support pin_memory")
```
Review comments on this snippet:

- Do not print annoying messages during the startup.
- The logic is wrong. It actually supports pin_memory.
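A minimal sketch of the corrected logic, per the review comments above (hypothetical, not the merged code; in the real module `_is_npu` comes from a platform-detection helper):

```python
import logging

logger = logging.getLogger(__name__)
_is_npu = False  # assumption: provided by the platform-detection helper in the real module

# Corrected per the review: warn only when pin_memory is NOT supported, and at
# debug level so startup output stays quiet.
SUPPORT_PIN_MEMORY = not _is_npu

if not SUPPORT_PIN_MEMORY:
    logger.debug("Current platform does not support pin_memory")
```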
Motivation
The sglang prefix cache feature already supports a three-level caching strategy (L1: HBM, L2: DRAM, L3: storage) on GPUs. On NPUs, however, the prefix cache is currently not supported. Our plan is therefore to enable L1 + L2 prefix cache support on NPUs in this PR and to add L3 support in the coming weeks.
Modifications
In server_args.py, we introduced two new parameters, `kernel_ascend` and `page_first_kv_split`, based on `hicache_io_backend` and `hicache_mem_layout`. These parameters enable better adaptation to different KV layouts and hardware transfer modes on Ascend devices. A hedged sketch of how such values could be registered is shown below.
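As an illustration only: the new values might be registered as extra choices on the existing hicache flags. Only the names `kernel_ascend` and `page_first_kv_split` come from this PR; the choice lists, defaults, and help strings below are assumptions, not the merged code.

```python
import argparse

parser = argparse.ArgumentParser()

parser.add_argument(
    "--hicache-io-backend",
    type=str,
    choices=["kernel", "direct", "kernel_ascend"],  # kernel_ascend: assumed Ascend transfer mode
    default="kernel",
    help="IO backend for KV cache transfers between device and host.",
)
parser.add_argument(
    "--hicache-mem-layout",
    type=str,
    # page_first_kv_split is assumed to target the split-KV layout used on Ascend.
    choices=["layer_first", "page_first", "page_first_direct", "page_first_kv_split"],
    default="layer_first",
    help="Memory layout of the host KV cache pool.",
)

args = parser.parse_args(["--hicache-io-backend", "kernel_ascend"])
print(args.hicache_io_backend)  # -> kernel_ascend
```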
In memory_pool_host.py, we added support for KV cache transfers between HBM and CPU on Ascend devices. For more details, refer to the following PRs:

In cache_controller.py, we replaced `torch.cuda` with `torch.get_device_module()` to ensure compatibility across different device types. In ascend_backend.py, we implemented a fix to correctly handle the waiting process during KV cache transfers. The device-agnostic transfer pattern is sketched below.
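A minimal sketch of the device-agnostic pattern described above, assuming a recent PyTorch with `torch.get_device_module()` and an accelerator present; the shapes, names, and copy logic are illustrative, not the actual cache_controller.py or memory_pool_host.py code.

```python
import torch

# torch.get_device_module() resolves to torch.cuda on GPUs and torch.npu on
# Ascend NPUs (with torch_npu installed), so one stream/event code path can
# serve both device types.
device_module = torch.get_device_module()  # assumes an accelerator is available
device = "npu" if hasattr(torch, "npu") else "cuda"

transfer_stream = device_module.Stream()
transfer_event = device_module.Event()

# Illustrative HBM -> host copy of one KV page; shape and dtype are placeholders.
device_kv = torch.randn(128, 64, device=device)
host_kv = torch.empty(device_kv.shape, dtype=device_kv.dtype,
                      device="cpu", pin_memory=True)

with device_module.stream(transfer_stream):
    host_kv.copy_(device_kv, non_blocking=True)  # asynchronous device-to-host copy
    transfer_event.record(transfer_stream)

transfer_event.synchronize()  # wait for the copy to finish before using host_kv
```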
Accuracy Tests

Since standard GSM8K requests are too short, they are not well suited as a benchmark for testing prefix cache performance. We designed a simple script to evaluate the accuracy of the prefix cache: it runs a 3.5k dataset with 20 samples twice, saves the generated tokens from both runs, and finally compares them to measure accuracy. A hedged reconstruction of the idea follows.
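The script itself is not included in the PR, so the sketch below only reconstructs the described procedure using sglang's standard /generate HTTP endpoint; the server URL, dataset file name, field names, and sampling parameters are assumptions.

```python
import json

import requests

URL = "http://127.0.0.1:8000/generate"  # assumed server address (see launch commands below)

def run_once(prompts):
    """Send each prompt with greedy sampling and collect the generated text."""
    outputs = []
    for prompt in prompts:
        resp = requests.post(URL, json={
            "text": prompt,
            "sampling_params": {"temperature": 0.0, "max_new_tokens": 256},
        })
        outputs.append(resp.json()["text"])
    return outputs

# Assumed input format: one JSON object per line with a ~3.5k-token "prompt" field.
with open("dataset_3p5k.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f][:20]

first = run_once(prompts)   # cold run: populates the HBM/DRAM prefix cache
second = run_once(prompts)  # warm run: should hit the prefix cache

matches = sum(a == b for a, b in zip(first, second))
print(f"identical outputs: {matches}/{len(prompts)}")
```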
Below are our test results. The outputs from both runs were identical, and the execution time clearly indicates that the prefix cache was effective.
Benchmarking and Profiling
We mainly conducted tests on these two models — DeepSeek-R1_w8a8 and Qwen3-32B — since their KV cache layouts are different. Below are the test results for both models.
In summary, under our test cases (where approximately 30%–40% of requests hit HBM and the rest hit DRAM), the TTFT of Qwen3-32B was reduced by 92.4%, while that of DeepSeek-R1_w8a8 was reduced by 86.9%.
Here is the run command for Qwen3-32B.
```shell
python3 -m sglang.launch_server \
  --model-path /xxx/Qwen3-32B \
  --host xxx \
  --port 8000 \
  --trust-remote-code \
  --tp-size 2 \
  --mem-fraction-static 0.8 \
  --base-gpu-id 6 \
  --attention-backend ascend \
  --device npu \
  --disable-overlap-schedule \
  --log-level debug \
  --disable-cuda-graph \
  --max-running-requests 8 \
  --context-length 3800 \
  --chunked-prefill-size 57344 \
  --max-prefill-tokens 30400 \
  --enable-hierarchical-cache \
  --hicache-write-policy write_through \
  --hicache-ratio 5
```

Run the benchmark twice using a dataset of 100 samples at 3.5k.
First run results:
Second run results:
Here is the run command for DeepSeek-R1_w8a8.
```shell
python -m sglang.launch_server \
  --model-path /xxx/DeepSeek-R1_w8a8 \
  --host xxx \
  --port 8000 \
  --trust-remote-code \
  --tp-size 16 \
  --mem-fraction-static 0.8 \
  --attention-backend ascend \
  --device npu \
  --disable-overlap-schedule \
  --quantization w8a8_int8 \
  --log-level debug \
  --max-running-requests 8 \
  --context-length 3800 \
  --chunked-prefill-size 57344 \
  --max-prefill-tokens 30400 \
  --cuda-graph-bs 1 2 3 4 5 6 7 8 \
  --enable-hierarchical-cache \
  --hicache-write-policy write_through \
  --hicache-io-backend direct \
  --hicache-mem-layout page_first_direct \
  --hicache-ratio 5
```

The benchmark test command is the same as that for Qwen3-32B.
First run results:
Second run results:
Checklist