
[Ascend][feature] support L1+ L2 radixcache on ascend#12214

Merged
hnyls2002 merged 8 commits intosgl-project:mainfrom
khalil2ji3mp6:prefixcache_ascend
Nov 12, 2025

Conversation

@khalil2ji3mp6
Contributor

@khalil2ji3mp6 khalil2ji3mp6 commented Oct 27, 2025

Motivation

The sglang prefix cache feature already supports a three-level caching strategy (L1: HBM, L2: DRAM, L3: storage) on GPUs. On NPUs, however, the prefix cache is currently not supported. This PR therefore first enables L1 + L2 prefix cache support on NPUs; L3 support will follow in the coming weeks.

Modifications

  1. In server_args.py, we introduced two new parameters, kernel_ascend and page_first_kv_split, based on hicache_io_backend and hicache_mem_layout. These parameters enable better adaptation to the different KV layouts and hardware transfer modes on Ascend devices.
  2. In memory_pool_host.py, we added support for KV cache transfers between HBM and CPU on Ascend devices (see the related PRs for more details).
  3. In cache_controller.py, we replaced torch.cuda with torch.get_device_module() to ensure compatibility across different device types.
  4. In ascend_backend.py, we fixed the handling of the waiting process during KV cache transfers.
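The device-agnostic change in point 3 can be sketched as follows. This is an illustrative stand-in, not the actual cache_controller.py code: the fake backend objects and the toy resolver below are hypothetical, and on real hardware the resolution is done by torch.get_device_module(), which returns torch.cuda on GPUs and torch.npu on Ascend.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for torch.cuda / torch.npu: both expose the same
# Event-like interface, so the controller code can stay device-agnostic.
def _make_backend(name):
    return SimpleNamespace(name=name, Event=lambda: f"{name}-event")

_BACKENDS = {"cuda": _make_backend("cuda"), "npu": _make_backend("npu")}

def get_device_module(device_type: str):
    """Toy analogue of torch.get_device_module(): resolve a backend by device type."""
    return _BACKENDS[device_type]

# The controller then creates events the same way on any device:
event = get_device_module("npu").Event()
```

The design point is that call sites never name a specific backend, so the same synchronization code runs unchanged on CUDA and Ascend.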

Accuracy Tests

Since standard GSM8K requests are too short, they are not well-suited as a benchmark for testing prefix cache performance. We therefore designed a simple script to evaluate the accuracy of the prefix cache: it runs a 3.5k-token dataset with 20 samples twice, saves the generated tokens from both runs, and compares them to verify the outputs match.

Below are our test results. The outputs from both runs were identical, and the execution time clearly indicates that the prefix cache was effective.

first_tokens=[2326, 369, 2585, 1657, 16, 20, 18, 3039, 2506, 1105, 311, 2291, 52293, 13, 18865, 1034, 220, 17, 3984, 17090, 18, 15, 10788, 220, 566, 646, 279, 18575, 279, 4453, 315, 7627, 9761, 553, 323, 220, 304, 220, 264, 220] cost 21s
second_tokens=[2326, 369, 2585, 1657, 16, 20, 18, 3039, 2506, 1105, 311, 2291, 52293, 13, 18865, 1034, 220, 17, 3984, 17090, 18, 15, 10788, 220, 566, 646, 279, 18575, 279, 4453, 315, 7627, 9761, 553, 323, 220, 304, 220, 264, 220] cost 5s
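The comparison step described above can be sketched as follows. The actual test script is not included in the PR, so the helper below is a hypothetical minimal version of the check.

```python
def runs_match(first_tokens, second_tokens):
    """True iff both runs produced exactly the same generated token ids."""
    return len(first_tokens) == len(second_tokens) and all(
        a == b for a, b in zip(first_tokens, second_tokens)
    )

# Truncated samples of the saved token ids shown above:
first = [2326, 369, 2585, 1657, 16, 20, 18]
second = [2326, 369, 2585, 1657, 16, 20, 18]
assert runs_match(first, second)
```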

Benchmarking and Profiling

We mainly tested two models, DeepSeek-R1_w8a8 and Qwen3-32B, since their KV cache layouts differ. Below are the test results for both models.

In summary, under our test cases (where approximately 30%–40% of requests hit HBM and the rest hit DRAM), the TTFT of Qwen3-32B was reduced by 92.4%, while that of DeepSeek-R1_w8a8 was reduced by 86.9%.
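The reduction percentages can be reproduced from the mean TTFT values in the benchmark outputs below; a quick arithmetic check:

```python
def ttft_reduction_pct(first_ms: float, second_ms: float) -> float:
    """Percentage reduction in mean TTFT between the cold and warm runs."""
    return (1 - second_ms / first_ms) * 100

# Mean TTFT, first run vs. second run, taken from the result tables:
qwen = ttft_reduction_pct(6741.05, 512.77)      # Qwen3-32B
deepseek = ttft_reduction_pct(5649.40, 738.92)  # DeepSeek-R1_w8a8
print(f"Qwen3-32B: {qwen:.1f}%, DeepSeek-R1_w8a8: {deepseek:.1f}%")
# → Qwen3-32B: 92.4%, DeepSeek-R1_w8a8: 86.9%
```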

Here is the run command for Qwen3-32B.

python3 -m sglang.launch_server \
    --model-path /xxx/Qwen3-32B \
    --host xxx \
    --port 8000 \
    --trust-remote-code \
    --tp-size 2 \
    --mem-fraction-static 0.8 \
    --base-gpu-id 6 \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --log-level debug \
    --disable-cuda-graph \
    --max-running-requests 8 \
    --context-length 3800 \
    --chunked-prefill-size 57344 \
    --max-prefill-tokens 30400 \
    --enable-hierarchical-cache \
    --hicache-write-policy write_through \
    --hicache-ratio 5

Run the benchmark twice using a dataset of 100 samples with ~3.5k input tokens each.

python3 -m sglang.bench_serving \
   --dataset-path /xxx/datasets/GSM8K-in3584-bs200.jsonl \
   --dataset-name gsm8k \
   --backend sglang \
   --host xxx \
   --port 8000 \
   --max-concurrency 8 \
   --random-output-len 1 \
   --random-input-len 3584 \
   --num-prompts 100

First run results:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     100       
Benchmark duration (s):                  86.46     
Total input tokens:                      1542846   
Total generated tokens:                  100       
Total generated tokens (retokenized):    100       
Request throughput (req/s):              1.16      
Input token throughput (tok/s):          17843.79  
Output token throughput (tok/s):         1.16      
Total token throughput (tok/s):          17844.94  
Concurrency:                             7.80      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   6741.08   
Median E2E Latency (ms):                 6956.56   
---------------Time to First Token----------------
Mean TTFT (ms):                          6741.05   
Median TTFT (ms):                        6956.53   
P99 TTFT (ms):                           7044.73   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Second run results:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     100       
Benchmark duration (s):                  6.73      
Total input tokens:                      1542846   
Total generated tokens:                  100       
Total generated tokens (retokenized):    100       
Request throughput (req/s):              14.85     
Input token throughput (tok/s):          229155.30 
Output token throughput (tok/s):         14.85     
Total token throughput (tok/s):          229170.15 
Concurrency:                             7.62      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   512.80    
Median E2E Latency (ms):                 503.73    
---------------Time to First Token----------------
Mean TTFT (ms):                          512.77    
Median TTFT (ms):                        503.69    
P99 TTFT (ms):                           646.71    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Here is the run command for DeepSeek-R1_w8a8.

python -m sglang.launch_server \
    --model-path /xxx/DeepSeek-R1_w8a8 \
    --host xxx \
    --port 8000 \
    --trust-remote-code \
    --tp-size 16 \
    --mem-fraction-static 0.8 \
    --attention-backend ascend \
    --device npu \
    --disable-overlap-schedule \
    --quantization w8a8_int8 \
    --log-level debug \
    --max-running-requests 8 \
    --context-length 3800 \
    --chunked-prefill-size 57344 \
    --max-prefill-tokens 30400 \
    --cuda-graph-bs 1 2 3 4 5 6 7 8 \
    --enable-hierarchical-cache \
    --hicache-write-policy write_through \
    --hicache-io-backend direct \
    --hicache-mem-layout page_first_direct \
    --hicache-ratio 5

The benchmark test command is the same as that of Qwen3-32B.

First run results:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     100       
Benchmark duration (s):                  72.41     
Total input tokens:                      1542790   
Total generated tokens:                  100       
Total generated tokens (retokenized):    100       
Request throughput (req/s):              1.38      
Input token throughput (tok/s):          21305.04  
Output token throughput (tok/s):         1.38      
Total token throughput (tok/s):          21306.42  
Concurrency:                             7.80      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   5649.41   
Median E2E Latency (ms):                 5704.97   
---------------Time to First Token----------------
Mean TTFT (ms):                          5649.40   
Median TTFT (ms):                        5704.96   
P99 TTFT (ms):                           6148.02   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Second run results:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     100       
Benchmark duration (s):                  9.61      
Total input tokens:                      1542790   
Total generated tokens:                  100       
Total generated tokens (retokenized):    100       
Request throughput (req/s):              10.41     
Input token throughput (tok/s):          160529.26 
Output token throughput (tok/s):         10.41     
Total token throughput (tok/s):          160539.66 
Concurrency:                             7.69      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   738.93    
Median E2E Latency (ms):                 750.73    
---------------Time to First Token----------------
Mean TTFT (ms):                          738.92    
Median TTFT (ms):                        750.72    
P99 TTFT (ms):                           796.40    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @khalil2ji3mp6, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's compatibility with Ascend NPUs by integrating specialized memory management and attention mechanisms. The changes enable the use of L1+L2 radixcache on Ascend, ensuring efficient KV cache operations and proper backend selection for optimal performance on these devices. The modifications include device-agnostic event handling, NPU-specific host memory pools, and automatic configuration of the attention backend for Ascend.

Highlights

  • Ascend NPU Integration: Introduced conditional logic to detect Ascend NPUs and adapt memory management and attention mechanisms accordingly, enabling support for L1+L2 radixcache on these devices.
  • Device-Agnostic Event and Stream Handling: Refactored cache_controller.py to use a device-agnostic torch_device (either torch.cuda or torch.npu) for events and streams, improving portability and ensuring correct synchronization on Ascend.
  • Specialized Host Memory Pools for Ascend: Added AscendMHATokenToKVPoolHost and AscendMLATokenToKVPoolHost classes. These new host memory pool implementations handle KV cache operations specifically for Ascend devices, utilizing torch.ops.npu.transfer_kv_dim_exchange for efficient data transfer.
  • Automatic Ascend Attention Backend Selection: Configured the system to automatically select 'ascend' as the decode attention backend when hierarchical cache and kernel backend are enabled on an NPU, ensuring optimal performance for Ascend.
  • KV Buffer Pre-fetching in Ascend Backend: Added explicit calls to retrieve key and value buffers in ascend_backend.py, likely to ensure their availability and proper allocation for Ascend operations.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for L1+L2 radix cache on Ascend NPUs. The changes primarily involve making the cache controller and memory management components device-agnostic, introducing Ascend-specific implementations for host-side KV cache management, and updating server arguments accordingly. The approach is sound, but I have identified a critical issue regarding unimplemented abstract methods that could lead to runtime errors, along with some suggestions to improve code maintainability and style. Addressing these points will enhance the robustness and clarity of the implementation.

@ping1jing2 ping1jing2 changed the title from "[feature] support L1+ L2 radixcache on ascend" to "[Ascend][feature] support L1+ L2 radixcache on ascend" on Oct 27, 2025
Contributor

@husf1130 husf1130 left a comment


finish review

Collaborator

@hnyls2002 hnyls2002 left a comment


@xiezhq-hermann Could you please take a look?

@khalil2ji3mp6 khalil2ji3mp6 force-pushed the prefixcache_ascend branch 2 times, most recently from 975de24 to bd72140, on November 1, 2025 11:52
@khalil2ji3mp6 khalil2ji3mp6 force-pushed the prefixcache_ascend branch 3 times, most recently from 106711b to 8a1bfae, on November 3, 2025 07:59
@iforgetmyname iforgetmyname requested a review from zhyncs as a code owner November 7, 2025 06:52
@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) label Nov 7, 2025
@github-actions github-actions Bot added the hicache (Hierarchical Caching for SGLang) label Nov 10, 2025
@xiezhq-hermann
Collaborator

The latest version looks good to me. My final suggestion is to add unit tests for the ascend backend. Thanks for the effort iterating on the PR : )

@hnyls2002 hnyls2002 merged commit 2d53194 into sgl-project:main Nov 12, 2025
101 of 113 checks passed

SUPPORT_PIN_MEMORY = not _is_npu
if SUPPORT_PIN_MEMORY:
logger.warning("Current platform not support pin_memory")
Contributor


  1. Do not print annoying messages during startup.
  2. The logic is wrong. It actually supports pin_memory.


Labels

documentation (Improvements or additions to documentation), hicache (Hierarchical Caching for SGLang), run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NPU use mooncake as kvcahe but error
[Bug] radix-cache can not work on ascend-npu

9 participants