[Ascend][feature] support L1 + L2 radixcache on ascend #12214

hnyls2002 merged 8 commits into sgl-project:main from prefixcache_ascend
Conversation
Summary of Changes

Hello @khalil2ji3mp6, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the system's compatibility with Ascend NPUs by integrating specialized memory management and attention mechanisms. The changes enable the use of the L1 + L2 radixcache on Ascend, ensuring efficient KV cache operations and proper backend selection for optimal performance on these devices. The modifications include device-agnostic event handling, NPU-specific host memory pools, and automatic configuration of the attention backend for Ascend.
Code Review
This pull request adds support for L1+L2 radix cache on Ascend NPUs. The changes primarily involve making the cache controller and memory management components device-agnostic, introducing Ascend-specific implementations for host-side KV cache management, and updating server arguments accordingly. The approach is sound, but I have identified a critical issue regarding unimplemented abstract methods that could lead to runtime errors, along with some suggestions to improve code maintainability and style. Addressing these points will enhance the robustness and clarity of the implementation.
hnyls2002 left a comment:
@xiezhq-hermann Could you please take a look?
the latest version looks good to me, final suggestion is to have unit tests for the ascend backend, thanks for the effort on iterating the PR :)
```python
SUPPORT_PIN_MEMORY = not _is_npu
if SUPPORT_PIN_MEMORY:
    logger.warning("Current platform not support pin_memory")
```
Review comments on this snippet:

- Do not print annoying messages during the startup.
- The logic is wrong. It actually supports pin_memory.
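A minimal sketch of the corrected logic, per the review comments above (hypothetical, not the merged code; in the real module `_is_npu` comes from a platform-detection helper):

```python
import logging

logger = logging.getLogger(__name__)
_is_npu = False  # assumption: provided by the platform-detection helper in the real module

# Corrected per the review: warn only when pin_memory is NOT supported, and at
# debug level so startup output stays quiet.
SUPPORT_PIN_MEMORY = not _is_npu

if not SUPPORT_PIN_MEMORY:
    logger.debug("Current platform does not support pin_memory")
```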
Motivation
The sglang prefix cache feature already supports a three-level caching strategy (L1: HBM, L2: DRAM, L3: storage) on GPUs. On NPUs, however, the prefix cache is currently not supported. Our plan is therefore to enable L1 + L2 prefix cache support on NPUs in this PR and to add L3 support in the coming weeks.
Modifications
In server_args.py, we introduced two new parameters, `kernel_ascend` and `page_first_kv_split`, based on `hicache_io_backend` and `hicache_mem_layout`. These parameters enable better adaptation to different KV layouts and hardware transfer modes on Ascend devices. A hedged sketch of how such values could be registered is shown below.
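As an illustration only: the new values might be registered as extra choices on the existing hicache flags. Only the names `kernel_ascend` and `page_first_kv_split` come from this PR; the choice lists, defaults, and help strings below are assumptions, not the merged code.

```python
import argparse

parser = argparse.ArgumentParser()

parser.add_argument(
    "--hicache-io-backend",
    type=str,
    choices=["kernel", "direct", "kernel_ascend"],  # kernel_ascend: assumed Ascend transfer mode
    default="kernel",
    help="IO backend for KV cache transfers between device and host.",
)
parser.add_argument(
    "--hicache-mem-layout",
    type=str,
    # page_first_kv_split is assumed to target the split-KV layout used on Ascend.
    choices=["layer_first", "page_first", "page_first_direct", "page_first_kv_split"],
    default="layer_first",
    help="Memory layout of the host KV cache pool.",
)

args = parser.parse_args(["--hicache-io-backend", "kernel_ascend"])
print(args.hicache_io_backend)  # -> kernel_ascend
```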
In memory_pool_host.py, we added support for KV cache transfers between HBM and CPU on Ascend devices. For more details, refer to the following PRs:

In cache_controller.py, we replaced `torch.cuda` with `torch.get_device_module()` to ensure compatibility across different device types. In ascend_backend.py, we implemented a fix to correctly handle the waiting process during KV cache transfers. The device-agnostic transfer pattern is sketched below.
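A minimal sketch of the device-agnostic pattern described above, assuming a recent PyTorch with `torch.get_device_module()` and an accelerator present; the shapes, names, and copy logic are illustrative, not the actual cache_controller.py or memory_pool_host.py code.

```python
import torch

# torch.get_device_module() resolves to torch.cuda on GPUs and torch.npu on
# Ascend NPUs (with torch_npu installed), so one stream/event code path can
# serve both device types.
device_module = torch.get_device_module()  # assumes an accelerator is available
device = "npu" if hasattr(torch, "npu") else "cuda"

transfer_stream = device_module.Stream()
transfer_event = device_module.Event()

# Illustrative HBM -> host copy of one KV page; shape and dtype are placeholders.
device_kv = torch.randn(128, 64, device=device)
host_kv = torch.empty(device_kv.shape, dtype=device_kv.dtype,
                      device="cpu", pin_memory=True)

with device_module.stream(transfer_stream):
    host_kv.copy_(device_kv, non_blocking=True)  # asynchronous device-to-host copy
    transfer_event.record(transfer_stream)

transfer_event.synchronize()  # wait for the copy to finish before using host_kv
```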
Accuracy Tests

Since standard GSM8K requests are too short, they are not well suited as a benchmark for testing prefix cache performance. We designed a simple script to evaluate the accuracy of the prefix cache: it runs a 3.5k dataset with 20 samples twice, saves the generated tokens from both runs, and finally compares them to measure accuracy. A hedged reconstruction of the idea follows.
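The script itself is not included in the PR, so the sketch below only reconstructs the described procedure using sglang's standard /generate HTTP endpoint; the server URL, dataset file name, field names, and sampling parameters are assumptions.

```python
import json

import requests

URL = "http://127.0.0.1:8000/generate"  # assumed server address (see launch commands below)

def run_once(prompts):
    """Send each prompt with greedy sampling and collect the generated text."""
    outputs = []
    for prompt in prompts:
        resp = requests.post(URL, json={
            "text": prompt,
            "sampling_params": {"temperature": 0.0, "max_new_tokens": 256},
        })
        outputs.append(resp.json()["text"])
    return outputs

# Assumed input format: one JSON object per line with a ~3.5k-token "prompt" field.
with open("dataset_3p5k.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f][:20]

first = run_once(prompts)   # cold run: populates the HBM/DRAM prefix cache
second = run_once(prompts)  # warm run: should hit the prefix cache

matches = sum(a == b for a, b in zip(first, second))
print(f"identical outputs: {matches}/{len(prompts)}")
```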
Below are our test results. The outputs from both runs were identical, and the execution time clearly indicates that the prefix cache was effective.
Benchmarking and Profiling
We mainly conducted tests on these two models — DeepSeek-R1_w8a8 and Qwen3-32B — since their KV cache layouts are different. Below are the test results for both models.
In summary, under our test cases (where approximately 30%–40% of requests hit HBM and the rest hit DRAM), the TTFT of Qwen3-32B was reduced by 92.4%, while that of DeepSeek-R1_w8a8 was reduced by 86.9%.
Here is the run command for Qwen3-32B.
```shell
python3 -m sglang.launch_server \
  --model-path /xxx/Qwen3-32B \
  --host xxx \
  --port 8000 \
  --trust-remote-code \
  --tp-size 2 \
  --mem-fraction-static 0.8 \
  --base-gpu-id 6 \
  --attention-backend ascend \
  --device npu \
  --disable-overlap-schedule \
  --log-level debug \
  --disable-cuda-graph \
  --max-running-requests 8 \
  --context-length 3800 \
  --chunked-prefill-size 57344 \
  --max-prefill-tokens 30400 \
  --enable-hierarchical-cache \
  --hicache-write-policy write_through \
  --hicache-ratio 5
```

Run the benchmark twice using a dataset of 100 samples at 3.5k.
First run results:
Second run results:
Here is the run command for DeepSeek-R1_w8a8.
```shell
python -m sglang.launch_server \
  --model-path /xxx/DeepSeek-R1_w8a8 \
  --host xxx \
  --port 8000 \
  --trust-remote-code \
  --tp-size 16 \
  --mem-fraction-static 0.8 \
  --attention-backend ascend \
  --device npu \
  --disable-overlap-schedule \
  --quantization w8a8_int8 \
  --log-level debug \
  --max-running-requests 8 \
  --context-length 3800 \
  --chunked-prefill-size 57344 \
  --max-prefill-tokens 30400 \
  --cuda-graph-bs 1 2 3 4 5 6 7 8 \
  --enable-hierarchical-cache \
  --hicache-write-policy write_through \
  --hicache-io-backend direct \
  --hicache-mem-layout page_first_direct \
  --hicache-ratio 5
```

The benchmark test command is the same as that for Qwen3-32B.
First run results:
Second run results:
Checklist