[NPU][Feature] Support kvcache direct transmission between HBM and distributed kv pool #14056
husf1130 wants to merge 31 commits into sgl-project:main
Conversation
Summary of Changes (from Gemini Code Assist): This pull request enhances SGLang's KV cache capabilities by introducing a distributed L1+L3 hierarchical caching system optimized for Ascend devices. It integrates Ascend's MemCache as a high-performance L3 storage layer, complete with a dedicated controller and radix cache implementation. The changes enable more efficient memory management and data transfer between device memory and the L3 cache, aiming to improve overall inference performance on Ascend hardware.
Code Review
This pull request introduces support for L1+L3 distributed KVCache on Ascend using memcache, which is a significant new feature. The changes are extensive, touching the cache controllers, scheduling policies, and storage backends, and adding new components like AscendHiCacheController and AscendHiRadixCache for Ascend-specific logic. The overall implementation looks solid. However, I've identified a couple of critical issues that need attention. One is in the eviction logic of AscendHiRadixCache, where backed-up nodes are not being evicted from device memory, potentially leading to memory exhaustion. Another is in AscendMemCacheStore, where the return value of batch_set is inverted, which would cause the caller to misinterpret success as failure. I've also included a minor typing suggestion to improve code quality. Addressing these points will make the PR ready for merging.
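To illustrate the batch_set issue the review flags, here is a minimal sketch of the intended contract; the class and method signature are hypothetical stand-ins, not the PR's actual AscendMemCacheStore code:

```python
from typing import List


class MemCacheStoreSketch:
    """Hypothetical stand-in for AscendMemCacheStore, for illustration only."""

    def _set_one(self, key: str, value: bytes) -> bool:
        ...  # write one entry to the pool; True on success

    def batch_set(self, keys: List[str], values: List[bytes]) -> bool:
        failed = [k for k, v in zip(keys, values) if not self._set_one(k, v)]
        # Correct contract: True means every key was stored, so callers can
        # branch on the result directly. The inverted bug flagged above
        # effectively returned the opposite, making success look like failure.
        return not failed
```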
Hi, why do we need to add a new cache controller and a hiradix tree?
Firstly, thanks for following this PR. In the kvcache pooling scenario, SGLang's prefix cache feature supports a 3-level caching strategy (L1: HBM, L2: DRAM, L3: distributed storage), so a new request incurs two prefix-match operations and two copy operations. We introduce HiRadixCacheDirect and CacheControllerDirect to support a 2-level caching strategy: kvcache is loaded synchronously from storage into HBM while the prefill batch is being built, and when a request finishes, its kvcache is written from HBM back to storage. Building on the Ascend memcache store, we implemented these new processes and achieved performance improvements.
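To make the write-back half of that flow concrete, here is a minimal sketch; every name (on_request_finish, mark_backed_up, the batch_set arguments) is an illustrative assumption, not the PR's actual interface:

```python
def on_request_finish(req, hiradix_cache, l3_store):
    token_ids, kv_indices = req.token_ids, req.kv_indices
    # Direct HBM -> L3 transfer, with no intermediate DRAM (L2) staging buffer.
    ok = l3_store.batch_set(token_ids, kv_indices)
    if ok:
        # Mark the prefix as backed up so the radix cache may evict the HBM
        # copy under memory pressure without losing the data.
        hiradix_cache.mark_backed_up(token_ids)
```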
* support decode node kvcache offload for enable_hierarchical_cache_direct
* support kvcache offload for decode node
Have you tried multi-node setups with the relevant experiments?
Hi!
Sounds like a solid plan. Good job on the optimization! I'll go review PR #20535 now.


Motivation
In the kvcache pooling scenario, SGLang's prefix cache feature supports a 3-level caching strategy (L1: HBM, L2: DRAM, L3: distributed storage). For a new request, this incurs two prefix-match operations and two copy operations.
We explore a 2-level caching strategy (L1 and L3) that simplifies the process and reduces the prefix-match and copy overhead: only one prefix-match operation and one copy operation are needed, as sketched below.
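The sketch assumes a hierarchical radix tree that indexes both HBM-resident and L3-backed prefixes in a single lookup; all names here (hiradix_cache, l3_store, kv_pool) are hypothetical, not the PR's actual API:

```python
def prepare_prefill(token_ids, hiradix_cache, l3_store, kv_pool):
    # One prefix-match: the tree knows both where data sits in HBM and
    # which longer prefix is available in the distributed pool (L3).
    hit_len, device_len = hiradix_cache.match_prefix(token_ids)

    if hit_len > device_len:
        # One copy: bring the L3-only suffix straight into device HBM,
        # skipping the DRAM (L2) staging hop of the 3-level design.
        dst = kv_pool.alloc(hit_len - device_len)
        l3_store.batch_get(token_ids[device_len:hit_len], dst)
        hiradix_cache.insert(token_ids[:hit_len], dst)

    # Only tokens beyond hit_len need to be prefilled.
    return hit_len
```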
Modifications
Added the --enable-hierarchical-cache-direct flag to enable the 2-level caching strategy.
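For reference, a launch command might look like the following; the model path is a placeholder, and any memcache-backend configuration flags are deployment-specific and omitted here:

python -m sglang.launch_server --model-path <model> --enable-hierarchical-cache-direct

Accuracy Tests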
Prefix Cache testing:
Since standard GSM8K requests are too short, they are not well suited as a benchmark for prefix cache performance. We designed a simple script to evaluate the accuracy of the prefix cache: it runs a 3.5k-token dataset with 100 samples twice across two SGLang instances that share one memcache pool (output length 4 for both runs), saves the generated tokens from both runs, and compares them to measure accuracy. A minimal sketch of such a check is shown below.
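This sketch uses SGLang's native /generate endpoint with greedy sampling; the port, dataset path, and JSON field names are placeholders for the actual test setup:

```python
import json

import requests


def run_once(prompts, url="http://127.0.0.1:30000/generate"):
    outputs = []
    for p in prompts:
        r = requests.post(url, json={
            "text": p,
            "sampling_params": {"max_new_tokens": 4, "temperature": 0},
        })
        outputs.append(r.json()["text"])
    return outputs


# Dataset path and format are placeholders; each line holds one ~3.5k-token prompt.
prompts = [json.loads(line)["prompt"] for line in open("dataset_3p5k.jsonl")][:100]
first = run_once(prompts)   # cold run: fills the shared memcache pool
second = run_once(prompts)  # warm run: should be served from the prefix cache
assert first == second, "prefix-cache outputs diverged from the cold run"
```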
Below are our test results. The outputs from both runs were identical, and the execution time clearly indicates that the prefix cache was effective.
We also tested GSM8K:
python test_accuracy_gsm8k.py
Benchmarking and Profiling
We mainly conducted tests on two models: DeepSeek-R1_w8a8 and Qwen3-32B.
In summary, under our test cases with the 3.5k+1 dataset, compared to the baseline with L1 and L3 disabled:
For Qwen3-32B, TTFT was reduced by 93.5% (100% reuse rate) and 46.6% (50% reuse rate) in an A3 two-instance environment.
For DeepSeek-R1_w8a8, TTFT was reduced by ~75% (100% reuse rate) and 39% (50% reuse rate) in an A3 single-instance environment.
Below is the test process for DeepSeek R1:
Test method for DeepSeek R1:
Test results for DeepSeek R1 50% Reuse Rate:
Test results for DeepSeek R1 100% Reuse Rate:
Baseline data with L1 and L3 disabled for DeepSeek R1:
Below is the test process for Qwen3-32B:
Test method for Qwen3-32B:
Test results for Qwen3-32B 50% Reuse Rate:
Test results for Qwen3-32B 100% Reuse Rate:
Baseline data with L1 and L3 disabled for Qwen3-32B:
Checklist