Hierarchical Caching supports MLA #4009
Conversation
I approve this change despite some stylistic suggestions, since the refactoring I am working on will touch these changes regardless. @zhyncs can you help move this PR forward? There won't be any impact outside the scope of the feature.
I enabled this feature on DeepSeek V3 and found that it would freeze (TP=16). Please optimize it further. Thank you!
Hi @zeroorhero, can you rebase the PR with the latest main? PR #4082, which fixes TP along with other enhancements, has been merged into main.
OK, I will rebase today.
I just rebased the code and ran it on DeepSeek V3. It seems to be running normally. However, after I enabled this feature, the throughput-related metrics did not change significantly. Am I overlooking something?
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Is this PR running now?
Throughput drops 5% when running DeepSeek V3 int8 on 2 x A100 nodes.
Yep, the current implementation has some major performance bottlenecks due to low I/O efficiency and suboptimal scheduling (the fix for supporting TP also sacrificed scheduling flexibility). Enhancements will be upstreamed gradually, but please contact me if you have an urgent need for this feature in the meantime.
Same for me.
Motivation
I am deeply grateful to @xiezhq-hermann for implementing the Hierarchical Caching feature, which expanded the storage capacity of the KV cache. However, that version only supports MHA, not MLA. This PR introduces support for Hierarchical Caching in the context of MLA.
At present there may be a bug when tp > 1, which @xiezhq-hermann is fixing.
Modifications
I abstracted a base class named BaseTokenToKVPoolHost and implemented two subclasses: MHATokenToKVPoolHost and MLATokenToKVPoolHost.
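The split can be sketched roughly as below. This is a simplified illustration, not the actual SGLang code: the constructor parameters, `token_stride`, and `total_bytes` are hypothetical names, and the real classes manage pinned host tensors rather than just sizes. The key idea it shows is why MLA needs its own host pool: MHA stores full K and V per head, while MLA stores one compressed latent per token regardless of head count.

```python
from abc import ABC, abstractmethod


class BaseTokenToKVPoolHost(ABC):
    """Host-side (CPU) KV buffer shared by both attention variants.

    `size` is the number of token slots; each subclass decides how many
    elements one token occupies per layer.
    """

    def __init__(self, size: int, layer_num: int, dtype_bytes: int = 2):
        self.size = size
        self.layer_num = layer_num
        self.dtype_bytes = dtype_bytes  # e.g. 2 for fp16/bf16

    @abstractmethod
    def token_stride(self) -> int:
        """Elements stored per token per layer."""

    def total_bytes(self) -> int:
        # Total host buffer size for all slots across all layers.
        return self.size * self.layer_num * self.token_stride() * self.dtype_bytes


class MHATokenToKVPoolHost(BaseTokenToKVPoolHost):
    """MHA keeps separate K and V tensors: 2 * head_num * head_dim per token."""

    def __init__(self, size: int, layer_num: int, head_num: int, head_dim: int):
        super().__init__(size, layer_num)
        self.head_num = head_num
        self.head_dim = head_dim

    def token_stride(self) -> int:
        return 2 * self.head_num * self.head_dim


class MLATokenToKVPoolHost(BaseTokenToKVPoolHost):
    """MLA stores one compressed latent per token:
    kv_lora_rank + qk_rope_head_dim elements, independent of head count."""

    def __init__(self, size: int, layer_num: int,
                 kv_lora_rank: int, qk_rope_head_dim: int):
        super().__init__(size, layer_num)
        self.kv_lora_rank = kv_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim

    def token_stride(self) -> int:
        return self.kv_lora_rank + self.qk_rope_head_dim
```

With DeepSeek-style MLA dimensions (kv_lora_rank=512, qk_rope_head_dim=64), the per-token footprint is a flat 576 elements per layer, which is what makes the host pool layout for MLA so much more compact than an MHA pool with many heads.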
Benchmark
enable-hierarchical-cache
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path /data00/models/DeepSeek-V2-Lite-Chat --port 30000 --enable-hierarchical-cache --trust-remote-code --mem-fraction-static 0.4

disable-hierarchical-cache
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path /data00/models/DeepSeek-V2-Lite-Chat --port 30000 --trust-remote-code --mem-fraction-static 0.4

Checklist