Hierarchical Caching supports MLA #4009

Merged
xiezhq-hermann merged 6 commits into sgl-project:main from zeroorhero:main-dev
Mar 14, 2025

Conversation

@zeroorhero
Contributor

@zeroorhero zeroorhero commented Mar 3, 2025

Motivation

I am deeply grateful to @xiezhq-hermann for implementing the Hierarchical Caching feature, which expanded the storage capacity of the KV cache. However, that version only supports MHA, not MLA. This PR adds support for Hierarchical Caching in the context of MLA.
At present there may be a bug when tp > 1, which @xiezhq-hermann is fixing.

Modifications

I abstracted a base class named BaseTokenToKVPoolHost and implemented two subclasses: MHATokenToKVPoolHost and MLATokenToKVPoolHost.

Benchmark

enable-hierarchical-cache

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path /data00/models/DeepSeek-V2-Lite-Chat --port 30000 --enable-hierarchical-cache --trust-remote-code --mem-fraction-static 0.4
(benchmark result screenshot)

disable-hierarchical-cache

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path /data00/models/DeepSeek-V2-Lite-Chat --port 30000 --trust-remote-code --mem-fraction-static 0.4
(benchmark result screenshot)

Checklist

@xiezhq-hermann
Collaborator

I approve this change despite some style-wise suggestions, since the refactoring I am working on will touch these changes regardless. @zhyncs, can you help proceed with this PR? There won't be any impact outside the scope of the feature.

@zhyncs zhyncs mentioned this pull request Mar 4, 2025
@lambert0312
Contributor

I enabled this feature on DeepSeek V3 and found that it freezes (TP=16). Please optimize it further. Thank you!

@xiezhq-hermann
Collaborator

xiezhq-hermann commented Mar 12, 2025

Hi @zeroorhero, can you rebase the PR onto the latest main? PR #4082, which fixes TP along with other enhancements, has been merged into main.

@zeroorhero
Contributor Author

> Hi @zeroorhero, can you rebase the PR onto the latest main? PR #4082, which fixes TP along with other enhancements, has been merged into main.

OK, I will rebase today.

@lambert0312
Contributor

lambert0312 commented Mar 13, 2025

I just rebased the code and ran it on DeepSeek V3. It seems to be running normally.

However, after enabling this feature, the throughput-related metrics did not change significantly. Am I overlooking something?

Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
@xihuai18
Contributor

Is this PR working now?

Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
@zeroorhero zeroorhero requested a review from zhyncs as a code owner March 13, 2025 09:39
@xiezhq-hermann xiezhq-hermann merged commit 0e0ec70 into sgl-project:main Mar 14, 2025
@xihuai18
Contributor

Throughput drops ~5% when running DeepSeek V3 int8 on 2 x A100 nodes.

@xiezhq-hermann
Collaborator

xiezhq-hermann commented Mar 14, 2025

> Throughput drops ~5% when running DeepSeek V3 int8 on 2 x A100 nodes.

Yep, the current implementation has some major performance bottlenecks due to low IO efficiency and suboptimal scheduling (the fix for supporting TP also sacrificed scheduling flexibility). Enhancements will be upstreamed gradually, but please contact me if you have an urgent need for this feature for now.

@lambert0312
Contributor

> Throughput drops ~5% when running DeepSeek V3 int8 on 2 x A100 nodes.

Same for me.
