Hierarchical Caching supports MLA #4009
Conversation
I approve this change despite some stylistic suggestions, since the refactoring I am working on will touch these changes regardless. @zhyncs can you help move this PR forward? There won't be any impact outside the scope of the feature.
I enabled this feature on DeepSeek V3 and found that it would freeze (TP=16). Please optimize it further. Thank you!
Hi @zeroorhero, can you rebase the PR with the latest main? PR #4082, which fixes TP along with other enhancements, has been merged into main.
OK, I will rebase today.
I just rebased the code and ran it on DeepSeek V3. It seems to be running normally. However, after I enabled this feature, the throughput-related metrics did not change significantly. Am I overlooking something?
Signed-off-by: Changqi Lu <luchangqi.123@bytedance.com>
Is this PR running now?
Throughput drops 5% when running DeepSeek V3 int8 on 2 x A100 nodes.
Yep, the current implementation has some major performance bottlenecks due to low I/O efficiency and suboptimal scheduling (the fix for supporting TP also sacrificed scheduling flexibility). Enhancements will be upstreamed gradually, but please contact me if you have an urgent need for this feature in the meantime.
Same for me.
Motivation
I am deeply grateful to @xiezhq-hermann for implementing the Hierarchical Caching feature, which expanded the storage capacity of the KV cache. However, that version only supports MHA, not MLA. This PR introduces support for Hierarchical Caching in the context of MLA.
At present there may be a bug when tp > 1, which @xiezhq-hermann is fixing.
Modifications
I abstracted a base class named BaseTokenToKVPoolHost and implemented two subclasses: MHATokenToKVPoolHost and MLATokenToKVPoolHost.
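The split can be sketched roughly as below. This is a simplified illustration, not the actual SGLang code: the constructor parameters, `token_stride`, and `total_bytes` are hypothetical names, and the real classes manage pinned host tensors rather than just sizes. The key idea it shows is why MLA needs its own host pool: MHA stores full K and V per head, while MLA stores one compressed latent per token regardless of head count.

```python
from abc import ABC, abstractmethod


class BaseTokenToKVPoolHost(ABC):
    """Host-side (CPU) KV buffer shared by both attention variants.

    `size` is the number of token slots; each subclass decides how many
    elements one token occupies per layer.
    """

    def __init__(self, size: int, layer_num: int, dtype_bytes: int = 2):
        self.size = size
        self.layer_num = layer_num
        self.dtype_bytes = dtype_bytes  # e.g. 2 for fp16/bf16

    @abstractmethod
    def token_stride(self) -> int:
        """Elements stored per token per layer."""

    def total_bytes(self) -> int:
        # Total host buffer size for all slots across all layers.
        return self.size * self.layer_num * self.token_stride() * self.dtype_bytes


class MHATokenToKVPoolHost(BaseTokenToKVPoolHost):
    """MHA keeps separate K and V tensors: 2 * head_num * head_dim per token."""

    def __init__(self, size: int, layer_num: int, head_num: int, head_dim: int):
        super().__init__(size, layer_num)
        self.head_num = head_num
        self.head_dim = head_dim

    def token_stride(self) -> int:
        return 2 * self.head_num * self.head_dim


class MLATokenToKVPoolHost(BaseTokenToKVPoolHost):
    """MLA stores one compressed latent per token:
    kv_lora_rank + qk_rope_head_dim elements, independent of head count."""

    def __init__(self, size: int, layer_num: int,
                 kv_lora_rank: int, qk_rope_head_dim: int):
        super().__init__(size, layer_num)
        self.kv_lora_rank = kv_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim

    def token_stride(self) -> int:
        return self.kv_lora_rank + self.qk_rope_head_dim
```

With DeepSeek-style MLA dimensions (kv_lora_rank=512, qk_rope_head_dim=64), the per-token footprint is a flat 576 elements per layer, which is what makes the host pool layout for MLA so much more compact than an MHA pool with many heads.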
Benchmark
enable-hierarchical-cache
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path /data00/models/DeepSeek-V2-Lite-Chat --port 30000 --enable-hierarchical-cache --trust-remote-code --mem-fraction-static 0.4

disable-hierarchical-cache
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path /data00/models/DeepSeek-V2-Lite-Chat --port 30000 --trust-remote-code --mem-fraction-static 0.4

Checklist