
Support radix cache for Lora feature #7216

Merged
hnyls2002 merged 8 commits into sgl-project:main from Fridge003:lora_radix
Aug 11, 2025

Conversation

Fridge003 (Collaborator) commented Jun 15, 2025

Motivation

Close #2880

Modifications

To support radix cache for LoRA, this PR makes the following modifications:

  • Add a new LoRARadixCache as a special version of RadixCache. They are mostly the same, except that a node in LoRARadixCache uses hash(lora_id + str(token_ids[0])) as its key for prefix matching, while a node in RadixCache uses token_ids[0] as its key (see the sketch after this list). I didn't modify the original RadixCache class directly, since supporting a different node key there would require many changes that would clutter the code.
  • Add test_lora_radix_cache for testing the compatibility of LoRA and radix cache. In this test we launch a base model with two different adapters, and the radix cache should be able to differentiate the same prompt tagged with different lora_ids.
  • Enable radix cache by default in all the CI tests
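
A minimal sketch of the key scheme, with illustrative names rather than the exact LoRARadixCache code:

```python
# Illustrative node-key derivation: fold the adapter id into the
# per-node token key so identical prefixes under different adapters
# land on different radix-tree children.

def lora_node_key(lora_id, token_id):
    return hash(str(lora_id) + str(token_id))

base_key = 12345                       # RadixCache: token_ids[0] alone
key_a = lora_node_key("lora1", 12345)  # same first token, adapter lora1
key_b = lora_node_key("lora2", 12345)  # same first token, adapter lora2
assert key_a != key_b                  # the two prefixes no longer collide
```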

Next Steps (#9144)

Currently when enabling lora and radix cache together, some features are not supported:

  • page_size > 1
  • schedule policies other than FCFS
  • hierarchical cache

These should be supported in the future.

The unified paging technique with radix cache also remains to be supported.

Benchmark

Changes in #10092 need to be applied to bench_multiturn.py

# Launching with radix cache
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --lora-paths lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora --port 30001

# Launching without radix cache
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --lora-paths lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora --port 30001 --disable-radix-cache

# Benchmark Commands
python3 benchmark/hicache/bench_multiturn.py --model-path meta-llama/Llama-3.1-8B-Instruct --port 30001 --lora-path lora1

Radix cache disabled:

Performance metrics summary:
  Total requests: 1280 at 16 requests per second
  Average TTFT: 0.53
  P90 TTFT: 1.33
  Median TTFT: 0.31
  Average latency: 10.28
  P90 latency: 15.33
  Median latency: 11.21
  Input token throughput: 16692.69 tokens per second
  Output token throughput: 641.85 tokens per second
  Request Throughput: 10.03 requests per second
  Cache Hit Rate: 0.000000

Radix cache enabled:

Performance metrics summary:
  Total requests: 1280 at 16 requests per second
  Average TTFT: 0.10
  P90 TTFT: 0.18
  Median TTFT: 0.07
  Average latency: 2.91
  P90 latency: 6.42
  Median latency: 2.13
  Input token throughput: 23926.39 tokens per second
  Output token throughput: 920.02 tokens per second
  Request Throughput: 14.38 requests per second
  Cache Hit Rate: 0.625577

gemini-code-assist (Bot) left a comment

Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the radix cache functionality with the LoRA feature by removing the previous restriction that required radix cache to be disabled when LoRA was active. This change allows both features to be used concurrently by default, simplifying configuration and potentially improving performance for LoRA workloads.

Highlights

  • Enable Radix Cache with LoRA: This pull request enables the radix cache feature to be used by default when LoRA is also enabled. Previously, there was an explicit requirement to disable radix cache when using LoRA.
  • Remove Compatibility Assertion: An assertion in the server arguments validation that prevented the combination of LoRA and radix cache has been removed.
  • Update Documentation and Tests: Example command lines in the documentation and test configurations have been updated to remove the --disable-radix-cache flag when using LoRA, reflecting the new default behavior.

gemini-code-assist (Bot) left a comment

Code Review

This pull request successfully enables the combination of radix cache and LoRA features by removing a previous restriction and updating documentation and test cases to reflect the new default behavior. The changes are straightforward and align with the stated goal that the features are orthogonal.

Review comment threads (4) on test/srt/models/lora/utils.py
lifuhuang (Collaborator) commented

Nice! Is there any perf gain when radix cache is enabled? Also, why was it disabled in the first place?

lifuhuang (Collaborator) commented

nit: maybe we should remove this line as well 😄

"--disable-radix-cache",

lifuhuang (Collaborator) left a comment

LGTM

lifuhuang mentioned this pull request Jun 16, 2025
Fridge003 changed the title from "Enable radix cache for Lora feature" to "[WIP] Enable radix cache for Lora feature" Jun 16, 2025
zhyncs (Collaborator) commented Jul 6, 2025

@Fridge003 any update?

Fridge003 (Collaborator, Author) commented

@Fridge003 any update?

Still working on it... Two of the tests cannot pass.

Edenzzzz (Contributor) commented

@Fridge003 Could you point to the failed tests? I see more than two failing.

Fridge003 (Collaborator, Author) commented

@Fridge003 Could you point to the failed tests? I see more than two failing.

test/srt/models/lora/test_lora.py and test/srt/models/lora/test_multi_lora_backend.py. Other tests can actually pass.

Review comment thread on test/srt/models/lora/test_lora.py (outdated)
zhyncs (Collaborator) commented Jul 27, 2025

Baizhou, is this PR ready?

Fridge003 (Collaborator, Author) commented

Baizhou, is this PR ready?

No, I found this feature is not as trivial as I thought; a new memory cache is needed.
I might need another 1~2 weeks.

Edenzzzz (Contributor) commented

No, I found this feature is not as trivial as I thought; a new memory cache is needed. I might need another 1~2 weeks.

Because of multi-lora?

Fridge003 (Collaborator, Author) commented

No, I found this feature is not as trivial as I thought; a new memory cache is needed. I might need another 1~2 weeks.

Because of multi-lora?

Yes, the same prompt might be run on different LoRA adapters.
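
To make this concrete, here is a hedged client-side illustration of the scenario: the same prompt is sent twice with different adapters, and the two requests must not share KV-cache entries. The payload shape follows sglang's /generate API; the port is carried over from the benchmark commands above, and the adapter names are illustrative.

```python
# Same prompt, two adapters: with a token-only radix key, the second
# request would incorrectly reuse the first request's cached prefix.
import requests

prompt = "Write a short fact about the moon."
for adapter in ["lora1", "lora2"]:  # adapter names are illustrative
    resp = requests.post(
        "http://localhost:30001/generate",
        json={
            "text": prompt,
            "lora_path": adapter,  # selects the adapter for this request
            "sampling_params": {"max_new_tokens": 32},
        },
    )
    print(adapter, resp.json()["text"])
```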

Edenzzzz (Contributor) commented

How about storing the output before applying LoRA?

lifuhuang (Collaborator) commented

How about storing the output before applying LoRA?

We do need to cache the LoRA result as well for perf gains; otherwise we would still need to recompute LoRA for each token and layer even when there is a prefix match.

hnyls2002 (Collaborator) commented

@Fridge003 LGTM, not a lora expert, but with two small questions:

  1. Why choose to define the key as hash(lora_id + str(token_ids[0])) instead of a tuple? Just interested in that.
  2. In your summary and the modified code, we can see that most of the changes happen in the KV-matching part. Is there a chance to define an abstraction over the key instead of adding a new radix cache?

hnyls2002 merged commit 75e6a7c into sgl-project:main Aug 11, 2025
175 of 189 checks passed
Fridge003 (Collaborator, Author) commented Aug 12, 2025

@Fridge003 LGTM, not a lora expert, but with two small questions:

  1. Why choose to define the key as hash(lora_id + str(token_ids[0])) instead of a tuple? Just interested in that.
  2. In your summary and the modified code, we can see that most of the changes happen in the KV-matching part. Is there a chance to define an abstraction over the key instead of adding a new radix cache?

@hnyls2002 Really want to thank you for your review~

For your questions:

  1. Changing to a tuple for matching is also OK. The key definition is valid as long as it contains the LoRA information needed for prefix matching; it's just a matter of personal preference (see the sketch below).
  2. Modifying the original radix cache is feasible, but not graceful. I would need to add a lot of extra logic to make it work, which would make future maintenance difficult.
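
For illustration, a minimal sketch of the tuple-based alternative from point 1 (names are illustrative, not sglang's API). Tuples are hashable, so they can serve directly as keys in a node's children table, with no string concatenation or explicit hash() call:

```python
def lora_node_key_tuple(lora_id, token_id):
    # (lora_id, token_id) keeps both fields inspectable, at the cost of
    # slightly larger dictionary keys than a single int hash.
    return (lora_id, token_id)

children = {}  # stand-in for a radix-tree node's children table
children[lora_node_key_tuple("lora1", 42)] = "node_a"
children[lora_node_key_tuple("lora2", 42)] = "node_b"
assert len(children) == 2  # same token, two adapters, two distinct entries
```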

lazyyy77 commented Aug 15, 2025

@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code and here are 2 questions.

  1. Since #7446 (Support dynamic LoRA loading / unloading in engine/server API) has added dynamic loading of LoRA, what will happen to the tree cache if I offload a LoRA adapter that already holds some nodes?
  2. Since multi-LoRA prefix matching is being solved, are there any potential difficulties in adapting this LoraRadixCache to a HiLoraRadixCache? What's more, how about letting HLRCache's "Hi" support both KV nodes and LoRA? I'd like to contribute!

Thanks very much! orz orz orz

Fridge003 (Collaborator, Author) commented Aug 15, 2025

@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code and here are 2 questions.

  1. Since #7446 (Support dynamic LoRA loading / unloading in engine/server API) has added dynamic loading of LoRA, what will happen to the tree cache if I offload a LoRA adapter that already holds some nodes?
  2. Since multi-LoRA prefix matching is being solved, are there any potential difficulties in adapting this LoraRadixCache to a HiLoraRadixCache? What's more, how about letting HLRCache's "Hi" support both KV nodes and LoRA? I'd like to contribute!

Thanks very much! orz orz orz

@lazyyy77 Thank you for your great questions!
For question 1, I think it's really meaningful. As you say, when we try to offload a LoRA adapter, all its nodes in the radix tree are supposed to be evicted (or kept, if the adapter might be loaded again?). The current code doesn't include such behavior, so this feature deserves further inspection.

For question 2, I'm not an expert in hierarchical cache, so I have no concept of the potential difficulties. For any questions about hicache you can try contacting @xiezhq-hermann.
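
A hedged sketch of the eviction behavior discussed in question 1, assuming a node shape with lora_id and children fields (the Node class and helper name are illustrative, not sglang's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lora_id: str | None = None                    # adapter that owns this node
    children: dict = field(default_factory=dict)  # key -> child Node

def evict_adapter_nodes(root, lora_id):
    """Detach every subtree whose top node is tagged with lora_id."""
    stack = [root]
    while stack:
        node = stack.pop()
        doomed = [key for key, child in node.children.items()
                  if child.lora_id == lora_id]
        for key in doomed:
            # Detaching drops the whole subtree; a real cache would also
            # have to free the KV pages those nodes reference.
            del node.children[key]
        stack.extend(node.children.values())
```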

Edenzzzz (Contributor) commented Aug 15, 2025

For 1, I think they will be lazily evicted if the adapter is not re-loaded (though it also depends on the depth of each LoRA tree; the longest tree will be evicted first).
For 2, LoRA weights are usually too small to require offloading? As long as key matching is correct, it shouldn't be that hard to support HiCache.

lazyyy77 commented

@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code and here are 2 questions.

  1. Since #7446 (Support dynamic LoRA loading / unloading in engine/server API) has added dynamic loading of LoRA, what will happen to the tree cache if I offload a LoRA adapter that already holds some nodes?
  2. Since multi-LoRA prefix matching is being solved, are there any potential difficulties in adapting this LoraRadixCache to a HiLoraRadixCache? What's more, how about letting HLRCache's "Hi" support both KV nodes and LoRA? I'd like to contribute!

Thanks very much! orz orz orz

@lazyyy77 Thank you for your great questions! For question 1, I think it's really meaningful. As you say, when we try to offload a LoRA adapter, all its nodes in the radix tree are supposed to be evicted (or kept, if the adapter might be loaded again?). The current code doesn't include such behavior, so this feature deserves further inspection.

For question 2, I'm not an expert in hierarchical cache, so I have no concept of the potential difficulties. For any questions about hicache you can try contacting @xiezhq-hermann.

Sure! Thanks a lot, and I'd like to give it a try when I finish current work~

lazyyy77 commented

For 1, I think they will be lazily evicted if the adapter is not re-loaded (though it also depends on the depth of each LoRA tree; the longest tree will be evicted first). For 2, LoRA weights are usually too small to require offloading? As long as key matching is correct, it shouldn't be that hard to support HiCache.

I agree with you! What's more, I know little about sglang's PCIe bandwidth behavior. How should we evaluate which is better between lazily offloading and directly offloading, when the latter might merge the piecewise operations but occupy the whole bandwidth?

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
lifuhuang mentioned this pull request Aug 18, 2025
Fridge003 deleted the lora_radix branch August 23, 2025 08:08
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

Development

Successfully merging this pull request may close these issues.

[Bug] Why can't I use multi-lora adapter and radix attention together?

6 participants