
Support radix cache for Lora feature #7216

Merged
hnyls2002 merged 8 commits into sgl-project:main from Fridge003:lora_radix
Aug 11, 2025

Conversation

Fridge003 (Collaborator) commented Jun 15, 2025

Motivation

Close #2880

Modifications

To support radix cache for LoRA, this PR makes the following modifications:

  • Add a new LoRARadixCache as a special version of RadixCache. They are mostly the same, except that a node in LoRARadixCache uses hash(lora_id + str(token_ids[0])) as its key for prefix matching, while a node in RadixCache uses token_ids[0] as its key (see the sketch after this list). I didn't modify the original RadixCache class directly, since supporting a different node key there would require many changes that would clutter the code.
  • Add test_lora_radix_cache for testing the compatibility of LoRA and radix cache. In this test we launch a base model with two different adapters, and the radix cache should be able to differentiate the same prompt tagged with different lora_ids.
  • Enable radix cache by default in all the CI tests
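
A minimal sketch of the key scheme, with illustrative names rather than the exact LoRARadixCache code:

```python
# Illustrative node-key derivation: fold the adapter id into the
# per-node token key so identical prefixes under different adapters
# land on different radix-tree children.

def lora_node_key(lora_id, token_id):
    return hash(str(lora_id) + str(token_id))

base_key = 12345                       # RadixCache: token_ids[0] alone
key_a = lora_node_key("lora1", 12345)  # same first token, adapter lora1
key_b = lora_node_key("lora2", 12345)  # same first token, adapter lora2
assert key_a != key_b                  # the two prefixes no longer collide
```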

Next Steps (#9144)

Currently when enabling lora and radix cache together, some features are not supported:

  • page_size > 1
  • schedule policies other than FCFS
  • hierarchical cache

These should be supported in the future.

The unified paging technique with radix cache also remains to be supported.

Benchmark

Changes in #10092 need to be applied to bench_multiturn.py

# Launching with radix cache
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --lora-paths lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora --port 30001

# Launching without radix cache
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --lora-paths lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora --port 30001 --disable-radix-cache

# Benchmark Commands
python3 benchmark/hicache/bench_multiturn.py --model-path meta-llama/Llama-3.1-8B-Instruct --port 30001 --lora-path lora1

Radix cache disabled:

Performance metrics summary:
  Total requests: 1280 at 16 requests per second
  Average TTFT: 0.53
  P90 TTFT: 1.33
  Median TTFT: 0.31
  Average latency: 10.28
  P90 latency: 15.33
  Median latency: 11.21
  Input token throughput: 16692.69 tokens per second
  Output token throughput: 641.85 tokens per second
  Request Throughput: 10.03 requests per second
  Cache Hit Rate: 0.000000

Radix cache enabled:

Performance metrics summary:
  Total requests: 1280 at 16 requests per second
  Average TTFT: 0.10
  P90 TTFT: 0.18
  Median TTFT: 0.07
  Average latency: 2.91
  P90 latency: 6.42
  Median latency: 2.13
  Input token throughput: 23926.39 tokens per second
  Output token throughput: 920.02 tokens per second
  Request Throughput: 14.38 requests per second
  Cache Hit Rate: 0.625577

gemini-code-assist (Bot) left a comment

Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the radix cache functionality with the LoRA feature by removing the previous restriction that required radix cache to be disabled when LoRA was active. This change allows both features to be used concurrently by default, simplifying configuration and potentially improving performance for LoRA workloads.

Highlights

  • Enable Radix Cache with LoRA: This pull request enables the radix cache feature to be used by default when LoRA is also enabled. Previously, there was an explicit requirement to disable radix cache when using LoRA.
  • Remove Compatibility Assertion: An assertion in the server arguments validation that prevented the combination of LoRA and radix cache has been removed.
  • Update Documentation and Tests: Example command lines in the documentation and test configurations have been updated to remove the --disable-radix-cache flag when using LoRA, reflecting the new default behavior.

gemini-code-assist (Bot) left a comment

Code Review

This pull request successfully enables the combination of radix cache and LoRA features by removing a previous restriction and updating documentation and test cases to reflect the new default behavior. The changes are straightforward and align with the stated goal that the features are orthogonal.

Review comment threads (4) on test/srt/models/lora/utils.py
lifuhuang (Collaborator) commented

Nice! Is there any perf gain when radix cache is enabled? Also, why was it disabled in the first place?

lifuhuang (Collaborator) commented

nit: maybe we should remove this line as well 😄

"--disable-radix-cache",

lifuhuang (Collaborator) left a comment

LGTM

lifuhuang mentioned this pull request Jun 16, 2025
Fridge003 changed the title from "Enable radix cache for Lora feature" to "[WIP] Enable radix cache for Lora feature" Jun 16, 2025
zhyncs (Collaborator) commented Jul 6, 2025

@Fridge003 any update?

Fridge003 (Collaborator, Author) commented

@Fridge003 any update?

Still working on it... Two of the tests cannot pass.

Edenzzzz (Contributor) commented

@Fridge003 Could you point to the failed tests? I see more than two failing.

Fridge003 (Collaborator, Author) commented

@Fridge003 Could you point to the failed tests? I see more than two failing.

test/srt/models/lora/test_lora.py and test/srt/models/lora/test_multi_lora_backend.py. Other tests can actually pass.

Review comment thread on test/srt/models/lora/test_lora.py (outdated)
zhyncs (Collaborator) commented Jul 27, 2025

Baizhou, is this PR ready?

Fridge003 (Collaborator, Author) commented

Baizhou, is this PR ready?

No, I found this feature is not as trivial as I thought; a new memory cache is needed.
I might need another 1~2 weeks.

Edenzzzz (Contributor) commented

No, I found this feature is not as trivial as I thought; a new memory cache is needed. I might need another 1~2 weeks.

Because of multi-lora?

Fridge003 (Collaborator, Author) commented

No, I found this feature is not as trivial as I thought; a new memory cache is needed. I might need another 1~2 weeks.

Because of multi-lora?

Yes, the same prompt might be run on different LoRA adapters.
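
To make this concrete, here is a hedged client-side illustration of the scenario: the same prompt is sent twice with different adapters, and the two requests must not share KV-cache entries. The payload shape follows sglang's /generate API; the port is carried over from the benchmark commands above, and the adapter names are illustrative.

```python
# Same prompt, two adapters: with a token-only radix key, the second
# request would incorrectly reuse the first request's cached prefix.
import requests

prompt = "Write a short fact about the moon."
for adapter in ["lora1", "lora2"]:  # adapter names are illustrative
    resp = requests.post(
        "http://localhost:30001/generate",
        json={
            "text": prompt,
            "lora_path": adapter,  # selects the adapter for this request
            "sampling_params": {"max_new_tokens": 32},
        },
    )
    print(adapter, resp.json()["text"])
```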

Edenzzzz (Contributor) commented

How about storing the output before applying LoRA?

lifuhuang (Collaborator) commented

How about storing the output before applying LoRA?

We do need to cache the LoRA result as well for perf gains; otherwise we would still need to recompute LoRA for each token and layer even when there is a prefix match.

hnyls2002 (Collaborator) commented

@Fridge003 LGTM, not a lora expert, but with two small questions:

  1. Why choose to define the key as hash(lora_id + str(token_ids[0])) instead of a tuple? Just interested in that.
  2. In your summary and the modified code, we can see that most of the changes happen in the KV-matching part. Is there a chance to define an abstraction over the key instead of adding a new radix cache?

hnyls2002 merged commit 75e6a7c into sgl-project:main Aug 11, 2025
175 of 189 checks passed
Fridge003 (Collaborator, Author) commented Aug 12, 2025

@Fridge003 LGTM, not a lora expert, but with two small questions:

  1. Why choose to define the key as hash(lora_id + str(token_ids[0])) instead of a tuple? Just interested in that.
  2. In your summary and the modified code, we can see that most of the changes happen in the KV-matching part. Is there a chance to define an abstraction over the key instead of adding a new radix cache?

@hnyls2002 Really want to thank you for your review~

For your questions:

  1. Changing to a tuple for matching is also OK. The key definition is valid as long as it contains the LoRA information needed for prefix matching; it's just a matter of personal preference (see the sketch below).
  2. Modifying the original radix cache is feasible, but not graceful. I would need to add a lot of extra logic to make it work, which would make future maintenance difficult.
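
For illustration, a minimal sketch of the tuple-based alternative from point 1 (names are illustrative, not sglang's API). Tuples are hashable, so they can serve directly as keys in a node's children table, with no string concatenation or explicit hash() call:

```python
def lora_node_key_tuple(lora_id, token_id):
    # (lora_id, token_id) keeps both fields inspectable, at the cost of
    # slightly larger dictionary keys than a single int hash.
    return (lora_id, token_id)

children = {}  # stand-in for a radix-tree node's children table
children[lora_node_key_tuple("lora1", 42)] = "node_a"
children[lora_node_key_tuple("lora2", 42)] = "node_b"
assert len(children) == 2  # same token, two adapters, two distinct entries
```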

lazyyy77 commented Aug 15, 2025

@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code and here are 2 questions.

  1. Since #7446 (Support dynamic LoRA loading / unloading in engine/server API) has added dynamic loading of LoRA, what will happen to the tree cache if I offload a LoRA adapter that already holds some nodes?
  2. Since multi-LoRA prefix matching is being solved, are there any potential difficulties in adapting this LoraRadixCache to a HiLoraRadixCache? What's more, how about letting HLRCache's "Hi" support both KV nodes and LoRA? I'd like to contribute!

Thanks very much! orz orz orz

Fridge003 (Collaborator, Author) commented Aug 15, 2025

@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code and here are 2 questions.

  1. Since #7446 (Support dynamic LoRA loading / unloading in engine/server API) has added dynamic loading of LoRA, what will happen to the tree cache if I offload a LoRA adapter that already holds some nodes?
  2. Since multi-LoRA prefix matching is being solved, are there any potential difficulties in adapting this LoraRadixCache to a HiLoraRadixCache? What's more, how about letting HLRCache's "Hi" support both KV nodes and LoRA? I'd like to contribute!

Thanks very much! orz orz orz

@lazyyy77 Thank you for your great questions!
For question 1, I think it's really meaningful. As you say, when we try to offload a LoRA adapter, all its nodes in the radix tree are supposed to be evicted (or kept, if the adapter might be loaded again?). The current code doesn't include such behavior, so this feature deserves further inspection.

For question 2, I'm not an expert in hierarchical cache, so I have no concept of the potential difficulties. For any questions about hicache you can try contacting @xiezhq-hermann.
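
A hedged sketch of the eviction behavior discussed in question 1, assuming a node shape with lora_id and children fields (the Node class and helper name are illustrative, not sglang's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lora_id: str | None = None                    # adapter that owns this node
    children: dict = field(default_factory=dict)  # key -> child Node

def evict_adapter_nodes(root, lora_id):
    """Detach every subtree whose top node is tagged with lora_id."""
    stack = [root]
    while stack:
        node = stack.pop()
        doomed = [key for key, child in node.children.items()
                  if child.lora_id == lora_id]
        for key in doomed:
            # Detaching drops the whole subtree; a real cache would also
            # have to free the KV pages those nodes reference.
            del node.children[key]
        stack.extend(node.children.values())
```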

Edenzzzz (Contributor) commented Aug 15, 2025

For 1, I think they will be lazily evicted if the adapter is not re-loaded (though it also depends on the depth of each LoRA tree; the longest tree will be evicted first).
For 2, LoRA weights are usually too small to require offloading? As long as key matching is correct, it shouldn't be that hard to support HiCache.

lazyyy77 commented

@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code and here are 2 questions.

  1. Since #7446 (Support dynamic LoRA loading / unloading in engine/server API) has added dynamic loading of LoRA, what will happen to the tree cache if I offload a LoRA adapter that already holds some nodes?
  2. Since multi-LoRA prefix matching is being solved, are there any potential difficulties in adapting this LoraRadixCache to a HiLoraRadixCache? What's more, how about letting HLRCache's "Hi" support both KV nodes and LoRA? I'd like to contribute!

Thanks very much! orz orz orz

@lazyyy77 Thank you for your great questions! For question 1, I think it's really meaningful. As you say, when we try to offload a LoRA adapter, all its nodes in the radix tree are supposed to be evicted (or kept, if the adapter might be loaded again?). The current code doesn't include such behavior, so this feature deserves further inspection.

For question 2, I'm not an expert in hierarchical cache, so I have no concept of the potential difficulties. For any questions about hicache you can try contacting @xiezhq-hermann.

Sure! Thanks a lot, and I'd like to give it a try when I finish current work~

lazyyy77 commented

For 1, I think they will be lazily evicted if the adapter is not re-loaded (though it also depends on the depth of each LoRA tree; the longest tree will be evicted first). For 2, LoRA weights are usually too small to require offloading? As long as key matching is correct, it shouldn't be that hard to support HiCache.

I agree with you! What's more, I know little about sglang's PCIe bandwidth behavior. How should we evaluate which is better between lazily offloading and directly offloading, when the latter might merge the piecewise operations but occupy the whole bandwidth?

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
lifuhuang mentioned this pull request Aug 18, 2025
Fridge003 deleted the lora_radix branch August 23, 2025 08:08
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

Development

Successfully merging this pull request may close these issues.

[Bug] Why can't I use multi-lora adapter and radix attention together?

6 participants