Support radix cache for Lora feature#7216
Conversation
Summary of Changes
This pull request integrates the radix cache functionality with the LoRA feature by removing the previous restriction that required radix cache to be disabled when LoRA was active. This change allows both features to be used concurrently by default, simplifying configuration and potentially improving performance for LoRA workloads.
Highlights
- Enable Radix Cache with LoRA: This pull request enables the radix cache feature to be used by default when LoRA is also enabled. Previously, there was an explicit requirement to disable radix cache when using LoRA.
- Remove Compatibility Assertion: An assertion in the server arguments validation that prevented the combination of LoRA and radix cache has been removed.
- Update Documentation and Tests: Example command lines in the documentation and test configurations have been updated to remove the `--disable-radix-cache` flag when using LoRA, reflecting the new default behavior.
Code Review
This pull request successfully enables the combination of radix cache and LoRA features by removing a previous restriction and updating documentation and test cases to reflect the new default behavior. The changes are straightforward and align with the stated goal that the features are orthogonal.
Nice! Is there any perf gain when radix cache is enabled? Also, why was it disabled in the first place?
nit: maybe we should remove this line as well 😄
@Fridge003 any update?
Still working on it... two of the tests are failing.
@Fridge003 Could you point out which tests failed? I see more than two failing.
Baizhou, is this PR ready?
No, I found this feature is not as trivial as I thought; a new memory cache is needed.
Because of multi-lora? |
Yes, the same prompt might be run on different LoRA adapters.
How about storing the output before applying LoRA?
We do need to cache the LoRA result as well for perf gains; otherwise we would still need to recompute LoRA for each token and layer even when there is a prefix match.
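To make the issue in this thread concrete, here is a minimal sketch (all names hypothetical, not sglang's actual API) of why a token-only prefix cache is unsafe with multiple LoRA adapters: the same prompt produces different KV under different adapter weights, so the cache key must include the adapter identity.

```python
# Hypothetical toy cache, not sglang code: shows false sharing across
# adapters when the key ignores which LoRA adapter was active.

def token_only_key(token_ids):
    # Key built from tokens alone -- cannot tell adapters apart.
    return tuple(token_ids)

def lora_aware_key(lora_id, token_ids):
    # Including lora_id separates entries for the same prompt per adapter.
    return (lora_id, tuple(token_ids))

prompt = [101, 2003, 42]

naive_cache = {token_only_key(prompt): "kv computed under adapter_a"}
# Adapter B requests the same prompt: the token-only cache wrongly hits,
# returning KV that was computed with adapter A's weights.
print(token_only_key(prompt) in naive_cache)  # -> True (false sharing)

safe_cache = {lora_aware_key("adapter_a", prompt): "kv under adapter_a"}
# With adapter-aware keys, adapter B misses and recomputes correctly.
print(lora_aware_key("adapter_b", prompt) in safe_cache)  # -> False
```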
@Fridge003 LGTM. I'm not a LoRA expert, but I have two small questions:
@hnyls2002 Really want to thank you for your review~ For your questions:
@Fridge003 Hello! I'm doing some work with sglang and LoRA. I've read the code, and here are two questions.
@lazyyy77 Thank you for your great questions! For question 2, I'm not an expert in hierarchical cache, so I'm not sure what the potential difficulties are. For any questions about hicache, you can try contacting @xiezhq-hermann.
For 1, I think they will be lazily evicted if the adapter is not re-loaded (though it also depends on the depth of each LoRA tree; the longest tree will be evicted first).
Sure! Thanks a lot, and I'd like to give it a try when I finish my current work~
I agree with you! What's more, I know little about sglang's PCIe bandwidth usage. How should we evaluate which is better between lazily offloading and directly offloading, given that the latter might merge the piecewise operations but occupy the whole bandwidth?
Motivation
Closes #2880
Modifications
To support radix cache for Lora, this PR does the following modifications:
- Adds `LoRARadixCache` as a special version of `RadixCache`. They are mostly the same, except that a node in `LoRARadixCache` uses `hash(lora_id + str(token_ids[0]))` as its key for prefix matching, while a node in `RadixCache` uses `token_ids[0]` as its key. I didn't modify the original `RadixCache` class, since the different node keys would introduce a lot of modifications that would contaminate the code.
- Adds `test_lora_radix_cache` for testing the compatibility of LoRA and radix cache. In this test we launch a base model with two different adapters, and the radix cache should be able to differentiate the same prompt tagged with different lora_ids.

Next Steps (#9144)
Currently when enabling lora and radix cache together, some features are not supported:
These should be supported in the future. The unified paging technique should also be made compatible with radix cache.
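As a rough illustration of the node-key scheme described in the Modifications section (the helper name below is hypothetical; the real logic lives inside `LoRARadixCache`), keying each node on `hash(lora_id + str(token_ids[0]))` instead of `token_ids[0]` means identical prompts under different adapters match different radix-tree nodes:

```python
# Illustrative sketch, not the actual sglang implementation: compute a
# lora-aware node key so that the same token segment under two different
# adapters lands on two different radix-tree nodes.

def lora_node_key(lora_id: str, token_ids: list) -> int:
    # First token of the node's segment, combined with the adapter id,
    # mirroring hash(lora_id + str(token_ids[0])) from the PR description.
    return hash(lora_id + str(token_ids[0]))

tokens = [101, 7, 9]
key_a = lora_node_key("adapter_a", tokens)
key_b = lora_node_key("adapter_b", tokens)
print(key_a != key_b)  # same prompt, different adapters -> distinct keys
```

Note that a plain `RadixCache` node keyed on `token_ids[0]` alone would give both adapters the same node, which is exactly the false sharing this PR avoids.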
Benchmark
Changes in #10092 need to be applied to `bench_multiturn.py`.

Radix cache disabled:
Radix cache enabled: