
Feat: support LoRA for embedding layer #8222

Closed · Beichen-Ma wants to merge 39 commits into sgl-project:main from Beichen-Ma:feat-lora-embedding

Conversation

@Beichen-Ma (Contributor) commented Jul 21, 2025

Motivation

Integrate LoRA functionality into VocabParallelEmbedding to support parameter-efficient fine-tuning.

Modifications

Update LoRA Manager:

  • Added support for embedding modules (embed_tokens) in LoRAManager, including logic to update LoRA weight names, adapters, and memory buffers for embeddings.

Embedding Support Enhancements:

  • Implemented VocabParallelEmbeddingWithLoRA for embedding-specific operations, including handling new embeddings and applying LoRA transformations.
  • Introduced new attributes (new_embeddings and extra_vocab_size) in the LoRA class to manage added embeddings and extra vocabulary size (see the sketch after this list).
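
A minimal sketch (in PyTorch) of the LoRA-augmented embedding lookup this enables. Only new_embeddings and extra_vocab_size come from this PR; the class and the lora_A/lora_B names are illustrative placeholders, and tensor parallelism is omitted:

```python
import torch
import torch.nn as nn


class LoRAEmbeddingSketch(nn.Module):
    """Hypothetical stand-in for VocabParallelEmbeddingWithLoRA (no tensor parallelism)."""

    def __init__(self, base_embedding: nn.Embedding, rank: int, extra_vocab_size: int = 0):
        super().__init__()
        vocab_size, hidden = base_embedding.weight.shape
        self.base = base_embedding
        self.extra_vocab_size = extra_vocab_size
        # Full-rank rows for tokens the adapter adds beyond the base vocabulary.
        self.new_embeddings = nn.Parameter(torch.zeros(extra_vocab_size, hidden))
        # LoRA factors: A is indexed by token id (including added ids), B projects to hidden.
        self.lora_A = nn.Parameter(torch.zeros(vocab_size + extra_vocab_size, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, hidden))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vocab_size = self.base.weight.shape[0]
        is_new = token_ids >= vocab_size
        # Base table covers original ids; added ids read from new_embeddings instead.
        out = self.base(token_ids.clamp(max=vocab_size - 1))
        if self.extra_vocab_size > 0 and is_new.any():
            out[is_new] = self.new_embeddings[token_ids[is_new] - vocab_size]
        # LoRA path: gather rows of A by token id, then project with B.
        return out + self.lora_A[token_ids] @ self.lora_B
```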

Memory Pool Updates:

  • Updated LoRAMemoryPool to handle embedding-specific buffers (new_embeddings_buffer, embedding_A_buffer, embedding_B_buffer) and added methods to initialize and manage these buffers (a buffer-shape sketch follows this list).
  • Modified the prepare_lora_batch and load_lora_weight_to_buffer methods to include embedding modules in batch preparation and weight loading.
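
A hedged sketch of how those three buffers could be shaped, assuming one slot per concurrently loaded adapter; the function and shape choices here are assumptions, not the PR's actual code:

```python
import torch


def init_embedding_buffers(num_slots: int, max_rank: int, hidden_size: int,
                           vocab_size: int, max_extra_vocab: int,
                           device: str = "cuda", dtype=torch.bfloat16) -> dict:
    """Allocate the embedding-specific LoRA buffers named in this PR (illustrative)."""
    return {
        # Per-slot LoRA A factors, indexed by token id (base + added vocab).
        "embedding_A_buffer": torch.zeros(
            num_slots, vocab_size + max_extra_vocab, max_rank,
            device=device, dtype=dtype),
        # Per-slot LoRA B factors projecting rank -> hidden.
        "embedding_B_buffer": torch.zeros(
            num_slots, max_rank, hidden_size, device=device, dtype=dtype),
        # Per-slot full-rank embeddings for adapter-added tokens.
        "new_embeddings_buffer": torch.zeros(
            num_slots, max_extra_vocab, hidden_size,
            device=device, dtype=dtype),
    }
```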

Unit Tests:

  • Added an e2e correctness test to verify that results match HF Transformers (a minimal comparison sketch follows).
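
For reference, the HF side of such a comparison can be produced along these lines. This is a sketch, not the PR's test code; the prompt is a placeholder, and the adapter is the one used in Benchmark Test 2 below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "meta-llama/Llama-2-7b-hf"
adapter = "yard1/llama-2-7b-sql-lora-test"

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter)  # applies LoRA, incl. embeddings
model.eval()

inputs = tok("SELECT name FROM users WHERE", return_tensors="pt")
with torch.no_grad():
    ref_logits = model(**inputs).logits
# The e2e test then checks SGLang's outputs against ref_logits within a tolerance.
```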

Benchmark

  1. Test 1: compare perf between main and this branch for an adapter that does not have embedding LoRA.

     python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache --lora-paths lora=algoprog/fact-generation-llama-3.1-8b-instruct-lora
     python3 -m sglang.bench_serving --backend sglang --num-prompt 480 --request-rate 8 --lora-name lora

     | Metrics         | Main branch | This branch |
     |-----------------|-------------|-------------|
     | ITL P95         | 51.61 ms    | 45.80 ms    |
     | ITL P50         | 18.97 ms    | 18.31 ms    |
     | E2E Latency P50 | 2600.79 ms  | 2551.74 ms  |

     Full results: main, feat-embed
  2. Test 2: compare perf between embedding LoRA enabled and disabled to see the perf implication of the LoRA embedding layer.

     python3 -m sglang.launch_server --model-path meta-llama/Llama-2-7B-hf --disable-radix-cache --disable-cuda-graph --lora-paths lora=yard1/llama-2-7b-sql-lora-test --lora-extra-vocab-size 4
     python3 -m sglang.bench_serving --backend sglang --num-prompt 480 --request-rate 8 --lora-name lora

     | Metrics         | Disabled   | Enabled    |
     |-----------------|------------|------------|
     | ITL P95         | 61.92 ms   | 70.42 ms   |
     | ITL P50         | 28.42 ms   | 23.16 ms   |
     | E2E Latency P50 | 3600.77 ms | 3983.47 ms |

     Full results: Disabled, Enabled

Checklist

@gemini-code-assist (Bot) left a comment


Summary of Changes

Hello @Beichen-Ma, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the LoRA (Low-Rank Adaptation) functionality to support embedding layers, enabling more efficient fine-tuning of models that modify their input embeddings or introduce new vocabulary tokens. It involves substantial changes across the LoRA management, memory allocation, and layer application components, along with new testing infrastructure to ensure correctness.

Highlights

  • LoRA for Embedding Layers: Introduced VocabParallelEmbeddingWithLoRA to enable LoRA (Low-Rank Adaptation) for embedding layers (embed_tokens). This new class handles the application of LoRA transformations and manages 'added tokens' (new vocabulary tokens introduced by LoRA adapters).
  • LoRA Manager and Memory Pool Updates: The LoRAManager and LoRAMemoryPool have been significantly updated to support embedding-specific LoRA weights. This includes new data structures (lora_embeddings_modules, new_embeddings_buffer, embedding_A_buffer, embedding_B_buffer) and logic to load, manage, and apply these weights efficiently, especially considering tensor parallelism.
  • Dynamic Vocabulary Extension: The LoRAAdapter now tracks new_embeddings and extra_vocab_size, allowing LoRA adapters to extend the model's vocabulary with new tokens. The system dynamically handles these added tokens during the embedding lookup process.
  • CUDA Graph Compatibility: Added a check to disable CUDA graphs when embedding LoRA is used, as it is currently not supported with CUDA graphs. Users will be prompted to use --disable-cuda-graph if an embedding LoRA adapter is detected (a sketch of such a guard follows this list).
  • Comprehensive Unit and Integration Tests: A new unit test file (test_lora_layer.py) has been added to thoroughly validate the VocabParallelEmbeddingWithLoRA functionality, including various configurations and comparisons against manually computed outputs. Integration tests with the SRT runner using a real-world embedding LoRA adapter are also included.
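
A hedged sketch of what such a guard could look like; only the --disable-cuda-graph requirement comes from this PR, and the function and attribute names below are illustrative:

```python
def check_embedding_lora_cuda_graph(server_args, lora_configs) -> None:
    """Refuse to run embedding LoRA with CUDA graphs enabled (hypothetical helper)."""
    has_embedding_lora = any(
        getattr(cfg, "extra_vocab_size", 0) > 0
        or "embed_tokens" in getattr(cfg, "target_modules", [])
        for cfg in lora_configs
    )
    if has_embedding_lora and not server_args.disable_cuda_graph:
        raise ValueError(
            "Embedding LoRA is not yet supported with CUDA graphs; "
            "please relaunch the server with --disable-cuda-graph."
        )
```
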
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature              | Command             | Description                                                                              |
|----------------------|---------------------|------------------------------------------------------------------------------------------|
| Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.                 |
| Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.                      |
| Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.  |
| Help                 | /gemini help        | Displays a list of available commands.                                                    |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Bot) left a comment


Code Review

This PR introduces LoRA support for embedding layers. The changes are comprehensive, including updates to the LoRA manager, memory pool, and layer implementations. The addition of unit and integration tests is valuable. However, there are a few critical issues related to potential runtime errors (UnboundLocalError and TypeError) that need to be addressed, as well as a suggestion to improve error handling.

Comment thread python/sglang/srt/lora/lora_manager.py Outdated
Comment thread python/sglang/srt/lora/mem_pool.py Outdated
Comment thread python/sglang/srt/model_executor/model_runner.py Outdated
Comment thread python/sglang/srt/lora/layers.py Outdated
Comment thread python/sglang/srt/lora/layers.py Outdated
@lifuhuang (Collaborator) left a comment


Thank you for the contribution! ❤️

Comment thread python/sglang/srt/lora/layers.py
Comment thread python/sglang/srt/model_executor/model_runner.py Outdated
Comment thread python/sglang/srt/lora/lora_manager.py Outdated
Comment thread python/sglang/srt/lora/lora_manager.py Outdated
Comment thread python/sglang/srt/lora/mem_pool.py Outdated
Comment thread test/srt/models/lora/test_lora_layer.py Outdated
@Fridge003 self-assigned this Jul 22, 2025
…support both linear and embedding weights; Refined the test file name.
Comment thread test/srt/models/lora/test_lora_update.py Outdated
Comment thread python/sglang/srt/lora/layers.py Outdated
Comment thread test/srt/models/lora/test_lora_update.py
Comment thread test/srt/lora/test_lora_embedding_layer.py
Comment thread python/sglang/srt/lora/lora_config.py Outdated
Comment thread python/sglang/srt/lora/lora_config.py
Comment thread python/sglang/srt/lora/layers.py
@lifuhuang (Collaborator) commented

Hi @Beichen-Ma, can you attach the final perf results to the PR description?

  1. Test 1: compare perf between main and this branch for an adapter that does not have embedding LoRA.
  2. Test 2: measure perf of an adapter that supports embed (maybe compare between embed enabled vs embed disabled to see the perf implication of the LoRA embed layer).

Thanks!

@Beichen-Ma (Contributor, Author) commented

> Hi @Beichen-Ma, can you attach the final perf results to the PR description?
>
> 1. Test 1: compare perf between main and this branch for an adapter that does not have embedding LoRA.
> 2. Test 2: measure perf of an adapter that supports embed (maybe compare between embed enabled vs embed disabled to see the perf implication of the LoRA embed layer).
>
> Thanks!

Sure, I added the perf result in the description.

@Beichen-Ma requested a review from lifuhuang on August 27, 2025, 19:31
Comment thread python/sglang/srt/server_args.py
@Fridge003 (Collaborator) commented

@Beichen-Ma What's the status for this PR? Is it ready for merging, or do we need further implementation on lm_head?

@Beichen-Ma (Contributor, Author) commented Sep 3, 2025

> @Beichen-Ma What's the status for this PR? Is it ready for merging, or do we need further implementation on lm_head?

I added perf benchmark results to the description, and it's now waiting for review. The lm_head implementation will be in a separate PR and is blocked by this one.

Comment thread test/srt/run_suite.py
TestFile("lora/test_multi_lora_backend.py", 60),
TestFile("lora/test_lora_cuda_graph.py", 250),
TestFile("lora/test_lora_update.py", 400),
# TestFile("lora/test_lora_embedding_layer.py", 100),
Collaborator


So this test cannot pass right now?

Contributor (Author)


The test can pass when the commented-out lines are uncommented. Since there are no HF adapters that target embed_tokens without also targeting lm_head, the test is temporarily commented out and will be re-enabled once lm_head support is implemented.
