
Feat: support LoRA for embedding layer #8222

Closed · Beichen-Ma wants to merge 39 commits into sgl-project:main from Beichen-Ma:feat-lora-embedding

Conversation

@Beichen-Ma (Contributor) commented Jul 21, 2025

Motivation

Integrate LoRA functionality into VocabParallelEmbedding to support parameter-efficient fine-tuning.

Modifications

Update LoRA Manager:

  • Added support for embedding modules (embed_tokens) in LoRAManager, including logic to update LoRA weight names, adapters, and memory buffers for embeddings.

Embedding Support Enhancements:

  • Implemented VocabParallelEmbeddingWithLoRA for embedding-specific operations, including handling new embeddings and applying LoRA transformations.
  • Introduced new attributes (new_embeddings and extra_vocab_size) in the LoRA class to manage added embeddings and extra vocabulary size (see the sketch after this list).
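
A minimal sketch (in PyTorch) of the LoRA-augmented embedding lookup this enables. Only new_embeddings and extra_vocab_size come from this PR; the class and the lora_A/lora_B names are illustrative placeholders, and tensor parallelism is omitted:

```python
import torch
import torch.nn as nn


class LoRAEmbeddingSketch(nn.Module):
    """Hypothetical stand-in for VocabParallelEmbeddingWithLoRA (no tensor parallelism)."""

    def __init__(self, base_embedding: nn.Embedding, rank: int, extra_vocab_size: int = 0):
        super().__init__()
        vocab_size, hidden = base_embedding.weight.shape
        self.base = base_embedding
        self.extra_vocab_size = extra_vocab_size
        # Full-rank rows for tokens the adapter adds beyond the base vocabulary.
        self.new_embeddings = nn.Parameter(torch.zeros(extra_vocab_size, hidden))
        # LoRA factors: A is indexed by token id (including added ids), B projects to hidden.
        self.lora_A = nn.Parameter(torch.zeros(vocab_size + extra_vocab_size, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, hidden))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vocab_size = self.base.weight.shape[0]
        is_new = token_ids >= vocab_size
        # Base table covers original ids; added ids read from new_embeddings instead.
        out = self.base(token_ids.clamp(max=vocab_size - 1))
        if self.extra_vocab_size > 0 and is_new.any():
            out[is_new] = self.new_embeddings[token_ids[is_new] - vocab_size]
        # LoRA path: gather rows of A by token id, then project with B.
        return out + self.lora_A[token_ids] @ self.lora_B
```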

Memory Pool Updates:

  • Updated LoRAMemoryPool to handle embedding-specific buffers (new_embeddings_buffer, embedding_A_buffer, embedding_B_buffer) and added methods to initialize and manage these buffers (a buffer-shape sketch follows this list).
  • Modified the prepare_lora_batch and load_lora_weight_to_buffer methods to include embedding modules in batch preparation and weight loading.
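
A hedged sketch of how those three buffers could be shaped, assuming one slot per concurrently loaded adapter; the function and shape choices here are assumptions, not the PR's actual code:

```python
import torch


def init_embedding_buffers(num_slots: int, max_rank: int, hidden_size: int,
                           vocab_size: int, max_extra_vocab: int,
                           device: str = "cuda", dtype=torch.bfloat16) -> dict:
    """Allocate the embedding-specific LoRA buffers named in this PR (illustrative)."""
    return {
        # Per-slot LoRA A factors, indexed by token id (base + added vocab).
        "embedding_A_buffer": torch.zeros(
            num_slots, vocab_size + max_extra_vocab, max_rank,
            device=device, dtype=dtype),
        # Per-slot LoRA B factors projecting rank -> hidden.
        "embedding_B_buffer": torch.zeros(
            num_slots, max_rank, hidden_size, device=device, dtype=dtype),
        # Per-slot full-rank embeddings for adapter-added tokens.
        "new_embeddings_buffer": torch.zeros(
            num_slots, max_extra_vocab, hidden_size,
            device=device, dtype=dtype),
    }
```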

Unit Tests:

  • Added an e2e correctness test to verify that results match HF Transformers (a minimal comparison sketch follows).
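
For reference, the HF side of such a comparison can be produced along these lines. This is a sketch, not the PR's test code; the prompt is a placeholder, and the adapter is the one used in Benchmark Test 2 below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "meta-llama/Llama-2-7b-hf"
adapter = "yard1/llama-2-7b-sql-lora-test"

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter)  # applies LoRA, incl. embeddings
model.eval()

inputs = tok("SELECT name FROM users WHERE", return_tensors="pt")
with torch.no_grad():
    ref_logits = model(**inputs).logits
# The e2e test then checks SGLang's outputs against ref_logits within a tolerance.
```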

Benchmark

  1. Test 1: compare perf between main and this branch for an adapter that does not have embedding LoRA.

     python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache --lora-paths lora=algoprog/fact-generation-llama-3.1-8b-instruct-lora
     python3 -m sglang.bench_serving --backend sglang --num-prompt 480 --request-rate 8 --lora-name lora

     | Metrics         | Main branch | This branch |
     |-----------------|-------------|-------------|
     | ITL P95         | 51.61 ms    | 45.80 ms    |
     | ITL P50         | 18.97 ms    | 18.31 ms    |
     | E2E Latency P50 | 2600.79 ms  | 2551.74 ms  |

     Full results: main, feat-embed
  2. Test 2: compare perf between embedding LoRA enabled and disabled to see the perf implication of the LoRA embedding layer.

     python3 -m sglang.launch_server --model-path meta-llama/Llama-2-7B-hf --disable-radix-cache --disable-cuda-graph --lora-paths lora=yard1/llama-2-7b-sql-lora-test --lora-extra-vocab-size 4
     python3 -m sglang.bench_serving --backend sglang --num-prompt 480 --request-rate 8 --lora-name lora

     | Metrics         | Disabled   | Enabled    |
     |-----------------|------------|------------|
     | ITL P95         | 61.92 ms   | 70.42 ms   |
     | ITL P50         | 28.42 ms   | 23.16 ms   |
     | E2E Latency P50 | 3600.77 ms | 3983.47 ms |

     Full results: Disabled, Enabled

Checklist

@gemini-code-assist (Bot) left a comment


Summary of Changes

Hello @Beichen-Ma, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the LoRA (Low-Rank Adaptation) functionality to support embedding layers, enabling more efficient fine-tuning of models that modify their input embeddings or introduce new vocabulary tokens. It involves substantial changes across the LoRA management, memory allocation, and layer application components, along with new testing infrastructure to ensure correctness.

Highlights

  • LoRA for Embedding Layers: Introduced VocabParallelEmbeddingWithLoRA to enable LoRA (Low-Rank Adaptation) for embedding layers (embed_tokens). This new class handles the application of LoRA transformations and manages 'added tokens' (new vocabulary tokens introduced by LoRA adapters).
  • LoRA Manager and Memory Pool Updates: The LoRAManager and LoRAMemoryPool have been significantly updated to support embedding-specific LoRA weights. This includes new data structures (lora_embeddings_modules, new_embeddings_buffer, embedding_A_buffer, embedding_B_buffer) and logic to load, manage, and apply these weights efficiently, especially considering tensor parallelism.
  • Dynamic Vocabulary Extension: The LoRAAdapter now tracks new_embeddings and extra_vocab_size, allowing LoRA adapters to extend the model's vocabulary with new tokens. The system dynamically handles these added tokens during the embedding lookup process.
  • CUDA Graph Compatibility: Added a check to disable CUDA graphs when embedding LoRA is used, as it is currently not supported with CUDA graphs. Users will be prompted to use --disable-cuda-graph if an embedding LoRA adapter is detected (a sketch of such a guard follows this list).
  • Comprehensive Unit and Integration Tests: A new unit test file (test_lora_layer.py) has been added to thoroughly validate the VocabParallelEmbeddingWithLoRA functionality, including various configurations and comparisons against manually computed outputs. Integration tests with the SRT runner using a real-world embedding LoRA adapter are also included.
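
A hedged sketch of what such a guard could look like; only the --disable-cuda-graph requirement comes from this PR, and the function and attribute names below are illustrative:

```python
def check_embedding_lora_cuda_graph(server_args, lora_configs) -> None:
    """Refuse to run embedding LoRA with CUDA graphs enabled (hypothetical helper)."""
    has_embedding_lora = any(
        getattr(cfg, "extra_vocab_size", 0) > 0
        or "embed_tokens" in getattr(cfg, "target_modules", [])
        for cfg in lora_configs
    )
    if has_embedding_lora and not server_args.disable_cuda_graph:
        raise ValueError(
            "Embedding LoRA is not yet supported with CUDA graphs; "
            "please relaunch the server with --disable-cuda-graph."
        )
```
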
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature              | Command             | Description                                                                              |
|----------------------|---------------------|------------------------------------------------------------------------------------------|
| Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.                 |
| Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.                      |
| Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.  |
| Help                 | /gemini help        | Displays a list of available commands.                                                    |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Bot) left a comment


Code Review

This PR introduces LoRA support for embedding layers. The changes are comprehensive, including updates to the LoRA manager, memory pool, and layer implementations. The addition of unit and integration tests is valuable. However, there are a few critical issues related to potential runtime errors (UnboundLocalError and TypeError) that need to be addressed, as well as a suggestion to improve error handling.

Comment thread python/sglang/srt/lora/lora_manager.py Outdated
Comment thread python/sglang/srt/lora/mem_pool.py Outdated
Comment thread python/sglang/srt/model_executor/model_runner.py Outdated
Comment thread python/sglang/srt/lora/layers.py Outdated
Comment thread python/sglang/srt/lora/layers.py Outdated
@lifuhuang (Collaborator) left a comment


Thank you for the contribution! ❤️

Comment thread python/sglang/srt/lora/layers.py
Comment thread python/sglang/srt/model_executor/model_runner.py Outdated
Comment thread python/sglang/srt/lora/lora_manager.py Outdated
Comment thread python/sglang/srt/lora/lora_manager.py Outdated
Comment thread python/sglang/srt/lora/mem_pool.py Outdated
Comment thread test/srt/models/lora/test_lora_layer.py Outdated
@Fridge003 self-assigned this Jul 22, 2025
…support both linear and embedding weights; Refined the test file name.
Comment thread test/srt/models/lora/test_lora_update.py Outdated
Comment thread python/sglang/srt/lora/layers.py Outdated
Comment thread test/srt/models/lora/test_lora_update.py
Comment thread test/srt/lora/test_lora_embedding_layer.py
Comment thread python/sglang/srt/lora/lora_config.py Outdated
Comment thread python/sglang/srt/lora/lora_config.py
Comment thread python/sglang/srt/lora/layers.py
@lifuhuang (Collaborator) commented

Hi @Beichen-Ma, can you attach the final perf results to the PR description?

  1. Test 1: compare perf between main and this branch for an adapter that does not have embedding LoRA.
  2. Test 2: measure perf of an adapter that supports embed (maybe compare between embed enabled vs embed disabled to see the perf implication of the LoRA embed layer).

Thanks!

@Beichen-Ma (Contributor, Author) commented

> Hi @Beichen-Ma, can you attach the final perf results to the PR description?
>
> 1. Test 1: compare perf between main and this branch for an adapter that does not have embedding LoRA.
> 2. Test 2: measure perf of an adapter that supports embed (maybe compare between embed enabled vs embed disabled to see the perf implication of the LoRA embed layer).
>
> Thanks!

Sure, I added the perf result in the description.

@Beichen-Ma requested a review from lifuhuang on August 27, 2025, 19:31
Comment thread python/sglang/srt/server_args.py
@Fridge003 (Collaborator) commented

@Beichen-Ma What's the status for this PR? Is it ready for merging, or do we need further implementation on lm_head?

@Beichen-Ma (Contributor, Author) commented Sep 3, 2025

> @Beichen-Ma What's the status for this PR? Is it ready for merging, or do we need further implementation on lm_head?

I added perf benchmark results to the description, and it's now waiting for review. The lm_head implementation will be in a separate PR and is blocked by this one.

Comment thread test/srt/run_suite.py
TestFile("lora/test_multi_lora_backend.py", 60),
TestFile("lora/test_lora_cuda_graph.py", 250),
TestFile("lora/test_lora_update.py", 400),
# TestFile("lora/test_lora_embedding_layer.py", 100),
Collaborator


So this test cannot pass right now?

Contributor (Author)


The test can pass when the commented-out lines are uncommented. Since there are no HF adapters that target embed_tokens without also targeting lm_head, the test is temporarily commented out and will be re-enabled once lm_head support is implemented.
