Skip to content

[Model Runner V2][Bugfix] Fix MRV2 LoRA warmup#35536

Open
jeejeelee wants to merge 31 commits into
vllm-project:mainfrom
jeejeelee:fix-mrv2-lora
Open

[Model Runner V2][Bugfix] Fix MRV2 LoRA warmup#35536
jeejeelee wants to merge 31 commits into
vllm-project:mainfrom
jeejeelee:fix-mrv2-lora

Conversation

@jeejeelee

@jeejeelee jeejeelee commented Feb 27, 2026

Copy link
Copy Markdown
Member

Purpose

WIP

  • Fix LoRA warmup
  • LoRA Verification

Test Plan

export VLLM_USE_V2_MODEL_RUNNER=1
pytest vllm/tests/lora/test_llm_with_multi_loras.py -v -s

Test Result

tests/lora/test_llm_with_multi_loras.py::test_multiple_lora_requests
tests/lora/test_llm_with_multi_loras.py::test_load_inplace_offline_reload
tests/lora/test_llm_with_multi_loras.py::test_load_inplace_false_no_reload
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================ 4 passed, 20 warnings in 101.24s (0:01:41) =============================================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee jeejeelee marked this pull request as draft February 27, 2026 18:15
@mergify mergify Bot added v1 bug Something isn't working labels Feb 27, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix the LoRA warmup in the v2 model runner by wrapping the execute_model call in _dummy_run with the maybe_dummy_run_with_lora context manager. This correctly sets up dummy LoRA adapters for the warmup. However, there is a critical issue in how maybe_dummy_run_with_lora is called, where a boolean value is passed instead of the expected numpy array for num_scheduled_tokens. This will cause a TypeError during execution.

Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify Bot added the nvidia label Feb 28, 2026
@jeejeelee jeejeelee marked this pull request as ready for review February 28, 2026 13:22
@mergify

mergify Bot commented Mar 10, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 10, 2026
@mergify mergify Bot removed the needs-rebase label Mar 10, 2026
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/lora_utils.py Outdated
jeejeelee and others added 3 commits March 13, 2026 19:25
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify

mergify Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 22, 2026
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify Bot added the qwen Related to Qwen models label May 23, 2026
@mergify mergify Bot removed the needs-rebase label May 23, 2026
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 1, 2026

@WoosukKwon WoosukKwon left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeejeelee Apologies for the late review and thanks again for the PR!

I think the PR is clean overall. I left some stylistic questions. Please take a look.

Comment thread vllm/v1/worker/gpu/lora_utils.py Outdated
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/cudagraph_utils.py Outdated
num_reqs: int,
num_tokens: int,
uniform_token_count: int | None,
num_active_loras: int = 0,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we remove the default value (0) for num_active_loras across all functions? I think requiring it to be passed explicitly would make the call sites clearer, and it looks like all functions are already called with an explicit num_active_loras argument anyway.

Comment thread vllm/v1/worker/gpu/lora_utils.py Outdated
return [0, lora_config.max_loras + 1]


def resolve_effective_num_active_loras(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit inefficient and verbose to me. Why don't we pre-compute the dispatch mapping like how we handle num_tokens?

Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/lora_utils.py Outdated
Comment thread vllm/v1/worker/gpu/lora_utils.py Outdated
Comment thread vllm/v1/worker/gpu/model_runner.py Outdated
device: torch.device,
cudagraph_mode: CUDAGraphMode,
decode_query_len: int,
lora_capture_cases: list[int] | None = None,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious: why cases instead of sizes?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just that I used case😅

Comment on lines +88 to +94
def hook(num_active_loras: int, num_reqs: int, num_tokens: int) -> None:
num_scheduled = np.full(num_reqs, num_tokens // num_reqs, dtype=np.int32)
num_scheduled[-1] += num_tokens % num_reqs
with runner.maybe_select_dummy_loras(
lora_config, num_scheduled, num_active_loras=num_active_loras
):
pass

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be in this PR, but it'd be nice if we can think about how to avoid this hack (if you agree it's a hack 😅). I guess we may need to refactor LoRAModelRunnerMixin?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will dig into it in the following PR.

Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
@mergify

mergify Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 9, 2026
jeejeelee and others added 4 commits June 9, 2026 20:47
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify Bot removed the needs-rebase label Jun 10, 2026
@jeejeelee jeejeelee requested a review from WoosukKwon June 10, 2026 09:38
@mergify

mergify Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 12, 2026

@WoosukKwon WoosukKwon left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks again for the PR!

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Jun 15, 2026
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@WoosukKwon WoosukKwon requested a review from yewentao256 as a code owner June 15, 2026 17:02
@mergify mergify Bot removed the needs-rebase label Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working nvidia qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed v1 v2

Projects

Status: Ready

Development

Successfully merging this pull request may close these issues.

3 participants