[Model Runner V2][Bugfix] Fix MRV2 LoRA warmup#35536
Conversation
There was a problem hiding this comment.
Code Review
This pull request aims to fix the LoRA warmup in the v2 model runner by wrapping the execute_model call in _dummy_run with the maybe_dummy_run_with_lora context manager. This correctly sets up dummy LoRA adapters for the warmup. However, there is a critical issue in how maybe_dummy_run_with_lora is called, where a boolean value is passed instead of the expected numpy array for num_scheduled_tokens. This will cause a TypeError during execution.
e2f25ff to
d5a7dc2
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
This pull request has merge conflicts that must be resolved before it can be |
WoosukKwon
left a comment
There was a problem hiding this comment.
@jeejeelee Apologies for the late review and thanks again for the PR!
I think the PR is clean overall. I left some stylistic questions. Please take a look.
| num_reqs: int, | ||
| num_tokens: int, | ||
| uniform_token_count: int | None, | ||
| num_active_loras: int = 0, |
There was a problem hiding this comment.
nit: Can we remove the default value (0) for num_active_loras across all functions? I think requiring it to be passed explicitly would make the call sites clearer, and it looks like all functions are already called with an explicit num_active_loras argument anyway.
| return [0, lora_config.max_loras + 1] | ||
|
|
||
|
|
||
| def resolve_effective_num_active_loras( |
There was a problem hiding this comment.
This looks a bit inefficient and verbose to me. Why don't we pre-compute the dispatch mapping like how we handle num_tokens?
| device: torch.device, | ||
| cudagraph_mode: CUDAGraphMode, | ||
| decode_query_len: int, | ||
| lora_capture_cases: list[int] | None = None, |
There was a problem hiding this comment.
just curious: why cases instead of sizes?
There was a problem hiding this comment.
it's just that I used case😅
| def hook(num_active_loras: int, num_reqs: int, num_tokens: int) -> None: | ||
| num_scheduled = np.full(num_reqs, num_tokens // num_reqs, dtype=np.int32) | ||
| num_scheduled[-1] += num_tokens % num_reqs | ||
| with runner.maybe_select_dummy_loras( | ||
| lora_config, num_scheduled, num_active_loras=num_active_loras | ||
| ): | ||
| pass |
There was a problem hiding this comment.
It shouldn't be in this PR, but it'd be nice if we can think about how to avoid this hack (if you agree it's a hack 😅). I guess we may need to refactor LoRAModelRunnerMixin?
There was a problem hiding this comment.
Will dig into it in the following PR.
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
|
This pull request has merge conflicts that must be resolved before it can be |
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
This pull request has merge conflicts that must be resolved before it can be |
WoosukKwon
left a comment
There was a problem hiding this comment.
LGTM! Thanks again for the PR!
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Purpose
WIP
Test Plan
export VLLM_USE_V2_MODEL_RUNNER=1
pytest vllm/tests/lora/test_llm_with_multi_loras.py -v -s
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.