[BugFix] Fix the issue where there is no parallelism in PP mode #28286
weireweire wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces a throttling mechanism for scheduling new requests in pipeline parallelism (PP) mode. The goal is to prevent the pipeline from stalling by not exhausting the waiting queue at once, thus maintaining a backlog of requests. The logic seems sound for this purpose. However, I've identified a potential critical issue where a ZeroDivisionError could occur if pipeline_parallel_size is misconfigured to be zero. I've provided a suggestion to make the code more robust by adding a check for this.
    if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
        1,
        self.max_num_running_reqs
        // self.parallel_config.pipeline_parallel_size,
    ):
        break
The division by self.parallel_config.pipeline_parallel_size on line 361 assumes this value is non-zero. If it were misconfigured to be 0, this would lead to a ZeroDivisionError, causing the scheduler to crash. While pipeline_parallel_size defaults to 1, adding a check to ensure it's positive would make the code more robust against configuration errors. The suggested change also improves readability by extracting the limit into a variable.
Current code:

    if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
        1,
        self.max_num_running_reqs
        // self.parallel_config.pipeline_parallel_size,
    ):
        break

Suggested change:

    pp_size = self.parallel_config.pipeline_parallel_size
    if pp_size <= 0:
        raise ValueError(
            "pipeline_parallel_size must be positive, but is "
            f"{pp_size}")
    limit = max(1, self.max_num_running_reqs // pp_size)
    if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= limit:
        break
Thanks @weireweire. I've made a complete fix in #28768. However, this doesn't include your change to the scheduler:

If I understand correctly, this is an orthogonal optimization? Perhaps you could open another PR with just that change for consideration?
@njhill yes, it's orthogonal. I'll revert the overlap fix and keep only the "run out of requests" fix here.
This pull request has merge conflicts that must be resolved before it can be merged.
@WoosukKwon could you help review?
… in ray backend. Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
Purpose
- In PP mode the scheduler currently admits requests up to --max-num-seqs all at once, which leaves the later pipeline steps with no requests to run, so they just wait.
- Tuning --max-num-batched-tokens can avoid reaching --max-num-seqs at once, but that only works for prefill.
- This PR throttles each scheduling step to --max-num-seqs / PP_size new requests to split the work into micro batches (a minimal sketch is shown after this list).
- It also fixes undesired blocking in the engine core that caused the missing parallelism for PP in the mp backend.
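To make the throttling idea concrete, here is a minimal, illustrative sketch. This is not the actual vLLM scheduler code; the names `max_num_running_reqs` and `pipeline_parallel_size` simply mirror the config values referenced above.

```python
# Illustrative sketch only: cap how many requests a single scheduling step may
# admit so that later pipeline stages still find work in the waiting queue.
def admit_requests(waiting: list, max_num_running_reqs: int,
                   pipeline_parallel_size: int) -> list:
    # Each step admits at most max_num_seqs // PP_size requests (at least 1),
    # instead of draining the whole waiting queue at once.
    limit = max(1, max_num_running_reqs // pipeline_parallel_size)
    admitted = []
    while waiting and len(admitted) < limit:
        admitted.append(waiting.pop(0))
    return admitted
```

For example, with --max-num-seqs 8 and PP=8 (as in the test below), each step admits at most one new request, leaving the remaining requests to be scheduled on subsequent steps for the other pipeline stages.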
I tested on SM120 with PP8, the command is:
vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --host 0.0.0.0 --port 8000 --pipeline-parallel-size 8 --tensor-parallel-size 1 --max-num-seqs 8 --max-cudagraph-capture-size 8 --max-model-len 4042 --max-num-batched-tokens 32000 --enable-chunked-prefill --kv-cache-dtype auto --gpu-memory-utilization 0.85 --no-enable-prefix-caching

And use --distributed-executor-backend to choose the mp or ray backend.

Bench command:
vllm bench serve --model nvidia/DeepSeek-R1-0528-FP4-v2 --host 127.0.0.1 --port 8000 --dataset-name random --max-concurrency 8 --num-prompts 256 --random-input-len 4000 --random-output-len 32 --random-range-ratio 0 --num-warmups 20

Test Result
mp backend, before this PR:
mp backend, after this PR:
ray backend, before this pr:
ray backend, after this pr: