
[BugFix] Fix assert batch_descriptor.num_tokens == num_tokens_padded#30173

Merged
tlrmchlsmth merged 4 commits into vllm-project:main from neuralmagic:lwilkinson/fix-dp-assert
Dec 9, 2025

Conversation

@LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Dec 6, 2025

There's a bug with FULL_DECODE_ONLY (FULL_AND_PIECEWISE works fine) with DP. There is an edge case where one rank runs eager while all the other ranks want to run with cudagraphs, so we now synchronize the cudagraph mode each rank wants to run across all ranks. Since PIECEWISE can currently be treated as eager (it is valid in all the same situations), it is sufficient to disable full cudagraphs unless all ranks want them; we may want to pass an explicit list of valid modes in the future.

FIXES: #28579 (comment)
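The core idea above can be sketched as a min-reduce over the modes the ranks request. This is a hedged illustration, not vLLM's actual code: the `CUDAGraphMode` values and `sync_cudagraph_mode` helper are assumptions, chosen so that a lower value is the more conservative mode (the real synchronization happens inside an all-reduce across DP ranks).

```python
from enum import IntEnum

# Assumed mode ordering: lower value = more conservative execution.
class CUDAGraphMode(IntEnum):
    NONE = 0       # eager
    PIECEWISE = 1
    FULL = 2

def sync_cudagraph_mode(per_rank_modes: list[CUDAGraphMode]) -> CUDAGraphMode:
    """Stand-in for the cross-rank all-reduce: every rank ends up with the
    minimum (most conservative) mode that any rank requested."""
    return min(per_rank_modes)

# The edge case from the PR: one rank must run eager while the others
# want full cudagraphs.
modes = [CUDAGraphMode.NONE, CUDAGraphMode.FULL, CUDAGraphMode.FULL]
synced = sync_cudagraph_mode(modes)

# Because PIECEWISE is valid wherever eager is, it is enough to disable
# FULL whenever the synced mode is at or below PIECEWISE.
disable_full = synced <= CUDAGraphMode.PIECEWISE
```

With this rule, a single eager rank pulls every rank down to a mode that is safe everywhere, which is exactly why the assert no longer fires.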

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


-    return (should_ubatch, num_tokens_after_padding)
+    return (should_ubatch, num_tokens_after_padding, synced_cudagraph_mode)

P1: Align coordinate_batch_across_dp unpacking with new return

coordinate_batch_across_dp now returns three values including the synchronized cudagraph mode (return at line 240), but callers such as set_forward_context in forward_context.py (around lines 295–300) and eagle._pad_batch_across_dp in v1/spec_decode/eagle.py (around lines 1261–1269) still unpack only two items. In multi-DP runs where these paths invoke coordinate_batch_across_dp, Python will raise ValueError: too many values to unpack before padding or execution begins, breaking DP execution for forward contexts and EAGLE. Callers need to accept the third element or the function must preserve the previous 2-tuple interface.

Useful? React with 👍 / 👎.
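The failure mode Codex describes can be shown with a minimal stand-in. The function below is a hypothetical sketch that only mirrors the shapes involved (the names follow the PR, but the bodies and values are illustrative, not vLLM's):

```python
# Illustrative stand-in for the function whose return grew from 2 to 3 items.
def coordinate_batch_across_dp():
    should_ubatch = False
    num_tokens_after_padding = 128          # placeholder value
    synced_cudagraph_mode = 1               # placeholder value
    # After the PR, the synced cudagraph mode is appended to the return.
    return (should_ubatch, num_tokens_after_padding, synced_cudagraph_mode)

# A stale call site that still unpacks two values fails immediately:
try:
    should_ubatch, num_tokens = coordinate_batch_across_dp()
except ValueError:
    pass  # "too many values to unpack (expected 2)"

# The fix is for each caller to accept the third element:
should_ubatch, num_tokens, synced_mode = coordinate_batch_across_dp()
```

Alternatively, the previous 2-tuple interface could be preserved and the mode exposed separately, but updating the unpack at each call site is the smaller change.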

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug related to CUDA graph mode synchronization in a data parallel setup. The changes ensure that all ranks agree on a common CUDA graph mode by communicating their preferred mode and selecting the minimum one. This prevents assertion failures when different ranks attempt to use incompatible modes (e.g., one eager, others with cudagraphs). The implementation correctly passes the cudagraph mode during the all-reduce synchronization and uses the synchronized mode to make dispatching decisions. The logic appears sound and effectively fixes the described issue. I have one suggestion to improve code maintainability by replacing magic numbers with named constants.

Comment on lines +49 to +54
tensor = torch.zeros(5, dp_size, device=device, dtype=torch.int32)
tensor[0][dp_rank] = orig_num_tokens_per_ubatch
tensor[1][dp_rank] = padded_num_tokens_per_ubatch
tensor[2][dp_rank] = 1 if should_ubatch else 0
tensor[3][dp_rank] = 1 if should_dp_pad else 0
tensor[4][dp_rank] = cudagraph_mode
Contributor


Severity: high

The use of magic numbers 0, 1, 2, 3, 4 for indexing into the tensor makes the code hard to read and maintain. It's not immediately clear what each index represents without looking at the surrounding code or comments. This pattern is also present in _post_process_cudagraph_mode with tensor[4, :]. This can lead to bugs if the order or size of the tensor changes.

I recommend defining these indices as constants at the module level, for example, using an Enum. This would make the code self-documenting and less error-prone across all functions that use this tensor (_run_ar, _post_process_ubatch, _post_process_dp_padding, _post_process_cudagraph_mode).

For example:

from enum import IntEnum

class DPSync(IntEnum):
    ORIG_NUM_TOKENS_PER_UBATCH = 0
    PADDED_NUM_TOKENS_PER_UBATCH = 1
    SHOULD_UBATCH = 2
    SHOULD_DP_PAD = 3
    CUDAGRAPH_MODE = 4
    TENSOR_SIZE = 5

Then you could use tensor[DPSync.CUDAGRAPH_MODE] instead of tensor[4].
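A dependency-free version of that suggestion can be run directly; here nested lists stand in for the torch tensor, and the `dp_size`, `dp_rank`, and mode value are made-up example numbers:

```python
from enum import IntEnum

# Named indices for the DP sync tensor rows, per the review suggestion.
class DPSync(IntEnum):
    ORIG_NUM_TOKENS_PER_UBATCH = 0
    PADDED_NUM_TOKENS_PER_UBATCH = 1
    SHOULD_UBATCH = 2
    SHOULD_DP_PAD = 3
    CUDAGRAPH_MODE = 4
    TENSOR_SIZE = 5

dp_size, dp_rank = 4, 2                     # example values
tensor = [[0] * dp_size for _ in range(DPSync.TENSOR_SIZE)]

# IntEnum members are ints, so they index like the magic numbers did,
# but the row's meaning is now visible at the call site.
tensor[DPSync.CUDAGRAPH_MODE][dp_rank] = 2  # e.g. FULL
```

Since `IntEnum` members compare and index as plain ints, this is a drop-in replacement for `tensor[4]` with no behavioral change.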

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 6, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@hjjq
Contributor

hjjq commented Dec 8, 2025

Thanks @LucasWilkinson ! The error is gone for me.

Contributor

@SageMoore SageMoore left a comment


Thanks for the fix, @LucasWilkinson!

Member

@tlrmchlsmth tlrmchlsmth left a comment


makes sense, LGTM

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 9, 2025
@tlrmchlsmth tlrmchlsmth merged commit 56037df into vllm-project:main Dec 9, 2025
47 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 9, 2025
yiz-liu pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 20, 2026
…es in dp ranks (#6011)

### What this PR does / why we need it?
This PR aims to fix the issue that using A2 + AIV will hang due to the
fact that HCCL does not support eager/graph mode communication. To
handle it, following vllm-project/vllm#30173, we
introduce `synced_cudagraph_mode` to enable all ranks to know the
minimum mode across ranks. Main changes are described below:
1. `execute_model` now performs "dispatch -> sync -> re-dispatch" just
as `_dummy_run`
2. `_sync_metadata_across_dp` now receives `cudagraph_mode` from all
ranks and returns `synced_cudagraph_mode` to all ranks
3. Re-dispatch steps in both `execute_model` and `_dummy_run` include
`disable_full=synced_cudagraph_mode <= CUDAGraphMode.PIECEWISE.value` so
that when it is true, no FULL will be dispatched

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

---------

Signed-off-by: Zetong Li <slippersss@126.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
vllm-project#30173)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…es in dp ranks (vllm-project#6011)


Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1

Projects

Status: Done


4 participants