[Model Runner V2] FP32 gumbel sampling. by PatchouliTIS · Pull Request #41775 · vllm-project/vllm

PatchouliTIS · 2026-05-06T02:35:18Z

Purpose

Profile results on H20

Optimized Gumbel sampling precision behavior in:
- vllm/v1/worker/gpu/sample/gumbel.py
- vllm/envs.py

This PR makes FP64 use in Gumbel sampling optional rather than mandatory.

FP32 becomes the default, faster path unless FP64 is explicitly enabled. To avoid numerical instability, the uniform sample u is kept strictly above zero before the Gumbel noise -log(-log(u)) is computed, since u = 0 would make the inner log(u) diverge to -inf.
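The clamping idea can be sketched in plain Python. This is only an illustration, not the Triton kernel from the PR: the helper names (`to_fp32`, `gumbel_noise`, `gumbel_argmax`) and the clamp bounds are assumptions, and FP32 is emulated by rounding through `struct`. The sketch also clamps u below 1, because u = 1 would give log(0) in the outer logarithm.

```python
import math
import struct

# Assumed clamp bounds for this sketch (not taken from the PR).
FP32_TINY = 2.0 ** -126            # smallest positive normal FP32 value
FP32_BELOW_ONE = 1.0 - 2.0 ** -24  # largest FP32 value below 1.0
FP64_BELOW_ONE = 1.0 - 2.0 ** -53  # largest FP64 value below 1.0


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gumbel_noise(u: float, use_fp64: bool = False) -> float:
    """Return -log(-log(u)) with u clamped into the open interval (0, 1)."""
    if use_fp64:
        u = min(max(u, FP32_TINY), FP64_BELOW_ONE)
        return -math.log(-math.log(u))
    # Emulated FP32 path: round the input and every intermediate result.
    u = min(max(to_fp32(u), FP32_TINY), FP32_BELOW_ONE)
    inner = to_fp32(-math.log(u))
    return to_fp32(-math.log(inner))


def gumbel_argmax(logits, uniforms, use_fp64: bool = False) -> int:
    """Gumbel-max trick: argmax of logits plus independent Gumbel noise."""
    scores = [l + gumbel_noise(u, use_fp64) for l, u in zip(logits, uniforms)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With the clamp in place, `gumbel_noise(0.0)` returns a finite value in both modes instead of raising or producing an infinity, and the argmax over noisy logits samples from the softmax distribution of the logits.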

Nsys profile (screenshots not preserved): FP64 sampler vs. FP32 sampler.

Test Plan

Test Result


Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces an optimization for the Gumbel sampler by allowing it to operate in FP32 instead of FP64, controlled by a new environment variable VLLM_SAMPLER_FP64_GUMBEL. While the high-level functions and the gumbel_block_argmax helper were updated, the _gumbel_sample_kernel signature was not modified to accept the new USE_FP64 parameter, which will lead to a TypeError at runtime.
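The failure mode the review describes can be reproduced in plain Python. The functions below are hypothetical stand-ins, not the actual `_gumbel_sample_kernel`; a Triton launcher may report the mismatch differently, so treat this as a sketch of the general problem of passing a new keyword to a callee whose signature was not updated.

```python
def launch_old(logits, temperature):
    """Stand-in for a kernel whose signature was not updated."""
    return max(range(len(logits)), key=logits.__getitem__)


def launch_new(logits, temperature, USE_FP64=False):
    """Stand-in for a kernel that accepts the new precision flag."""
    return max(range(len(logits)), key=logits.__getitem__)


def call_with_flag(kernel):
    """Call a kernel the way an updated caller would, passing USE_FP64."""
    try:
        kernel([0.1, 0.9], 1.0, USE_FP64=True)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"
```

Calling `call_with_flag(launch_old)` returns a `TypeError` message about an unexpected keyword argument, while `call_with_flag(launch_new)` returns `"ok"`.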


njhill

Thanks @PatchouliTIS, this looks nice to me.

TheEpicDolphin

Changes look good to me too! I think using fp32 by default here is definitely the right move. Can you please double check that draft acceptance rates don't regress from using fp32 vs fp64?

WoosukKwon · 2026-05-12T21:29:13Z

Given this is not a temporary flag, what about having an engine flag (--use-fp64-gumbel or something like that) instead of the env variable?


mergify · 2026-05-13T07:18:45Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @PatchouliTIS.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

PatchouliTIS · 2026-05-13T07:59:12Z

> Given this is not a temporary flag, what about having an engine flag (--use-fp64-gumbel or something like that) instead of the env variable?

Moved the env variable into an engine flag.
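The difference between the two configuration styles can be sketched with the standard library. This is not vLLM's argument-parsing code: the helper names and the way the environment value is interpreted are assumptions; only the names `VLLM_SAMPLER_FP64_GUMBEL` and `--use-fp64-gumbel` come from this thread.

```python
import argparse
import os


def fp64_from_env(environ=None) -> bool:
    """Env-variable style: read a process-wide toggle (assumed "1" means on)."""
    environ = os.environ if environ is None else environ
    return environ.get("VLLM_SAMPLER_FP64_GUMBEL", "0") == "1"


def build_parser() -> argparse.ArgumentParser:
    """Engine-flag style: an explicit, documented command-line option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--use-fp64-gumbel",
        action="store_true",
        help="Use FP64 for Gumbel sampling (FP32 is the default).",
    )
    return parser
```

For example, `build_parser().parse_args(["--use-fp64-gumbel"]).use_fp64_gumbel` is `True`, and omitting the flag leaves it `False`. A flag like this shows up in `--help` output and in the parsed configuration, which suits a permanent option better than an environment variable.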

mergify · 2026-05-13T08:04:35Z

Hi @PatchouliTIS, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


TheEpicDolphin · 2026-05-14T18:27:00Z

+Requirement already satisfied: pre-commit in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (4.5.1)
+Requirement already satisfied: cfgv>=2.0.0 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (3.5.0)
+Requirement already satisfied: identify>=1.0.0 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (2.6.16)
+Requirement already satisfied: nodeenv>=0.11.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (1.10.0)
+Requirement already satisfied: pyyaml>=5.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (6.0.3)
+Requirement already satisfied: virtualenv>=20.10.0 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (20.36.1)
+Requirement already satisfied: distlib<1,>=0.3.7 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from virtualenv>=20.10.0->pre-commit) (0.4.0)
+Requirement already satisfied: filelock<4,>=3.20.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from virtualenv>=20.10.0->pre-commit) (3.20.3)
+Requirement already satisfied: platformdirs<5,>=3.9.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from virtualenv>=20.10.0->pre-commit) (4.5.0)


I think this file was added by accident.

Irrelevant file removed.

WoosukKwon

@PatchouliTIS LGTM. Can you please remove the redundant file so that we can merge?


PatchouliTIS · 2026-05-15T01:42:39Z

> @PatchouliTIS LGTM. Can you please remove the redundant file so that we can merge?

Done, irrelevant files removed.

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>

### What this PR does / why we need it?

This PR updates vllm-ascend main2main validation to:

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1

Main upstream changes and vllm-ascend adaptations:

1. vLLM PR: vllm-project/vllm#41775 `[Model Runner V2] FP32 gumbel sampling`

   Upstream changes:
   - Adds `use_fp64_gumbel` config / argument.
   - Changes Gumbel sampling and rejection sampling paths to accept `use_fp64`.
   - Makes FP32 Gumbel the default path and keeps FP64 optional.

   vllm-ascend adaptation:
   - Update `vllm_ascend/worker/v2/sample/gumbel.py` to accept `use_fp64`.
   - Update `vllm_ascend/worker/v2/spec_decode/rejection_sampler_utils.py` to accept `use_fp64`.
   - Raise `NotImplementedError` for `use_fp64=True` on NPU because the current NPU Triton path does not support FP64 Gumbel / rejection sampling.

2. vLLM PR: vllm-project/vllm#41162 `[Model Runner V2] Rebuild attn metadata between draft decode steps`

   Upstream changes:
   - Adds `output_processed_logits` / `output_processed_logits_col` protocol for EAGLE draft sampling.
   - Uses `output_processed_logits_col` to select the current speculative draft step when writing draft logits.

   vllm-ascend adaptation:
   - Add `output_processed_logits` and `output_processed_logits_col` support in NPU `gumbel_sample`.
   - Store processed logits before adding Gumbel noise so rejection sampling can consume the draft logits.

3. vLLM PR: vllm-project/vllm#42692 `[Bugfix] DFlash FP8 KV-Cache`

   Upstream changes:
   - Changes `SpecDecodeBaseProposer.token_arange_np` from default NumPy integer dtype to `np.int32`.

   vllm-ascend adaptation:
   - Update `vllm_ascend/spec_decode/eagle_proposer.py` to use `np.arange(..., dtype=np.int32)`.
   - Keep the Ascend-specific `max_num_tokens + 1` behavior for `query_start_loc_cpu[:batch_size + 1]`.

4. No direct upstream vLLM PR

   vllm-ascend adaptation:
   - Fix `OP_LOGE` / `OP_LOGW` format warnings in:
     - `csrc/moe/chunk_fwd_o/tiling_base/error_log.h`
     - `csrc/moe/chunk_gated_delta_rule_fwd_h/tiling_base/error_log.h`

   Reason:
   - The old macros passed `std::string` to `%s`, which fails when the current build treats format warnings as errors.

### Does this PR introduce _any_ user-facing change?

No. This PR is for main2main compatibility validation and internal NPU sampling / speculative decoding adaptation.

One behavior to note: if users explicitly enable `--use-fp64-gumbel` on NPU, vllm-ascend will raise `NotImplementedError`.

### How was this patch tested?

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1
- Validation focus:
  - NPU Gumbel sampling
  - EAGLE / MTP speculative decoding
  - rejection sampling
  - custom op build

---------

Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
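The NPU behavior noted above (rejecting `use_fp64=True` rather than silently falling back) could look roughly like the guard below. This is a hypothetical sketch: `select_gumbel_dtype` and the `platform` string are illustrative names, not functions from vllm-ascend.

```python
def select_gumbel_dtype(use_fp64: bool, platform: str) -> str:
    """Pick the sampling precision, failing loudly where FP64 is unsupported."""
    if use_fp64 and platform == "npu":
        # Assumed behavior: refuse the option instead of downgrading to FP32.
        raise NotImplementedError(
            "FP64 Gumbel sampling is not supported on this platform"
        )
    return "float64" if use_fp64 else "float32"
```

Raising here makes the limitation visible at startup, so a user who asked for FP64 does not unknowingly get FP32 results.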


fp32 gumbel

cc17d01

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

PatchouliTIS marked this pull request as ready for review May 6, 2026 02:35

PatchouliTIS requested review from WoosukKwon and njhill as code owners May 6, 2026 02:35

claude Bot reviewed May 6, 2026


Merge branch 'main' into patchy/fp32_gumbel_pr

3ebe166

mergify Bot added the v1 label May 6, 2026

gemini-code-assist Bot reviewed May 6, 2026


Comment thread vllm/v1/worker/gpu/sample/gumbel.py

PatchouliTaisa added 2 commits May 6, 2026 16:34

bug fix

1e5daac

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

Merge branch 'patchy/fp32_gumbel_pr' of https://github.com/PatchouliT…

3c94f34

…IS/vllm into patchy/fp32_gumbel_pr

PatchouliTIS mentioned this pull request May 9, 2026

[ModelRunner V2] Speculative Decoding NGram GPU Implementations #40704

Open

4 tasks

Merge branch 'main' into patchy/fp32_gumbel_pr

32665a1

njhill added the verified Run pre-commit for new contributors without triggering other tests label May 12, 2026

njhill reviewed May 12, 2026


TheEpicDolphin approved these changes May 12, 2026


TheEpicDolphin reviewed May 12, 2026


Comment thread vllm/v1/worker/gpu/sample/gumbel.py Outdated

WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label May 12, 2026

PatchouliTaisa added 2 commits May 13, 2026 15:16

fix comments

94efd23

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

Merge branch 'patchy/fp32_gumbel_pr' of github.com:PatchouliTIS/vllm …

06b1dad

…into patchy/fp32_gumbel_pr

PatchouliTIS requested review from ProExpertProg, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners May 13, 2026 07:17

mergify Bot added the needs-rebase label May 13, 2026

mergify Bot removed the needs-rebase label May 13, 2026

Merge branch 'main' into patchy/fp32_gumbel_pr

b12b9e0

PatchouliTaisa and others added 6 commits May 13, 2026 16:15

pre-commits fixed

8e1ff0e

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

Merge branch 'patchy/fp32_gumbel_pr' of github.com:PatchouliTIS/vllm …

e8e1c5c

…into patchy/fp32_gumbel_pr

fp32 type fix for CPU-only

5294f6b

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

type fix

9a74267

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

Merge branch 'main' into patchy/fp32_gumbel_pr

4bfc509

Merge branch 'main' into patchy/fp32_gumbel_pr

8b0b22b

TheEpicDolphin reviewed May 14, 2026


WoosukKwon approved these changes May 14, 2026


PatchouliTaisa added 2 commits May 15, 2026 09:38

remove irrelevant files

26ec888

Signed-off-by: PatchouliTaisa <patchychen@tencent.com>

Merge branch 'patchy/fp32_gumbel_pr' of github.com:PatchouliTIS/vllm …

387c595

…into patchy/fp32_gumbel_pr

WoosukKwon merged commit 0162596 into vllm-project:main May 15, 2026
79 checks passed

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026

[Model Runner V2] FP32 gumbel sampling. (vllm-project#41775)

1b473a1

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026

[Model Runner V2] FP32 gumbel sampling. (vllm-project#41775)

66c2b73

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Model Runner V2] FP32 gumbel sampling. (vllm-project#41775)

0a1eed9

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>

zhao-stack mentioned this pull request May 20, 2026

[Misc] main2main 0519 vllm-project/vllm-ascend#9238

Merged

njhill added the v2 label May 20, 2026

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

[Model Runner V2] FP32 gumbel sampling. (vllm-project#41775)

66c42ed

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>

andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026

[Model Runner V2] FP32 gumbel sampling. (vllm-project#41775)

6e10781

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[Model Runner V2] FP32 gumbel sampling. (vllm-project#41775)

80f25da

Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>
