[Model Runner V2] FP32 gumbel sampling. #41775
Signed-off-by: PatchouliTaisa <patchychen@tencent.com>
Code Review
This pull request introduces an optimization for the Gumbel sampler by allowing it to operate in FP32 instead of FP64, controlled by a new environment variable VLLM_SAMPLER_FP64_GUMBEL. While the high-level functions and the gumbel_block_argmax helper were updated, the _gumbel_sample_kernel signature was not modified to accept the new USE_FP64 parameter, which will lead to a TypeError at runtime.
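The failure mode described in this review can be illustrated in plain Python, without Triton. This is only a sketch: `kernel_without_flag`, `kernel_with_flag`, and `launch` are hypothetical stand-ins, not the actual `_gumbel_sample_kernel` or its launch code. It shows that forwarding a keyword argument the callee's signature does not declare raises `TypeError`.

```python
def kernel_without_flag(logits, seed):
    """Hypothetical stand-in for a kernel whose signature was not updated."""
    return max(range(len(logits)), key=logits.__getitem__)


def kernel_with_flag(logits, seed, USE_FP64=False):
    """The same stand-in after the new parameter is added to the signature."""
    return max(range(len(logits)), key=logits.__getitem__)


def launch(kernel, logits, seed, use_fp64):
    # The caller has already been updated to forward the new flag by keyword.
    return kernel(logits, seed, USE_FP64=use_fp64)


try:
    launch(kernel_without_flag, [0.1, 2.0, -1.0], seed=0, use_fp64=False)
    error = None
except TypeError as exc:
    # The callee does not accept USE_FP64, so the call fails before running.
    error = exc

# With the parameter declared, the same call succeeds.
result = launch(kernel_with_flag, [0.1, 2.0, -1.0], seed=0, use_fp64=False)
```

Adding the parameter to the callee's signature, as in `kernel_with_flag`, is the kind of fix the review asks for.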
…IS/vllm into patchy/fp32_gumbel_pr
njhill
left a comment
Thanks @PatchouliTIS, this looks nice to me.
TheEpicDolphin
left a comment
Changes look good to me too! I think using fp32 by default here is definitely the right move. Can you please double check that draft acceptance rates don't regress from using fp32 vs fp64?
Given this is not a temporary flag, what about having an engine flag ( |
Signed-off-by: PatchouliTaisa <patchychen@tencent.com>
…into patchy/fp32_gumbel_pr
This pull request has merge conflicts that must be resolved before it can be |
Move the env variable into an engine flag.
Hi @PatchouliTIS, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: PatchouliTaisa <patchychen@tencent.com>
…into patchy/fp32_gumbel_pr
Signed-off-by: PatchouliTaisa <patchychen@tencent.com>
| Requirement already satisfied: pre-commit in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (4.5.1) | ||
| Requirement already satisfied: cfgv>=2.0.0 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (3.5.0) | ||
| Requirement already satisfied: identify>=1.0.0 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (2.6.16) | ||
| Requirement already satisfied: nodeenv>=0.11.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (1.10.0) | ||
| Requirement already satisfied: pyyaml>=5.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (6.0.3) | ||
| Requirement already satisfied: virtualenv>=20.10.0 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from pre-commit) (20.36.1) | ||
| Requirement already satisfied: distlib<1,>=0.3.7 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from virtualenv>=20.10.0->pre-commit) (0.4.0) | ||
| Requirement already satisfied: filelock<4,>=3.20.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from virtualenv>=20.10.0->pre-commit) (3.20.3) | ||
| Requirement already satisfied: platformdirs<5,>=3.9.1 in /data/home/patchychen/miniconda3/lib/python3.13/site-packages (from virtualenv>=20.10.0->pre-commit) (4.5.0) |
i think this file was added by accident.
irrelevant file removed.
WoosukKwon
left a comment
@PatchouliTIS LGTM. Can you please remove the redundant file so that we can merge?
Signed-off-by: PatchouliTaisa <patchychen@tencent.com>
…into patchy/fp32_gumbel_pr
Done, irrelevant files removed.
Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com>
### What this PR does / why we need it?

This PR updates vllm-ascend main2main validation to:

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1

Main upstream changes and vllm-ascend adaptations:

1. vLLM PR: vllm-project/vllm#41775 `[Model Runner V2] FP32 gumbel sampling`

   Upstream changes:
   - Adds `use_fp64_gumbel` config / argument.
   - Changes Gumbel sampling and rejection sampling paths to accept `use_fp64`.
   - Makes FP32 Gumbel the default path and keeps FP64 optional.

   vllm-ascend adaptation:
   - Update `vllm_ascend/worker/v2/sample/gumbel.py` to accept `use_fp64`.
   - Update `vllm_ascend/worker/v2/spec_decode/rejection_sampler_utils.py` to accept `use_fp64`.
   - Raise `NotImplementedError` for `use_fp64=True` on NPU because the current NPU Triton path does not support FP64 Gumbel / rejection sampling.

2. vLLM PR: vllm-project/vllm#41162 `[Model Runner V2] Rebuild attn metadata between draft decode steps`

   Upstream changes:
   - Adds `output_processed_logits` / `output_processed_logits_col` protocol for EAGLE draft sampling.
   - Uses `output_processed_logits_col` to select the current speculative draft step when writing draft logits.

   vllm-ascend adaptation:
   - Add `output_processed_logits` and `output_processed_logits_col` support in NPU `gumbel_sample`.
   - Store processed logits before adding Gumbel noise so rejection sampling can consume the draft logits.

3. vLLM PR: vllm-project/vllm#42692 `[Bugfix] DFlash FP8 KV-Cache`

   Upstream changes:
   - Changes `SpecDecodeBaseProposer.token_arange_np` from default NumPy integer dtype to `np.int32`.

   vllm-ascend adaptation:
   - Update `vllm_ascend/spec_decode/eagle_proposer.py` to use `np.arange(..., dtype=np.int32)`.
   - Keep the Ascend-specific `max_num_tokens + 1` behavior for `query_start_loc_cpu[:batch_size + 1]`.

4. No direct upstream vLLM PR

   vllm-ascend adaptation:
   - Fix `OP_LOGE` / `OP_LOGW` format warnings in:
     - `csrc/moe/chunk_fwd_o/tiling_base/error_log.h`
     - `csrc/moe/chunk_gated_delta_rule_fwd_h/tiling_base/error_log.h`

   Reason:
   - The old macros passed `std::string` to `%s`, which fails when the current build treats format warnings as errors.

### Does this PR introduce _any_ user-facing change?

No. This PR is for main2main compatibility validation and internal NPU sampling / speculative decoding adaptation.

One behavior to note: if users explicitly enable `--use-fp64-gumbel` on NPU, vllm-ascend will raise `NotImplementedError`.

### How was this patch tested?

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1
- Validation focus:
  - NPU Gumbel sampling
  - EAGLE / MTP speculative decoding
  - rejection sampling
  - custom op build

---------

Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
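The NPU-side behavior described in items 1 and 2 of the vllm-ascend description can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the vllm-ascend implementation: the function name `gumbel_sample_sketch`, the nested-list buffer layout (`[request][draft_step]`), and treating temperature scaling as the only "processing" step are all hypothetical choices made for the example.

```python
import math
import random


def gumbel_sample_sketch(logits, temperature, rng, use_fp64=False,
                         output_processed_logits=None,
                         output_processed_logits_col=0):
    """Hypothetical sketch: sample one token per request with Gumbel-max."""
    if use_fp64:
        # Mirrors the described behavior: the FP64 path is rejected outright.
        raise NotImplementedError("FP64 Gumbel sampling is not supported here")

    sampled = []
    for req_idx, row in enumerate(logits):
        # Assumption: "processing" is just temperature scaling in this sketch.
        processed = [x / temperature for x in row]

        # Store the processed logits BEFORE noise is added, in the column
        # that corresponds to the current speculative draft step.
        if output_processed_logits is not None:
            col = output_processed_logits_col
            output_processed_logits[req_idx][col] = list(processed)

        # Add Gumbel noise and take the argmax (u is clamped away from zero).
        noisy = [
            x - math.log(-math.log(max(rng.random(), 1e-10)))
            for x in processed
        ]
        sampled.append(max(range(len(noisy)), key=noisy.__getitem__))
    return sampled
```

Because the buffer is written before the noise is added, a later rejection-sampling step would see the draft distribution's logits rather than the perturbed scores.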
Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
Signed-off-by: shenzhao <shenzhao9@huawei.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: yilunh <hanyilun1@huawei.com>
Signed-off-by: PatchouliTaisa <patchychen@tencent.com> Co-authored-by: PatchouliTaisa <patchychen@tencent.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Purpose
Profile results on H20
Optimized Gumbel sampling precision behavior
In:
- vllm/v1/worker/gpu/sample/gumbel.py
- vllm/envs.py

This PR makes FP64 use in Gumbel sampling optional rather than mandatory. FP32 becomes the default, faster option unless FP64 is explicitly enabled, and the code avoids numerical instability by ensuring the uniform random value u is never zero before applying -log(-log(u)).

Nsys profile:
[Figure: Nsys profile, FP64 sampler]
[Figure: Nsys profile, FP32 sampler]
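The precision behavior described under Purpose can be sketched with the standard library alone. The code below is an illustrative approximation, not the Triton kernel from this PR: FP32 is emulated by round-tripping through `struct`, and the clamp bounds (`FP32_TINY`, `FP32_BELOW_ONE`) as well as the function names are assumptions chosen for the sketch. It shows the Gumbel-max trick, where taking the argmax of `logit + (-log(-log(u)))` samples from the softmax of the logits.

```python
import math
import random
import struct

# Assumed clamp bounds for this sketch (both exactly representable in FP32).
FP32_TINY = 2.0 ** -126            # smallest positive normal float32
FP32_BELOW_ONE = 1.0 - 2.0 ** -24  # largest float32 strictly below 1.0


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gumbel_noise(u, use_fp64=False):
    """Return -log(-log(u)), keeping u strictly inside (0, 1)."""
    # Clamping avoids log(0) when u == 0 and log(-log(1)) when u rounds to 1.
    u = min(max(u, FP32_TINY), FP32_BELOW_ONE)
    if not use_fp64:
        u = to_fp32(u)
    g = -math.log(-math.log(u))
    return g if use_fp64 else to_fp32(g)


def gumbel_argmax(logits, rng, temperature=1.0, use_fp64=False):
    """Sample an index with the Gumbel-max trick."""
    best_idx, best_score = -1, -math.inf
    for i, logit in enumerate(logits):
        score = logit / temperature + gumbel_noise(rng.random(), use_fp64)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

For example, `gumbel_argmax([0.0, 1.0, 2.0], random.Random(0))` returns an index in `{0, 1, 2}`, and over many draws the index frequencies approach the softmax of the logits in both the FP32 and FP64 modes of this sketch.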
Test Plan
Test Result