[Bugfix] DFlash FP8 KV-Cache #42692
Merged
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
benchislett
commented
May 15, 2026
Contributor
Code Review
This pull request updates the DFlashQwen3DecoderLayer initialization in qwen3_dflash.py to include cache_config and quant_config. Additionally, it modifies llm_base_proposer.py to explicitly set the data type of token_arange_np to np.int32. I have no feedback to provide as there are no review comments.
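The dtype change summarized above can be illustrated with plain NumPy. This is only a minimal sketch, not the code from `llm_base_proposer.py`: `max_num_tokens` and the variable names are hypothetical, and it assumes the mismatch arises because `np.arange` without a `dtype` uses NumPy's default integer type, which is 64-bit on most 64-bit platforms.

```python
import numpy as np

max_num_tokens = 8  # hypothetical size, for illustration only

# Without an explicit dtype, NumPy picks its default integer type,
# which is 64-bit on most 64-bit platforms.
default_arange = np.arange(max_num_tokens)

# Pinning the dtype keeps the host-side buffer at 32 bits on every platform,
# so it matches an int32 buffer used elsewhere.
int32_arange = np.arange(max_num_tokens, dtype=np.int32)

print(default_arange.dtype, int32_arange.dtype)
print(default_arange.itemsize, int32_arange.itemsize)
```

Both arrays hold the same values; only the element width differs, which matters when a kernel or a copy expects a specific integer width.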
omerpaz95
pushed a commit
to omerpaz95/vllm
that referenced
this pull request
May 18, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
omerpaz95
pushed a commit
to omerpaz95/vllm
that referenced
this pull request
May 18, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
mfylcek
pushed a commit
to mfylcek/vllm
that referenced
this pull request
May 19, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
jhu960213
pushed a commit
to jhu960213/vllm
that referenced
this pull request
May 20, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
h1t35h
pushed a commit
to h1t35h/vllm
that referenced
this pull request
May 21, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
wangxiyuan
pushed a commit
to vllm-project/vllm-ascend
that referenced
this pull request
May 25, 2026
### What this PR does / why we need it?

This PR updates vllm-ascend main2main validation to:

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1

Main upstream changes and vllm-ascend adaptations:

1. vLLM PR: vllm-project/vllm#41775 `[Model Runner V2] FP32 gumbel sampling`

   Upstream changes:
   - Adds `use_fp64_gumbel` config / argument.
   - Changes Gumbel sampling and rejection sampling paths to accept `use_fp64`.
   - Makes FP32 Gumbel the default path and keeps FP64 optional.

   vllm-ascend adaptation:
   - Update `vllm_ascend/worker/v2/sample/gumbel.py` to accept `use_fp64`.
   - Update `vllm_ascend/worker/v2/spec_decode/rejection_sampler_utils.py` to accept `use_fp64`.
   - Raise `NotImplementedError` for `use_fp64=True` on NPU because the current NPU Triton path does not support FP64 Gumbel / rejection sampling.

2. vLLM PR: vllm-project/vllm#41162 `[Model Runner V2] Rebuild attn metadata between draft decode steps`

   Upstream changes:
   - Adds `output_processed_logits` / `output_processed_logits_col` protocol for EAGLE draft sampling.
   - Uses `output_processed_logits_col` to select the current speculative draft step when writing draft logits.

   vllm-ascend adaptation:
   - Add `output_processed_logits` and `output_processed_logits_col` support in NPU `gumbel_sample`.
   - Store processed logits before adding Gumbel noise so rejection sampling can consume the draft logits.

3. vLLM PR: vllm-project/vllm#42692 `[Bugfix] DFlash FP8 KV-Cache`

   Upstream changes:
   - Changes `SpecDecodeBaseProposer.token_arange_np` from default NumPy integer dtype to `np.int32`.

   vllm-ascend adaptation:
   - Update `vllm_ascend/spec_decode/eagle_proposer.py` to use `np.arange(..., dtype=np.int32)`.
   - Keep the Ascend-specific `max_num_tokens + 1` behavior for `query_start_loc_cpu[:batch_size + 1]`.

4. No direct upstream vLLM PR

   vllm-ascend adaptation:
   - Fix `OP_LOGE` / `OP_LOGW` format warnings in:
     - `csrc/moe/chunk_fwd_o/tiling_base/error_log.h`
     - `csrc/moe/chunk_gated_delta_rule_fwd_h/tiling_base/error_log.h`

   Reason:
   - The old macros passed `std::string` to `%s`, which fails when the current build treats format warnings as errors.

### Does this PR introduce _any_ user-facing change?

No. This PR is for main2main compatibility validation and internal NPU sampling / speculative decoding adaptation.

One behavior to note: if users explicitly enable `--use-fp64-gumbel` on NPU, vllm-ascend will raise `NotImplementedError`.

### How was this patch tested?

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1
- Validation focus:
  - NPU Gumbel sampling
  - EAGLE / MTP speculative decoding
  - rejection sampling
  - custom op build

---------

Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
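The Gumbel sampling adaptation described in this commit message can be sketched in NumPy. This is an illustrative reference only, not the vllm-ascend Triton kernel: the function signature, the return values, and the `fp64_supported` parameter are hypothetical and merely mimic the described behavior (FP32 by default, `NotImplementedError` when FP64 is requested on a backend that lacks it, and processed logits captured before noise is added).

```python
import numpy as np

def gumbel_sample(logits, rng, use_fp64=False, fp64_supported=True):
    """Gumbel-max sampling: argmax(logits + G), with G ~ Gumbel(0, 1).

    Returns the sampled token ids and the noise-free logits.
    `fp64_supported` is a hypothetical stand-in for a backend capability check.
    """
    if use_fp64 and not fp64_supported:
        raise NotImplementedError("FP64 Gumbel sampling is not supported here")

    dtype = np.float64 if use_fp64 else np.float32
    # Capture the logits before adding noise so a later rejection-sampling
    # step can consume the unperturbed draft distribution.
    processed_logits = logits.astype(dtype)

    # Uniform samples in [0, 1), clamped away from 0 to avoid log(0).
    u = rng.random(processed_logits.shape, dtype=dtype)
    u = np.maximum(u, np.finfo(dtype).tiny)
    gumbel_noise = -np.log(-np.log(u))

    tokens = np.argmax(processed_logits + gumbel_noise, axis=-1)
    return tokens, processed_logits

rng = np.random.default_rng(0)
logits = np.array([[0.0, 50.0, 0.0]], dtype=np.float32)
tokens, processed = gumbel_sample(logits, rng)
print(tokens, processed.dtype)
```

Returning the noise-free logits separately mirrors the idea of storing processed logits before the noise step; how the real kernel writes them out (for example via `output_processed_logits_col`) is not modeled here.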
Liuweixiong0118
pushed a commit
to Liuweixiong0118/vllm
that referenced
this pull request
Jun 1, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
yilunh998
pushed a commit
to yilunh998/vllm-ascend
that referenced
this pull request
Jun 2, 2026
Signed-off-by: shenzhao <shenzhao9@huawei.com> Co-authored-by: shenzhao <shenzhao9@huawei.com> Signed-off-by: yilunh <hanyilun1@huawei.com>
mvanhorn
pushed a commit
to mvanhorn/vllm
that referenced
this pull request
Jun 4, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
andakai
pushed a commit
to andakai/vllm
that referenced
this pull request
Jun 4, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
knight0528
pushed a commit
to knight0528/vllm
that referenced
this pull request
Jun 8, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
LostFox11
pushed a commit
to LostFox11/vllm-ascend
that referenced
this pull request
Jun 15, 2026
Signed-off-by: shenzhao <shenzhao9@huawei.com> Co-authored-by: shenzhao <shenzhao9@huawei.com>
LostFox11
pushed a commit
to LostFox11/vllm-ascend
that referenced
this pull request
Jun 15, 2026
Signed-off-by: shenzhao <shenzhao9@huawei.com> Co-authored-by: shenzhao <shenzhao9@huawei.com>
Purpose
There are a few crashes that happen when we run DFlash with FP8 KV Cache. Here's why:
- Neither quant_config nor cache_config was passed to the DFlash decoder layer.
- self.token_arange_np has a dtype mismatch with self.arange in the DFlash setup kernel, so I fix the type for consistency.

Testing
Currently, there are no supported attention backend combinations with non-causal + FP8 KV-Cache support, see #41559.
In the meantime, I tested this PR by manually disabling the causal attention requirement in DFlash. Notably I still saw very high acceptance rates, >4 AL on GSM8k and a passing score for Qwen3.5 35B FP8.
    python3 -m \
      vllm.entrypoints.cli.main serve \
      Qwen/Qwen3.6-35B-A3B-FP8 \
      --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 7}' \
      --kv-cache-dtype fp8 \
      --tensor-parallel-size 1 \
      --max-model-len 8192 --max-num-batched-tokens 16384 --max-num-seqs 16

baseline, BF16 with no specdec: