[Bugfix] DFlash FP8 KV-Cache by benchislett · Pull Request #42692 · vllm-project/vllm

benchislett · 2026-05-15T00:43:55Z

Purpose

Running DFlash with an FP8 KV cache currently crashes for two reasons:

The Attention quant_config and cache_config are not propagated to the DFlash decoder layer.
self.token_arange_np has a dtype mismatch with self.arange in the DFlash setup kernel. This PR sets the dtype explicitly for consistency.
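
The dtype mismatch described above can be illustrated with a minimal NumPy sketch. The buffer names, the size, and the `require_same_dtype` helper are hypothetical stand-ins, not the proposer's actual attributes; the only detail taken from this PR is that the arange is created with `dtype=np.int32`.

```python
import numpy as np

MAX_NUM_TOKENS = 8  # hypothetical buffer size, for illustration only

# Without an explicit dtype, np.arange uses the platform default integer type,
# which is commonly wider than int32.
default_arange = np.arange(MAX_NUM_TOKENS)

# Pinning the dtype makes the host-side buffer match an int32 index buffer.
fixed_arange = np.arange(MAX_NUM_TOKENS, dtype=np.int32)

# Stand-in for an int32 index buffer consumed by a kernel.
int32_buffer = np.zeros(MAX_NUM_TOKENS, dtype=np.int32)


def require_same_dtype(a: np.ndarray, b: np.ndarray) -> None:
    """Hypothetical guard: raise if two buffers disagree on dtype."""
    if a.dtype != b.dtype:
        raise TypeError(f"dtype mismatch: {a.dtype} vs {b.dtype}")


require_same_dtype(fixed_arange, int32_buffer)  # passes: both are int32
print(default_arange.dtype, fixed_arange.dtype)
```

Both arrays hold the same values; only the element width differs, which is why the mismatch can go unnoticed until a consumer checks the dtype.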

Testing

Currently, no supported attention backend combination offers both non-causal attention and FP8 KV-cache support (see #41559).

In the meantime, I tested this PR by manually disabling the causal attention requirement in DFlash. Notably, acceptance remained very high: an acceptance length (AL) above 4 on GSM8K, with a passing score for Qwen3.5 35B FP8.

python3 -m \
  vllm.entrypoints.cli.main serve \
  Qwen/Qwen3.6-35B-A3B-FP8 \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 7}' \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 --max-num-batched-tokens 16384 --max-num-seqs 16

(.venv) bchislett@bchislett-ldt:~/Repos/vllm$ python3 tests/evals/gsm8k/gsm8k_eval.py
Running GSM8K evaluation: 1319 questions, 5-shot

Results:
Accuracy: 0.757
Invalid responses: 0.000
Total latency: 110.351 s
Questions per second: 11.953
Total output tokens: 181264
Output tokens per second: 1642.608

Baseline (BF16, no speculative decoding):

Results:
Accuracy: 0.764
Invalid responses: 0.000
Total latency: 160.383 s
Questions per second: 8.224
Total output tokens: 174474
Output tokens per second: 1087.860
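
For a quick comparison of the two runs above, the ratios can be computed directly from the reported numbers. This is only arithmetic on the figures in this description; the dictionary keys are illustrative names, and the two runs differ in more than one setting, so the ratios are indicative rather than a controlled benchmark.

```python
# Figures copied from the two GSM8K result blocks above.
fp8_dflash = {"accuracy": 0.757, "latency_s": 110.351, "output_tok_per_s": 1642.608}
baseline = {"accuracy": 0.764, "latency_s": 160.383, "output_tok_per_s": 1087.860}

# Ratios greater than 1 mean the DFlash + FP8 KV-cache run was faster.
throughput_speedup = fp8_dflash["output_tok_per_s"] / baseline["output_tok_per_s"]
latency_speedup = baseline["latency_s"] / fp8_dflash["latency_s"]
accuracy_delta = fp8_dflash["accuracy"] - baseline["accuracy"]

print(f"throughput speedup: {throughput_speedup:.2f}x")  # 1.51x
print(f"latency speedup:    {latency_speedup:.2f}x")     # 1.45x
print(f"accuracy delta:     {accuracy_delta:+.3f}")      # -0.007
```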

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request updates the DFlashQwen3DecoderLayer initialization in qwen3_dflash.py to include cache_config and quant_config. Additionally, it modifies llm_base_proposer.py to explicitly set the data type of token_arange_np to np.int32. I have no feedback to provide as there are no review comments.
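
The config propagation summarized above can be sketched in plain Python. Every class here is a simplified, hypothetical stand-in rather than vLLM's actual `DFlashQwen3DecoderLayer` or attention implementation; the sketch only shows why a layer that does not forward `cache_config` and `quant_config` leaves its attention module on default settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheConfig:
    """Hypothetical stand-in for a cache configuration object."""
    cache_dtype: str = "auto"


@dataclass
class QuantConfig:
    """Hypothetical stand-in for a quantization configuration object."""
    name: str = "fp8"


class Attention:
    def __init__(self, cache_config: Optional[CacheConfig] = None,
                 quant_config: Optional[QuantConfig] = None) -> None:
        # With no cache_config, the layer silently falls back to the default dtype.
        self.kv_cache_dtype = cache_config.cache_dtype if cache_config else "auto"
        self.quant_config = quant_config


class DecoderLayer:
    def __init__(self, cache_config: Optional[CacheConfig] = None,
                 quant_config: Optional[QuantConfig] = None) -> None:
        # The fix pattern: forward both configs instead of dropping them here.
        self.self_attn = Attention(cache_config=cache_config,
                                   quant_config=quant_config)


class DraftModel:
    def __init__(self, num_layers: int, cache_config: CacheConfig,
                 quant_config: QuantConfig) -> None:
        self.layers = [
            DecoderLayer(cache_config=cache_config, quant_config=quant_config)
            for _ in range(num_layers)
        ]


model = DraftModel(2, CacheConfig(cache_dtype="fp8"), QuantConfig())
print(model.layers[0].self_attn.kv_cache_dtype)  # fp8
```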

mgoin

Looks reasonable to me

### What this PR does / why we need it?

This PR updates vllm-ascend main2main validation to:

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1

Main upstream changes and vllm-ascend adaptations:

1. vLLM PR: vllm-project/vllm#41775 `[Model Runner V2] FP32 gumbel sampling`

   Upstream changes:
   - Adds `use_fp64_gumbel` config / argument.
   - Changes Gumbel sampling and rejection sampling paths to accept `use_fp64`.
   - Makes FP32 Gumbel the default path and keeps FP64 optional.

   vllm-ascend adaptation:
   - Update `vllm_ascend/worker/v2/sample/gumbel.py` to accept `use_fp64`.
   - Update `vllm_ascend/worker/v2/spec_decode/rejection_sampler_utils.py` to accept `use_fp64`.
   - Raise `NotImplementedError` for `use_fp64=True` on NPU because the current NPU Triton path does not support FP64 Gumbel / rejection sampling.

2. vLLM PR: vllm-project/vllm#41162 `[Model Runner V2] Rebuild attn metadata between draft decode steps`

   Upstream changes:
   - Adds `output_processed_logits` / `output_processed_logits_col` protocol for EAGLE draft sampling.
   - Uses `output_processed_logits_col` to select the current speculative draft step when writing draft logits.

   vllm-ascend adaptation:
   - Add `output_processed_logits` and `output_processed_logits_col` support in NPU `gumbel_sample`.
   - Store processed logits before adding Gumbel noise so rejection sampling can consume the draft logits.

3. vLLM PR: vllm-project/vllm#42692 `[Bugfix] DFlash FP8 KV-Cache`

   Upstream changes:
   - Changes `SpecDecodeBaseProposer.token_arange_np` from default NumPy integer dtype to `np.int32`.

   vllm-ascend adaptation:
   - Update `vllm_ascend/spec_decode/eagle_proposer.py` to use `np.arange(..., dtype=np.int32)`.
   - Keep the Ascend-specific `max_num_tokens + 1` behavior for `query_start_loc_cpu[:batch_size + 1]`.

4. No direct upstream vLLM PR

   vllm-ascend adaptation:
   - Fix `OP_LOGE` / `OP_LOGW` format warnings in:
     - `csrc/moe/chunk_fwd_o/tiling_base/error_log.h`
     - `csrc/moe/chunk_gated_delta_rule_fwd_h/tiling_base/error_log.h`

   Reason:
   - The old macros passed `std::string` to `%s`, which fails when the current build treats format warnings as errors.

### Does this PR introduce _any_ user-facing change?

No. This PR is for main2main compatibility validation and internal NPU sampling / speculative decoding adaptation. One behavior to note: if users explicitly enable `--use-fp64-gumbel` on NPU, vllm-ascend will raise `NotImplementedError`.

### How was this patch tested?

- vLLM version: `v0.20.2`
- vLLM main commit: vllm-project/vllm@1ac10f1
- vLLM diff: vllm-project/vllm@0d4d334...1ac10f1
- Validation focus:
  - NPU Gumbel sampling
  - EAGLE / MTP speculative decoding
  - rejection sampling
  - custom op build

---------

Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
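
Items 1 and 2 above mention rejecting `use_fp64=True` and storing processed logits before Gumbel noise is added. A minimal pure-Python sketch of that ordering follows. It uses the standard Gumbel-max trick on plain lists; the function signature and the `processed_logits_out` parameter are hypothetical and are not the vllm-ascend `gumbel_sample` interface.

```python
import math
import random


def gumbel_sample(logits, rng, use_fp64=False, processed_logits_out=None):
    """Gumbel-max sampling sketch (hypothetical signature, illustration only)."""
    if use_fp64:
        # Mirrors the described behavior: the unsupported path fails loudly.
        raise NotImplementedError("FP64 Gumbel sampling is not supported here")
    if processed_logits_out is not None:
        # Save the clean logits first, so a later rejection-sampling step
        # sees the draft distribution rather than the noised scores.
        processed_logits_out[:] = logits
    best_index, best_score = -1, -math.inf
    for index, logit in enumerate(logits):
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        score = logit - math.log(-math.log(u))  # logit + Gumbel(0, 1) noise
        if score > best_score:
            best_index, best_score = index, score
    return best_index


rng = random.Random(0)
draft_logits = [0.0, 100.0, 0.0]
saved_logits = [0.0] * len(draft_logits)
token = gumbel_sample(draft_logits, rng, processed_logits_out=saved_logits)
print(token, saved_logits)
```

Taking the argmax of the noised scores draws a token from the softmax of the logits, while the saved copy stays free of noise.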


propagate cache and quant config to DFlash attention

748def0

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

benchislett requested review from MatthewBonanni, luccafong, sighingnow and vadiklyutiy as code owners May 15, 2026 00:43

claude Bot reviewed May 15, 2026

benchislett added the ready and dflash labels May 15, 2026

benchislett commented May 15, 2026

Comment thread vllm/model_executor/models/qwen3_dflash.py

mergify Bot added the qwen, speculative-decoding, v1, and bug labels May 15, 2026

gemini-code-assist Bot reviewed May 15, 2026

mgoin approved these changes May 15, 2026

mgoin merged commit 0fe7550 into vllm-project:main May 15, 2026
73 of 74 checks passed

benchislett deleted the dflash-fp8-kv-cache branch May 15, 2026 15:31

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

fc317e1

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

72cef6d

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

0ca4131

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

zhao-stack mentioned this pull request May 20, 2026

[Misc] main2main 0519 vllm-project/vllm-ascend#9238

Merged

gq112 mentioned this pull request May 20, 2026

[Spec Decode] Add FlashInfer metadata grouping for DFlash SWA #43200

Open

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

b49c0e4

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

7e16496

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

Liuweixiong0118 pushed a commit to Liuweixiong0118/vllm that referenced this pull request Jun 1, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

a5a8642

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>

mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

107a80b

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

30aecb8

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[Bugfix] DFlash FP8 KV-Cache (vllm-project#42692)

944a110

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

mgoin merged 1 commit into vllm-project:main from CentML:dflash-fp8-kv-cache
