[Bugfix] DFlash FP8 KV-Cache#42692

Merged
mgoin merged 1 commit into
vllm-project:mainfrom
CentML:dflash-fp8-kv-cache
May 15, 2026
@benchislett benchislett commented May 15, 2026

Member

Purpose

Running DFlash with an FP8 KV cache currently triggers a few crashes, for two reasons:

  • The Attention quant_config and cache_config are not currently propagated to the DFlash decoder layer.
  • self.token_arange_np has a dtype mismatch with self.arange in the DFlash setup kernel, so this PR sets its dtype explicitly for consistency.
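
To illustrate the second point, here is a minimal NumPy-only sketch of how an arange created without an explicit dtype can disagree with an int32 buffer. The variable names `max_num_tokens` and `int32_buffer` are hypothetical stand-ins, and the snippet does not reproduce the actual DFlash setup kernel; it only shows the dtype behavior that an explicit `dtype=np.int32` avoids.

```python
import numpy as np

max_num_tokens = 8  # hypothetical small size, for illustration only

# Without an explicit dtype, NumPy uses its platform default integer type
# (int64 on most 64-bit Linux builds).
default_arange = np.arange(max_num_tokens)

# Passing dtype=np.int32 pins the element type regardless of platform.
token_arange_np = np.arange(max_num_tokens, dtype=np.int32)

# Hypothetical stand-in for an int32 index buffer used alongside the arange.
int32_buffer = np.zeros(max_num_tokens, dtype=np.int32)

# int32 combined with int32 stays int32; mixing in the default dtype
# follows NumPy's promotion rules and may widen the result.
same_dtype_sum = int32_buffer + token_arange_np
mixed_sum = int32_buffer + default_arange

print("default arange dtype:", default_arange.dtype)
print("explicit arange dtype:", token_arange_np.dtype)
print("int32 + explicit:", same_dtype_sum.dtype)
print("int32 + default:", mixed_sum.dtype)
```

Code that hands such arrays to a kernel expecting a fixed element width is where a silent mismatch like this tends to surface, which is why pinning the dtype at creation is the simpler fix.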

Testing

Currently, no supported attention backend combination offers both non-causal attention and FP8 KV-cache support; see #41559.

In the meantime, I tested this PR by manually disabling the causal attention requirement in DFlash. Notably, I still saw very high acceptance rates: an acceptance length (AL) above 4 on GSM8K, and a passing score for Qwen3.6 35B FP8.

python3 -m \
  vllm.entrypoints.cli.main serve \
  Qwen/Qwen3.6-35B-A3B-FP8 \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 7}' \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 --max-num-batched-tokens 16384 --max-num-seqs 16
(.venv) bchislett@bchislett-ldt:~/Repos/vllm$ python3 tests/evals/gsm8k/gsm8k_eval.py
Running GSM8K evaluation: 1319 questions, 5-shot

Results:
Accuracy: 0.757
Invalid responses: 0.000
Total latency: 110.351 s
Questions per second: 11.953
Total output tokens: 181264
Output tokens per second: 1642.608

Baseline, BF16 with no speculative decoding:

Results:
Accuracy: 0.764
Invalid responses: 0.000
Total latency: 160.383 s
Questions per second: 8.224
Total output tokens: 174474
Output tokens per second: 1087.860
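
As a rough reading of the two runs above, the following sketch computes the relative throughput, latency, and accuracy difference from the reported figures. The dictionary names are hypothetical; the numbers are copied from the results as posted. Note that the two runs differ in both KV-cache dtype and speculative decoding, so the ratios do not isolate either factor.

```python
# Figures copied from the two GSM8K runs reported above.
dflash_fp8 = {"accuracy": 0.757, "latency_s": 110.351, "tokens_per_s": 1642.608}
baseline_bf16 = {"accuracy": 0.764, "latency_s": 160.383, "tokens_per_s": 1087.860}

# Ratios greater than 1 favor the DFlash + FP8 KV-cache run.
throughput_speedup = dflash_fp8["tokens_per_s"] / baseline_bf16["tokens_per_s"]
latency_speedup = baseline_bf16["latency_s"] / dflash_fp8["latency_s"]

# Negative delta means the DFlash run scored lower than the baseline.
accuracy_delta = dflash_fp8["accuracy"] - baseline_bf16["accuracy"]

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"accuracy delta:     {accuracy_delta:+.3f}")
```

On these numbers the DFlash run produces roughly 1.5x the output tokens per second, with an accuracy difference of about 0.007 on a single evaluation pass.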

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@benchislett benchislett added the ready and dflash labels May 15, 2026
Comment thread vllm/model_executor/models/qwen3_dflash.py
@mergify mergify Bot added the qwen, speculative-decoding, v1, and bug labels May 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the DFlashQwen3DecoderLayer initialization in qwen3_dflash.py to include cache_config and quant_config. Additionally, it modifies llm_base_proposer.py to explicitly set the data type of token_arange_np to np.int32. I have no feedback to provide as there are no review comments.
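
The configuration propagation described in this summary can be pictured with a toy sketch. Every class below is a hypothetical stand-in written for illustration; none of them is vLLM's real `CacheConfig`, attention layer, or `DFlashQwen3DecoderLayer`, and the real constructor signatures may differ. The point is only that an attention layer which never receives the cache configuration falls back to a default KV-cache dtype.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCacheConfig:  # hypothetical stand-in, not vLLM's class
    cache_dtype: str = "auto"


@dataclass
class ToyQuantConfig:  # hypothetical stand-in, not vLLM's class
    name: str = "fp8"


class ToyAttention:
    def __init__(
        self,
        cache_config: Optional[ToyCacheConfig] = None,
        quant_config: Optional[ToyQuantConfig] = None,
    ) -> None:
        # With no cache config, the layer silently assumes the default dtype.
        self.kv_cache_dtype = cache_config.cache_dtype if cache_config else "auto"
        self.quant_config = quant_config


class ToyDecoderLayer:
    def __init__(
        self,
        cache_config: Optional[ToyCacheConfig] = None,
        quant_config: Optional[ToyQuantConfig] = None,
    ) -> None:
        # Forward both configs so the attention layer sees the FP8 settings.
        self.self_attn = ToyAttention(
            cache_config=cache_config, quant_config=quant_config
        )


fp8_cache = ToyCacheConfig(cache_dtype="fp8")
layer_without = ToyDecoderLayer()  # configs dropped on the way down
layer_with = ToyDecoderLayer(cache_config=fp8_cache, quant_config=ToyQuantConfig())

print(layer_without.self_attn.kv_cache_dtype)  # auto
print(layer_with.self_attn.kv_cache_dtype)     # fp8
```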

@mgoin mgoin left a comment

Member

Looks reasonable to me

@mgoin mgoin merged commit 0fe7550 into vllm-project:main May 15, 2026
73 of 74 checks passed
@benchislett benchislett deleted the dflash-fp8-kv-cache branch May 15, 2026 15:31
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request May 25, 2026
### What this PR does / why we need it?

This PR updates vllm-ascend main2main validation to:

- vLLM version: `v0.20.2`
- vLLM main commit:
vllm-project/vllm@1ac10f1
- vLLM diff:
vllm-project/vllm@0d4d334...1ac10f1

Main upstream changes and vllm-ascend adaptations:

1. vLLM PR: vllm-project/vllm#41775  
   `[Model Runner V2] FP32 gumbel sampling`

   Upstream changes:
   - Adds `use_fp64_gumbel` config / argument.
   - Changes Gumbel sampling and rejection sampling paths to accept
     `use_fp64`.
   - Makes FP32 Gumbel the default path and keeps FP64 optional.

   vllm-ascend adaptation:
   - Update `vllm_ascend/worker/v2/sample/gumbel.py` to accept `use_fp64`.
   - Update `vllm_ascend/worker/v2/spec_decode/rejection_sampler_utils.py`
     to accept `use_fp64`.
   - Raise `NotImplementedError` for `use_fp64=True` on NPU because the
     current NPU Triton path does not support FP64 Gumbel / rejection
     sampling.
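
The guard described in the last bullet might look roughly like the sketch below. The function `gumbel_noise` and its `on_npu` flag are hypothetical and written in plain NumPy; they are not the code in `vllm_ascend/worker/v2/sample/gumbel.py`, which is a Triton path. Only the `use_fp64` parameter name and the `NotImplementedError` behavior come from the description above.

```python
import numpy as np


def gumbel_noise(shape, seed, use_fp64=False, on_npu=True):
    """Draw Gumbel(0, 1) noise; hypothetical helper for illustration."""
    if use_fp64 and on_npu:
        # Mirrors the described behavior: reject FP64 instead of silently
        # falling back to a lower precision.
        raise NotImplementedError("FP64 Gumbel sampling is not supported on NPU")

    dtype = np.float64 if use_fp64 else np.float32
    rng = np.random.default_rng(seed)
    uniform = rng.random(shape, dtype=dtype)
    # Keep samples strictly positive so log(0) cannot occur.
    uniform = np.clip(uniform, np.finfo(dtype).tiny, 1.0)
    return -np.log(-np.log(uniform))


noise = gumbel_noise((2, 3), seed=0)
print(noise.dtype)  # float32

try:
    gumbel_noise((2, 3), seed=0, use_fp64=True)
    fp64_rejected = False
except NotImplementedError:
    fp64_rejected = True
print(fp64_rejected)  # True
```

Raising early keeps the failure visible at configuration time rather than producing samples at an unexpected precision.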

2. vLLM PR: vllm-project/vllm#41162  
   `[Model Runner V2] Rebuild attn metadata between draft decode steps`

   Upstream changes:
   - Adds `output_processed_logits` / `output_processed_logits_col`
     protocol for EAGLE draft sampling.
   - Uses `output_processed_logits_col` to select the current speculative
     draft step when writing draft logits.

   vllm-ascend adaptation:
   - Add `output_processed_logits` and `output_processed_logits_col`
     support in NPU `gumbel_sample`.
   - Store processed logits before adding Gumbel noise so rejection
     sampling can consume the draft logits.
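
A minimal NumPy sketch of the idea of storing processed logits before the noise is added follows. The parameter names `output_processed_logits` and `output_processed_logits_col` come from the description above, but the function body, the temperature scaling, and the assumed buffer layout `[batch, draft_step, vocab]` are illustrative assumptions, not the NPU implementation.

```python
import numpy as np


def gumbel_sample(
    logits,
    temperature,
    rng,
    output_processed_logits=None,
    output_processed_logits_col=0,
):
    """Gumbel-max sampling that can record noise-free logits (sketch)."""
    processed = logits / temperature

    if output_processed_logits is not None:
        # Write the noise-free logits into the column for the current
        # draft step, before any Gumbel noise is added.
        output_processed_logits[:, output_processed_logits_col, :] = processed

    uniform = np.clip(
        rng.random(processed.shape), np.finfo(np.float64).tiny, 1.0
    )
    noisy = processed - np.log(-np.log(uniform))
    return noisy.argmax(axis=-1)


rng = np.random.default_rng(0)
logits = np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]])
draft_logits = np.zeros((2, 4, 3))  # assumed [batch, draft_step, vocab]

tokens = gumbel_sample(logits, 2.0, rng, draft_logits, output_processed_logits_col=1)
print(tokens.shape)        # (2,)
print(draft_logits[:, 1])  # logits / 2.0, without noise
```

Recording the logits before the noise matters because rejection sampling compares draft and target distributions, and the perturbed values no longer represent the draft distribution.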


3. vLLM PR: vllm-project/vllm#42692  
   `[Bugfix] DFlash FP8 KV-Cache`

   Upstream changes:
   - Changes `SpecDecodeBaseProposer.token_arange_np` from default NumPy
     integer dtype to `np.int32`.

   vllm-ascend adaptation:
   - Update `vllm_ascend/spec_decode/eagle_proposer.py` to use
     `np.arange(..., dtype=np.int32)`.
   - Keep the Ascend-specific `max_num_tokens + 1` behavior for
     `query_start_loc_cpu[:batch_size + 1]`.

4. No direct upstream vLLM PR

   vllm-ascend adaptation:
   - Fix `OP_LOGE` / `OP_LOGW` format warnings in:
     - `csrc/moe/chunk_fwd_o/tiling_base/error_log.h`
     - `csrc/moe/chunk_gated_delta_rule_fwd_h/tiling_base/error_log.h`

   Reason:
   - The old macros passed `std::string` to `%s`, which fails when the
     current build treats format warnings as errors.


### Does this PR introduce _any_ user-facing change?

No.

This PR is for main2main compatibility validation and internal NPU
sampling / speculative decoding adaptation.

One behavior to note: if users explicitly enable `--use-fp64-gumbel` on
NPU, vllm-ascend will raise `NotImplementedError`.

### How was this patch tested?

- vLLM version: `v0.20.2`
- vLLM main commit:
vllm-project/vllm@1ac10f1
- vLLM diff:
vllm-project/vllm@0d4d334...1ac10f1
- Validation focus:
  - NPU Gumbel sampling
  - EAGLE / MTP speculative decoding
  - rejection sampling
  - custom op build

---------

Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
Liuweixiong0118 pushed a commit to Liuweixiong0118/vllm that referenced this pull request Jun 1, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
yilunh998 pushed a commit to yilunh998/vllm-ascend that referenced this pull request Jun 2, 2026
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>
Signed-off-by: yilunh <hanyilun1@huawei.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
LostFox11 pushed a commit to LostFox11/vllm-ascend that referenced this pull request Jun 15, 2026
Signed-off-by: shenzhao <shenzhao9@huawei.com>
Co-authored-by: shenzhao <shenzhao9@huawei.com>