[V1][Spec Decode] Fix greedy temperature detection after sampler refactor#27077

Merged
njhill merged 3 commits into vllm-project:main from Pradyun92:fix-eagle-temperature-detection
Oct 17, 2025

Conversation

@Pradyun92
Contributor

Description

Fixes division by zero and negative acceptance values in speculative decoding caused by incomplete temperature migration in commit a676e66.

Issue

Commit a676e66 changed the greedy sampling temperature from -1.0 to 0.0 in gpu_input_batch.py, but several components that still check for temperature == -1 were not updated, causing:

  • Division by zero in rejection sampler
  • NaN/Inf logits
  • Empty generated_token_ids
  • Negative acceptance values (len([]) - 1 = -1)
  • Prometheus metrics crash: "ValueError: Gauge metric cannot decrease"
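
The acceptance-count symptom in the last two bullets can be seen in isolation. A minimal sketch (variable names here are illustrative, not vLLM code): once NaN/Inf logits make the rejection sampler emit no tokens, the per-request acceptance count goes negative, which the Prometheus Gauge then rejects:

```python
# Hypothetical sketch of the acceptance-count symptom (not vLLM code):
# NaN/Inf logits cause the rejection sampler to emit no tokens, and the
# acceptance count len(token_ids) - 1 then goes negative.
generated_token_ids: list[int] = []          # empty due to NaN/Inf logits
num_accepted = len(generated_token_ids) - 1  # -1, a negative acceptance value
print(num_accepted)
```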

Root Cause

Multiple components were not updated after the temperature refactor:

  1. TPU input batch (tpu_input_batch.py): Still using -1.0 sentinel value
  2. Eagle (eagle.py): Hardcoded temperature == -1 check in unused function
  3. Rejection sampler (rejection_sampler.py): GREEDY_TEMPERATURE constant set to -1

The rejection sampler's expand_batch_to_tokens() function replaces GREEDY_TEMPERATURE (-1) with 1.0 before division, but when the actual temperature is 0.0 the replacement never happens, causing division by zero.
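
The failure mode can be sketched with plain Python floats in place of the tensor op (the function name and guard here are illustrative, not the actual expand_batch_to_tokens signature): the guard rewrites only the old -1.0 sentinel, so the new greedy value 0.0 reaches the division. With Python floats this raises; with tensors it surfaces as Inf logits.

```python
GREEDY_TEMPERATURE = -1.0  # stale sentinel from before commit a676e66

def scale_logit(logit: float, temp: float) -> float:
    # Mirrors the guard: only the sentinel is replaced with 1.0 before dividing.
    safe_temp = 1.0 if temp == GREEDY_TEMPERATURE else temp
    return logit / safe_temp

print(scale_logit(2.0, -1.0))  # 2.0: old sentinel is rewritten to 1.0
try:
    scale_logit(2.0, 0.0)      # new greedy value slips past the guard
except ZeroDivisionError:
    print("division by zero")  # in tensor code this shows up as Inf logits
```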

Impact

Affects ALL models using speculative decoding with rejection sampling:

  • Eagle/Eagle3
  • deepseek_mtp
  • ernie_mtp
  • qwen3_next_mtp
  • longcat_flash_mtp

Solution

Complete the temperature migration consistently across all components:

  1. tpu_input_batch.py (line 219): Change temperature from -1.0 to 0.0 for consistency with GPU
  2. eagle.py (lines 40, 1144-1151):
    • Import _SAMPLING_EPS from sampler.py
    • Use epsilon comparison: temperature < _SAMPLING_EPS instead of == -1
    • Add if not all_random: check before protection (consistent with sampler.py)
  3. rejection_sampler.py (line 18): Change the GREEDY_TEMPERATURE constant from -1 to 0
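
Under the new convention, greedy detection everywhere reduces to an epsilon comparison against 0.0. A standalone sketch of the pattern (the epsilon value and function name are illustrative; the real epsilon lives in sampler.py):

```python
_SAMPLING_EPS = 1e-5  # illustrative epsilon; the real value is defined in sampler.py

def guard_temperature(temp: float, all_random: bool) -> float:
    # Greedy requests (temperature ~ 0.0) divide by 1.0 instead of 0.0;
    # random-sampling requests pass through unchanged.
    if not all_random and temp < _SAMPLING_EPS:
        return 1.0
    return temp

print(guard_temperature(0.0, all_random=False))  # 1.0: greedy clamped before division
print(guard_temperature(0.7, all_random=False))  # 0.7: random sampling unchanged
```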

Testing

  • Verified Eagle speculation works correctly with temperature 0.0
  • No longer produces negative acceptance errors
  • Prometheus metrics no longer crash


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in speculative decoding that could lead to division-by-zero errors. The issue stemmed from an incomplete migration of the greedy sampling temperature representation from -1.0 to 0.0 in a previous refactor. Your changes correctly propagate this update across all affected components:

  • In rejection_sampler.py, GREEDY_TEMPERATURE is correctly updated to 0.
  • In eagle.py, the check for greedy sampling is updated to use an epsilon comparison, which is more robust and consistent with the rest of the codebase.
  • In tpu_input_batch.py, the greedy temperature is set to 0.0, aligning it with the GPU implementation.

These changes are well-targeted, correct, and crucial for fixing the reported crashes and incorrect behavior. The pull request is well-documented and the solution is sound.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread vllm/v1/worker/tpu_input_batch.py
@Pradyun92 Pradyun92 force-pushed the fix-eagle-temperature-detection branch from c5669f8 to bacd60b on October 17, 2025 06:04
@Pradyun92
Contributor Author

Thank you for the detailed review! You're absolutely correct - I've now fixed the TPU sampler division by zero issue.

Changes Made

Updated the commit to include:

  1. vllm/v1/sample/tpu/sampler.py - Added epsilon guard to apply_temperature():

    def apply_temperature(self, logits, temp, all_random=False):
        # Avoid division by zero for greedy sampling (temperature ~ 0.0).
        if not all_random:
            temp = torch.where(temp < _SAMPLING_EPS, 1.0, temp)
        return logits.div_(temp.unsqueeze(dim=1))
  2. vllm/v1/sample/tpu/metadata.py - Added all_random property to TPUSupportedSamplingMetadata so the sampler can conditionally apply the epsilon guard.

This follows the same pattern as vllm/v1/sample/sampler.py (lines 127-131) which already has this protection.

The fix ensures that:

  • Greedy TPU requests clamp temperature to 1.0 before division (avoiding NaN/Inf logits)
  • Random sampling requests pass through unchanged
  • Consistent behavior across GPU and TPU code paths
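
The all_random flag added to the TPU metadata can be thought of as a batch-level predicate over the per-request temperatures. A simplified stand-in (hypothetical class and epsilon, not the actual TPUSupportedSamplingMetadata):

```python
from dataclasses import dataclass

_SAMPLING_EPS = 1e-5  # illustrative epsilon

@dataclass
class MetadataSketch:
    # Simplified, hypothetical stand-in for TPUSupportedSamplingMetadata.
    temperatures: list[float]

    @property
    def all_random(self) -> bool:
        # True only when no request in the batch is greedy, so the sampler
        # can skip the epsilon guard entirely.
        return all(t >= _SAMPLING_EPS for t in self.temperatures)

print(MetadataSketch([0.7, 1.0]).all_random)  # True: every request samples randomly
print(MetadataSketch([0.0, 0.7]).all_random)  # False: batch contains a greedy request
```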

All ruff checks pass. Let me know if you'd like any other changes!

@Pradyun92 Pradyun92 force-pushed the fix-eagle-temperature-detection branch from fb38090 to 3afb34a on October 17, 2025 06:15
Member

@njhill njhill left a comment


Thanks @Pradyun92 for catching and fixing this. The fixes look great.

Could you rebase on latest main? That should hopefully fix the doc build error.

@njhill njhill added the bug Something isn't working label Oct 17, 2025
@njhill njhill added this to the v0.11.1 milestone Oct 17, 2025
Pradyun Ramadorai added 2 commits October 17, 2025 10:41
…ctor

## Issue

Commit a676e66 changed greedy sampling temperature from -1.0 to 0.0,
but several components checking for temperature == -1 were not updated:
- TPU input batch (tpu_input_batch.py) still using -1.0
- Eagle's unused function (eagle.py) with hardcoded temperature == -1
- Rejection sampler (rejection_sampler.py) GREEDY_TEMPERATURE constant = -1
- TPU sampler (tpu/sampler.py) missing epsilon guard causing division by zero

## Impact

Affects ALL models using speculative decoding with rejection sampling
(Eagle/Eagle3, deepseek_mtp, ernie_mtp, qwen3_next_mtp, longcat_flash_mtp):
- Division by zero in rejection sampler and TPU sampler
- NaN/Inf logits
- Empty generated_token_ids
- Negative acceptance values (len([]) - 1 = -1)
- Prometheus metrics crash

## Solution

Complete the temperature migration consistently:
1. tpu_input_batch.py: Change temperature -1.0 → 0.0
2. eagle.py: Use epsilon comparison (temperature < _SAMPLING_EPS)
3. rejection_sampler.py: Change GREEDY_TEMPERATURE constant -1 → 0
4. tpu/sampler.py: Add epsilon guard before division
5. tpu/metadata.py: Add all_random property for sampler

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Issue: TPU sampler and Eagle code had two separate but related issues:
1. TPU sampler divides by zero for greedy requests (temperature=0.0)
2. Eagle code triggers mypy type errors due to missing None check

Root Cause:
- TPU sampler's apply_temperature() method lacks epsilon guard to prevent
  division by zero when temperature=0.0 (greedy sampling)
- Eagle's compute_probs_and_sample_next_token() uses temperature without
  asserting it's not None, causing mypy type errors

Impact:
- TPU: Division by zero produces NaN/Inf logits, breaking speculative
  decoding on TPU platforms for all models using Eagle/rejection sampling
- Eagle: mypy type checking failures prevent pre-commit hooks from passing

Fix:
1. TPU Sampler (vllm/v1/sample/tpu/sampler.py):
   - Add all_random parameter to apply_temperature() method
   - Add epsilon guard: if not all_random: temp = torch.where(temp < _SAMPLING_EPS, 1.0, temp)
   - Update call site to pass sampling_metadata.all_random

2. TPU Metadata (vllm/v1/sample/tpu/metadata.py):
   - Add all_random property to TPUSupportedSamplingMetadata
   - Populate all_random from input_batch in from_input_batch()

3. Eagle (vllm/v1/spec_decode/eagle.py):
   - Add assert sampling_metadata.temperature is not None after all_greedy early return
   - Matches sampler.py pattern (line 162) for type safety

Files Modified:
- vllm/v1/sample/tpu/sampler.py: Epsilon guard in apply_temperature()
- vllm/v1/sample/tpu/metadata.py: Added all_random property
- vllm/v1/spec_decode/eagle.py: Added temperature None assertion
- CLAUDE.md: Updated modification vllm-project#11 to document fixes

This addresses PR vllm-project#27077 reviewer feedback and resolves mypy type errors.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
@Pradyun92 Pradyun92 force-pushed the fix-eagle-temperature-detection branch from 3afb34a to fbf572f on October 17, 2025 14:42
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 17, 2025
Collaborator

@benchislett benchislett left a comment


LGTM

@njhill
Member

njhill commented Oct 17, 2025

Side note is that we clearly have some CI gaps here.

@njhill njhill merged commit acedc74 into vllm-project:main Oct 17, 2025
46 checks passed
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
adabeyta pushed a commit to adabeyta/vllm that referenced this pull request Oct 20, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…ctor (vllm-project#27077)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
MengqingCao pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 27, 2025
### What this PR does / why we need it?
fix greedy temperature detection from
vllm-project/vllm#27077

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
### What this PR does / why we need it?
fix greedy temperature detection from
vllm-project/vllm#27077

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
### What this PR does / why we need it?
fix greedy temperature detection from
vllm-project/vllm#27077

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
### What this PR does / why we need it?
fix greedy temperature detection from
vllm-project/vllm#27077

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@81786c8
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding tpu Related to Google TPUs v1
