[Spec] Split accept_length into num_accepted_drafts and num_accepted_tokens #23962

Merged
hnyls2002 merged 19 commits into main from lsyin/spec-split-accept-length
Apr 29, 2026

@hnyls2002 (Collaborator) commented Apr 28, 2026

Follows up #23530.

Summary

  • Split the ambiguous accept_length into two explicit fields on EagleDraftInput / NgramVerifyInput: num_accepted_drafts (strictly drafts-only) and num_accepted_tokens (includes the bonus token; equals num_accepted_drafts + 1 per request)
  • Remove the accept_length.add_(1) in-place mutation that flipped the variable's semantics mid-function
  • Match the accept/draft naming convention from #23530 ([Spec] Fix spec_accept_rate and unify accept/draft naming): a name containing draft is drafts-only; a name containing accept without draft includes the bonus token
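
The relationship between the two fields can be sketched as follows. This is a minimal illustration, not the SGLang implementation: `SpecInfoSketch`, `from_drafts`, and `check_invariant` are hypothetical names, and plain Python lists stand in for the bs-sized `torch.Tensor` fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpecInfoSketch:
    """Hypothetical stand-in for a spec_info object; lists replace tensors."""

    num_accepted_drafts: List[int]  # drafts only, one entry per request
    num_accepted_tokens: List[int]  # drafts + 1 bonus token, one entry per request

    @classmethod
    def from_drafts(cls, drafts: List[int]) -> "SpecInfoSketch":
        # Set both fields together so they cannot drift apart.
        return cls(
            num_accepted_drafts=list(drafts),
            num_accepted_tokens=[d + 1 for d in drafts],
        )

    def check_invariant(self) -> bool:
        # Invariant from the summary: tokens == drafts + 1 for every request.
        if len(self.num_accepted_drafts) != len(self.num_accepted_tokens):
            return False
        return all(
            t == d + 1
            for d, t in zip(self.num_accepted_drafts, self.num_accepted_tokens)
        )


info = SpecInfoSketch.from_drafts([0, 2, 3])
```

Building both fields in one constructor mirrors the "set both fields together at every write site" rule, so a consumer never sees one field updated without the other.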

Why dual-tensor

  • Eliminates the + 1 patterns scattered across attention backends and CUDA graph runners
  • Each consumer reads the field that matches its semantics, with no derivation
  • Cost is one extra bs-sized int32 tensor per spec_info (a few KB), which is negligible
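
As a rough check on the cost estimate, the extra memory is the batch size times four bytes per int32 element. The helper name and the batch size of 512 below are illustrative assumptions, not values taken from this PR.

```python
INT32_BYTES = 4  # size of one int32 element


def extra_tensor_bytes(batch_size: int) -> int:
    # One int32 entry per request in the batch.
    return batch_size * INT32_BYTES


print(extra_tensor_bytes(512))  # 2048 bytes, i.e. 2 KiB
```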

Changes

Spec info classes

  • EagleDraftInput.num_accepted_tokens: torch.Tensor and num_accepted_tokens_cpu: List[int] added alongside the drafts-only fields
  • NgramVerifyInput.num_accepted_tokens added alongside num_accepted_drafts
  • Set both fields together at every write site (verify kernel output, recompute on finish, V2 worker assignment, CUDA graph alias)

Lifecycle decoupling

  • eagle_info_v2.py:sample() returns num_accepted_drafts + 1 out-of-place instead of .add_(1) mutation
  • eagle_info.py:prepare_extend_after_decode() no longer mutates self.num_accepted_drafts; uses local extend_lens for the kernel call
  • eagle_worker_v2.py V2 path sets both fields from batch_result.accept_lens (includes bonus) and accept_lens - 1 (drafts-only)
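
The difference between the old in-place pattern and the decoupled one can be sketched like this. The function names are hypothetical and plain lists stand in for tensors; the in-place loop plays the role of `.add_(1)`.

```python
from typing import List, Tuple


def sample_in_place(accept_length: List[int]) -> List[int]:
    # Old pattern (sketch): mutate the caller's buffer, so the same object
    # silently switches from "drafts only" to "drafts + bonus".
    for i in range(len(accept_length)):
        accept_length[i] += 1
    return accept_length


def sample_out_of_place(num_accepted_drafts: List[int]) -> List[int]:
    # New pattern (sketch): leave the input untouched and return a new value.
    return [d + 1 for d in num_accepted_drafts]


def split_accept_lens(accept_lens: List[int]) -> Tuple[List[int], List[int]]:
    # V2-style assignment: accept_lens includes the bonus token, so the
    # drafts-only field is accept_lens - 1.
    num_accepted_tokens = list(accept_lens)
    num_accepted_drafts = [n - 1 for n in accept_lens]
    return num_accepted_drafts, num_accepted_tokens


drafts = [0, 2, 3]
tokens = sample_out_of_place(drafts)  # drafts keeps its drafts-only meaning

legacy = [0, 2, 3]
sample_in_place(legacy)  # legacy now holds bonus-included values
```

With the out-of-place version, any code that reads `drafts` later in the function still sees drafts-only values, which is the property the lifecycle changes aim for.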

CUDA graph runners

  • EagleDraftExtendInputBuffers and MultiLayerEagleDraftExtendInputBuffers add a parallel num_accepted_tokens buffer; copy and alias both fields during replay
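
A minimal sketch of the parallel-buffer idea, assuming fixed-size storage that is written into on each replay. `ExtendInputBuffersSketch` and `copy_for_replay` are hypothetical names, and lists stand in for the preallocated tensors; the real buffer classes are not reproduced here.

```python
from typing import List


class ExtendInputBuffersSketch:
    """Hypothetical fixed-size buffers, loosely modeled on a graph runner."""

    def __init__(self, max_bs: int) -> None:
        # Preallocated once; replay writes into these instead of rebinding them.
        self.num_accepted_drafts = [0] * max_bs
        self.num_accepted_tokens = [0] * max_bs

    def copy_for_replay(self, drafts: List[int], tokens: List[int]) -> int:
        bs = len(drafts)
        if bs != len(tokens) or bs > len(self.num_accepted_drafts):
            raise ValueError("mismatched or oversized batch")
        # Slice assignment updates the existing storage in place.
        self.num_accepted_drafts[:bs] = drafts
        self.num_accepted_tokens[:bs] = tokens
        return bs


buffers = ExtendInputBuffersSketch(max_bs=4)
bs = buffers.copy_for_replay([0, 2], [1, 3])
```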

Attention backends

  • aiter, flashattention, trtllm_mha, nsa, nsa_backend_mtp_precompute, wave, triton: read spec_info.num_accepted_tokens (or _cpu) directly, removing the explicit + 1
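
To illustrate what removing the explicit + 1 looks like on the consumer side, here is a hedged sketch of a helper that builds cumulative query offsets. The function names are hypothetical and the computation is only an example of a bonus-included consumer, not code from any of the listed backends.

```python
from itertools import accumulate
from typing import List


def cumulative_query_offsets(num_accepted_tokens: List[int]) -> List[int]:
    # Prefix sums with a leading 0; each request contributes its accepted
    # tokens (drafts plus the bonus token), so no "+ 1" is needed here.
    return [0] + list(accumulate(num_accepted_tokens))


def cumulative_query_offsets_old(accept_length: List[int]) -> List[int]:
    # Pre-split pattern (sketch): the consumer re-derived the bonus token.
    return [0] + list(accumulate(a + 1 for a in accept_length))


offsets = cumulative_query_offsets([1, 3, 4])
```

Both helpers return the same offsets for matching inputs; the new one simply reads the field whose meaning already includes the bonus token.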

Mechanical rename

  • accept_length → num_accepted_drafts across spec workers, info classes, attention backends, CUDA graph runners, tests
  • Local variables holding bonus-included values renamed back to accept_lens / accept_len (the rule applies to the value, not just the previous name)
  • SpeculativeMetrics.accept_length retained (vLLM-compatible metric, includes bonus)
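
The naming rule can be expressed as a small check. This is only an illustration of the convention; `includes_bonus` is a hypothetical helper and does not exist in the codebase.

```python
def includes_bonus(name: str) -> bool:
    """Apply the naming rule: 'draft' means drafts-only; 'accept' alone includes the bonus."""
    lowered = name.lower()
    if "draft" in lowered:
        return False
    if "accept" in lowered:
        return True
    raise ValueError(f"name does not follow the accept/draft convention: {name}")


print(includes_bonus("num_accepted_drafts"))  # False
print(includes_bonus("num_accepted_tokens"))  # True
print(includes_bonus("accept_lens"))  # True
```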

Breaking

  • None at user-facing API level. All Prometheus metric names, meta_info keys, and CLI args unchanged.

Test plan

  • test_eagle_infer_a.py test_eagle_infer_b.py test_eagle_infer_beta.py
  • test_ngram_speculative_decoding.py test_dflash.py test_standalone_speculative_decoding.py
  • test_eagle_dp_attention.py (multi_layer_eagle_worker path)
  • DeepSeek V3.2 / NSA backend tests (covers nsa_backend_mtp_precompute extend path)


@hnyls2002 hnyls2002 requested a review from Edwardf0t1 as a code owner April 28, 2026 20:57
@github-actions github-actions Bot added the blackwell SM100/SM120 label Apr 28, 2026
Base automatically changed from lsyin/spec-metrics-rename to main April 28, 2026 21:40
…pt-length

# Conflicts:
#	python/sglang/srt/managers/io_struct.py
#	python/sglang/srt/managers/scheduler_output_processor_mixin.py
#	python/sglang/srt/managers/tokenizer_manager.py
#	python/sglang/srt/managers/utils.py
#	python/sglang/srt/speculative/dflash_info.py
#	python/sglang/srt/speculative/dflash_worker.py
#	python/sglang/srt/speculative/eagle_worker.py
#	python/sglang/srt/speculative/multi_layer_eagle_worker.py
#	python/sglang/srt/speculative/ngram_info.py
#	python/sglang/srt/speculative/ngram_worker.py
@hnyls2002 changed the title from "[Spec] Make spec_info.accept_length always drafts-only; rename to num_accepted_drafts" to "[Spec] Split accept_length into num_accepted_drafts and num_accepted_tokens" Apr 28, 2026
@hnyls2002
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h20

@hnyls2002
Collaborator Author

/rerun-stage stage-c-test-4-gpu-h100

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies).

@hnyls2002 hnyls2002 merged commit bd448e5 into main Apr 29, 2026
181 of 210 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/spec-split-accept-length branch April 29, 2026 07:02
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026