Fix hybrid_linear_attn_backend crash with ngram speculation #20739
hnyls2002 merged 5 commits into sgl-project:main
Attention backends (hybrid_linear_attn_backend, etc.) access spec_info.topk unconditionally during target_verify, but NgramVerifyInput never sets it. This crashes at server startup when using --speculative-algo NGRAM. Add topk=1 to NgramVerifyInput since ngram speculation doesn't use tree attention (unlike Eagle which has topk>1). Fixes sgl-project#20721
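The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not sglang's actual code: FakeEagleVerifyInput, FakeNgramVerifyInput, read_topk and try_read_topk are hypothetical stand-ins that only mimic an unconditional spec_info.topk access against an input type that never sets the attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeEagleVerifyInput:
    # Hypothetical stand-in: an Eagle-style verify input that carries topk.
    topk: int = 4


@dataclass
class FakeNgramVerifyInput:
    # Hypothetical stand-in: no topk attribute, as in the reported crash.
    draft_token_num: int = 8


def read_topk(spec_info) -> int:
    # Mimics a backend that reads spec_info.topk unconditionally.
    return spec_info.topk


def try_read_topk(spec_info) -> str:
    # Report either the value or the AttributeError instead of crashing.
    try:
        return f"topk={read_topk(spec_info)}"
    except AttributeError as exc:
        return f"AttributeError: {exc}"


if __name__ == "__main__":
    print(try_read_topk(FakeEagleVerifyInput()))
    print(try_read_topk(FakeNgramVerifyInput()))
```

Any input type that lacks the attribute fails the same way, which is why the discussion turns to where topk should live rather than to this one class.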
The actual fix should be propagating speculative_eagle_topk to NgramVerifyInput. It's actually already set in
sglang/python/sglang/srt/server_args.py
Line 2987 in 9419453
Conceptually, Ngram does build a spec tree (see https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/speculative/cpp_ngram/ngram.cpp#L257 and https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/speculative/cpp_ngram/ngram.cpp#L296)
The parameters that control the tree breadth and depth are:
# Tree breadth:
--speculative-ngram-min-bfs-breadth (default: 1)
--speculative-ngram-max-bfs-breadth (default: 10)
# Match window (tree depth):
--speculative-ngram-min-match-window-size (default: 1)
--speculative-ngram-max-match-window-size (default: 12)
# Other NGRAM params:
--speculative-ngram-branch-length (default: 18)
--speculative-ngram-match-type (BFS or PROB, default: BFS)
… server_args
hybrid_linear_attn_backend was the only attention backend accessing spec_info.topk at runtime. All other backends read topk from server_args.speculative_eagle_topk during __init__. This makes hybrid_linear_attn_backend consistent and removes the hardcoded self.topk = 1 from NgramVerifyInput that was papering over the issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
To make it consistent with some other attention backends, the simpler fix is to read directly from server args. Triton instantiates its own. @he-yufeng lmk what you think
/tag-and-rerun-ci
/rerun-test test_hybrid_attn_backend.py test_ngram_speculative_decoding.py
…ect#20739) Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Problem
hybrid_linear_attn_backend accesses spec_info.topk at runtime during target_verify mode, but NgramVerifyInput doesn't define topk, causing an AttributeError crash with --speculative-algo NGRAM.
Fix
Read topk from server_args.speculative_eagle_topk at init time instead of from spec_info at runtime. This avoids the dependency on SpecInput subtypes all defining topk, and is consistent with how the backend reads other config (pad_slot_id, device, etc.).
For ngram, speculative_eagle_topk is set to speculative_ngram_max_bfs_breadth in server_args, so tree attention branches execute correctly.
Fixes #20721
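The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the merged diff: FakeServerArgs, FakeBackend and FakeNgramVerifyInput are hypothetical stand-ins, the speculative_algorithm field name and the fallback to 1 when no topk is configured are assumptions, and only the speculative_eagle_topk and speculative_ngram_max_bfs_breadth names (with the latter's default of 10) come from the discussion above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeServerArgs:
    # Hypothetical stand-in for the server arguments object.
    speculative_algorithm: Optional[str] = None  # assumed field name
    speculative_eagle_topk: Optional[int] = None
    speculative_ngram_max_bfs_breadth: int = 10

    def __post_init__(self):
        # Mirrors the description: for ngram, the eagle topk field is
        # set from the maximum BFS breadth.
        if self.speculative_algorithm == "NGRAM":
            self.speculative_eagle_topk = self.speculative_ngram_max_bfs_breadth


class FakeNgramVerifyInput:
    """Hypothetical stand-in that deliberately defines no topk."""


class FakeBackend:
    def __init__(self, server_args: FakeServerArgs):
        # Read topk once at init time; the fallback to 1 is an assumption
        # for the case where speculation is not configured.
        self.topk = server_args.speculative_eagle_topk or 1

    def uses_tree_attention(self, spec_info) -> bool:
        # spec_info is accepted but never asked for topk.
        return self.topk > 1


if __name__ == "__main__":
    backend = FakeBackend(FakeServerArgs(speculative_algorithm="NGRAM"))
    print(backend.topk, backend.uses_tree_attention(FakeNgramVerifyInput()))
```

Reading the value at construction means the backend no longer depends on which input type arrives at verify time, at the cost of assuming the server arguments are final before the backend is built.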