
[Bug Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5.1 #23037

Merged: Qiaolin-Yu merged 12 commits into sgl-project:main from zRzRzRzRzRzRzR:glm-bugfix on May 1, 2026

Conversation

@zRzRzRzRzRzRzR (Contributor) commented Apr 17, 2026

Under long-running PD + DP + MTP, decode crashes with vectorized_gather_kernel: index out of bounds.

Two paths feed garbage into the next cuda graph replay:

1. Shared buffer view. logits_output fields are views into the cuda graph's pre-allocated output buffers. Assigning them to spec_info directly means the next replay() overwrites the data spec_info still holds.

2. Padded region staleness. EAGLEDraftCudaGraphRunner only overwrites [:raw_bs] of its input buffers before replay. Padded slots [raw_bs:bs] retain stale values (e.g. topk_index = -4886055425978319163) and get gathered as OOB indices inside the graph.

All guards are no-ops on healthy data.
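
A minimal sketch of the fix for path 1, assuming the field names described in the review summary below (the exact change lives in the eagle_worker.py diff):

```python
# Sketch of the path-1 fix (hedged; the merged code is in eagle_worker.py).
# logits_output's fields are views into the cuda graph's static output
# buffers, so the next replay() would overwrite them in place; cloning
# gives spec_info its own storage that survives subsequent replays.
forward_batch.spec_info.topk_p = logits_output.topk_p.clone()
forward_batch.spec_info.topk_index = logits_output.topk_index.clone()
forward_batch.spec_info.hidden_states = logits_output.hidden_states.clone()
```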

@gemini-code-assist (Bot) left a comment


Code Review

This pull request modifies the eagle_worker.py file to explicitly clone the topk_p, topk_index, and hidden_states tensors from logits_output before assigning them to forward_batch.spec_info. This ensures that the speculative information maintains independent copies of these tensors, preventing potential issues related to shared memory or in-place modifications. I have no feedback to provide as no review comments were submitted.

@zRzRzRzRzRzRzR (Contributor, Author) commented:

#15457 seems to mention this bug as well.

@zRzRzRzRzRzRzR changed the title from "[GLM-5.1] Use clone for logits output for MTP layers" to "[Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5.1" on Apr 20, 2026
@JustinTong0323 (Collaborator) left a comment


Nice catch on the aliasing + stale padding. Two comments inline; also worth squashing the commits (Update eagle_draft_cuda_graph_runner.py etc.) before merge.

Comment thread: python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py (outdated)
Comment thread: python/sglang/srt/speculative/eagle_worker.py (outdated)
@zRzRzRzRzRzRzR changed the title from "[Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5.1" to "[Bug Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5.1" on Apr 21, 2026
@kpham-sgl (Collaborator) left a comment


Thank you for your contribution! This likely fixes #22096 (which happens very rarely) as well.

Any chance you have a reproducible script to confirm this fixes the bug?

Comment thread: python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py (outdated)
Comment thread: python/sglang/srt/speculative/eagle_worker.py (outdated)
Comment on lines +387 to +390
buffers.topk_p.zero_()
buffers.topk_index.zero_()
buffers.hidden_states.zero_()
buffers.req_pool_indices.zero_()
Collaborator:

I think this is correct, but I'm unsure about the extra overhead.

@zRzRzRzRzRzRzR (Author):

This only hits the padding branch (bs != raw_bs), so the cost is bounded by the padding size; the hidden_states memset dominates, it's bandwidth-bound, and it runs once per replay. The alternative (stale OOB indices in padded slots) is what triggered the original IMA, so I'd rather eat this than make every downstream gather padding-aware.

We could tighten this to a [raw_bs:bs] tail-only memset as a follow-up if it shows up in traces; happy to do that in a separate PR. A sketch of that variant follows.
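
A hypothetical sketch of the tail-only variant (not part of this PR; buffer names follow the diff excerpt above):

```python
# Hypothetical follow-up (not in this PR): zero only the padded tail
# [raw_bs:bs] instead of the whole buffers, so the memset cost is
# bounded by the padding alone rather than the full captured batch.
if bs != raw_bs:
    buffers.topk_p[raw_bs:bs].zero_()
    buffers.topk_index[raw_bs:bs].zero_()
    buffers.hidden_states[raw_bs:bs].zero_()
    buffers.req_pool_indices[raw_bs:bs].zero_()
```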

@JustinTong0323 (Collaborator):

/tag-and-rerun-ci

@JustinTong0323 (Collaborator):

/rerun-failed-ci

2 similar /rerun-failed-ci comments from @JustinTong0323 followed.

buffers.positions.zero_()
buffers.topk_p.zero_()
buffers.topk_index.zero_()
buffers.hidden_states.zero_()
Collaborator:

Resetting the index tensor to 0 makes sense to me, since the padded part can cause an IMA. But why do the hidden states need to be reset? My understanding is that the padded portion is discarded anyway, so it shouldn't have any impact at all.

@zRzRzRzRzRzRzR (Author):

The reason I still zero hidden_states is about what happens inside the captured graph, not the final output:

The graph is captured at the padded shape (bs, not raw_bs). The captured kernels (RMSNorm, the EagleDraftInput projection, the attention that consumes hidden_states) read the full bs rows, not a sliced view. Garbage on padded rows is fine on its own.

But if the previous replay left NaN/Inf in those padded slots (which does happen on GLM-5.1 + PD + DP + MTP; that's how I tracked this bug down), the captured RMSNorm reduces over a NaN row and contaminates scratch / shared buffers that valid lanes read in some configs. The padded output is discarded; the side effects on shared scratch inside the captured graph are not.
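
To make the padded-slot mechanics concrete, here is a toy sketch (hypothetical, not code or a repro from this PR) showing that a captured graph re-reads all bs rows of its static input buffer on every replay, so stale indices left in [raw_bs:bs] surface as out-of-bounds gathers:

```python
import torch

# Toy sketch of the stale-padding hazard (assumes a CUDA device).
bs, raw_bs = 8, 5
table = torch.randn(16, device="cuda")                   # gather source
idx = torch.zeros(bs, dtype=torch.long, device="cuda")   # static input buffer
out = torch.empty(bs, device="cuda")                     # static output buffer

# Warm up on a side stream, then capture the gather at the padded shape.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    out.copy_(table[idx])
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out.copy_(table[idx])  # reads all bs rows of idx, live or padded

idx[:raw_bs] = torch.randint(0, 16, (raw_bs,), device="cuda")
g.replay()  # fine: padded slots still hold 0

idx[raw_bs:] = -4886055425978319163  # stale value from the bug report
# g.replay()  # would hit the device-side "index out of bounds" assert
```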

@Qiaolin-Yu merged commit 79bc250 into sgl-project:main on May 1, 2026
388 of 453 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026