[Bug Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5.1 #23037
Qiaolin-Yu merged 12 commits into sgl-project:main
Conversation
Code Review
This pull request modifies the eagle_worker.py file to explicitly clone the topk_p, topk_index, and hidden_states tensors from logits_output before assigning them to forward_batch.spec_info. This ensures that the speculative information maintains independent copies of these tensors, preventing potential issues related to shared memory or in-place modifications. I have no feedback to provide as no review comments were submitted.
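A minimal sketch of the change the review describes, assuming the surrounding code in eagle_worker.py looks roughly like this (the field and object names come from the review itself; everything else is an assumption):

    # Sketch only: the exact surrounding code in eagle_worker.py is assumed.
    # .clone() gives spec_info its own storage, so the next cuda graph
    # replay() cannot overwrite the tensors it still holds.
    forward_batch.spec_info.topk_p = logits_output.topk_p.clone()
    forward_batch.spec_info.topk_index = logits_output.topk_index.clone()
    forward_batch.spec_info.hidden_states = logits_output.hidden_states.clone()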
#15457 seems to mention this bug as well.
JustinTong0323 left a comment
Nice catch on the aliasing + stale padding. Two comments inline; also worth squashing the commits (Update eagle_draft_cuda_graph_runner.py etc.) before merge.
    buffers.topk_p.zero_()
    buffers.topk_index.zero_()
    buffers.hidden_states.zero_()
    buffers.req_pool_indices.zero_()
I think this is correct, but I'm unsure about the extra overhead.
Only hits the padding branch (bs != raw_bs), so the cost is bounded by padding size — hidden_states memset dominates and it's bandwidth-bound, one shot per replay. The alternative (stale OOB indices in padded slots) is what triggered the original IMA, so I'd rather eat this than make every downstream gather padding-aware.
Could tighten to a [raw_bs:bs] tail-only memset as a follow-up if it shows up in traces — happy to do that in a separate PR.
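For reference, that tail-only variant would look something like this (a hypothetical follow-up, not part of this PR; the assumption that batch is dim 0 of each buffer is mine):

    # Hypothetical follow-up sketch, not in this PR: zero only the padded
    # tail [raw_bs:bs] instead of the full buffers. Assumes batch is dim 0.
    buffers.topk_p[raw_bs:bs].zero_()
    buffers.topk_index[raw_bs:bs].zero_()
    buffers.hidden_states[raw_bs:bs].zero_()
    buffers.req_pool_indices[raw_bs:bs].zero_()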
/tag-and-rerun-ci

/rerun-failed-ci
    buffers.positions.zero_()
    buffers.topk_p.zero_()
    buffers.topk_index.zero_()
    buffers.hidden_states.zero_()
Resetting the index tensor to 0 makes sense to me, since the padded part might cause an IMA. But why do the hidden states need to be reset? My understanding is that the padded portion is discarded anyway, so it shouldn't have any impact at all.
The reason I still zero hidden_states is about what happens inside the captured graph, not the final output:
The graph is captured at the padded shape (bs, not raw_bs). The captured kernels — RMSNorm, the EagleDraftInput projection, the attention that consumes hidden_states — read the full bs rows, not a sliced view. Garbage in padded rows is fine on its own.
But if the previous replay left NaN/Inf in those padded slots (which does happen on GLM-5.1 + PD + DP + MTP — that's how I tracked this bug down), the captured RMSNorm reduces over a NaN row and contaminates scratch/shared buffers that valid lanes read in some configs. The padded output is discarded; the side effects on shared scratch inside the captured graph are not.
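A standalone toy repro of the first point (assumptions: plain PyTorch, not sglang code; torch >= 2.4 for F.rms_norm) — a graph captured at the padded shape replays its kernels over all bs rows, stale padding included:

    import torch
    import torch.nn.functional as F

    # Toy demonstration, not sglang code: kernels captured at padded shape
    # bs read every row on replay, including whatever the padded slots hold.
    bs, raw_bs, hidden = 8, 5, 16
    x = torch.zeros(bs, hidden, device="cuda")

    F.rms_norm(x, (hidden,))              # warmup before capture
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = F.rms_norm(x, (hidden,))      # captured over the full bs rows

    x[raw_bs:].fill_(float("nan"))        # stale garbage in the padded slots
    g.replay()
    torch.cuda.synchronize()
    print(y[raw_bs:].isnan().any())       # tensor(True): padded rows were read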
Under long-running PD + DP + MTP, decode crashes with vectorized_gather_kernel: index out of bounds. Two paths feed garbage into the next cuda graph replay:

1. Shared buffer view. logits_output fields are views into the cuda graph's pre-allocated output buffers. Assigning them to spec_info directly means the next replay() overwrites the data spec_info still holds.
2. Padded region staleness. EAGLEDraftCudaGraphRunner only overwrites [:raw_bs] of its input buffers before replay. Padded slots [raw_bs:bs] retain stale values (e.g. topk_index = -4886055425978319163) and get gathered as OOB indices inside the graph.

All guards are no-ops on healthy data.
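To make path 1 concrete, a toy sketch (assumption: plain PyTorch, not the actual sglang buffers) of how an alias into a graph's output buffer goes stale on the next replay while a clone survives:

    import torch

    # Toy illustration of path 1, not sglang code: a view into the graph's
    # output buffer is rewritten by the next replay(); a clone() is not.
    buf = torch.ones(4, device="cuda")
    _ = buf * 2                            # warmup before capture
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        out = buf * 2                      # lives in the graph's memory pool

    g.replay()                             # out == 2.0
    alias, copy = out, out.clone()         # spec_info-style view vs. the fix
    buf.fill_(3.0)
    g.replay()                             # rewrites out in place: out == 6.0
    print(alias[0].item(), copy[0].item())  # 6.0 2.0 — the alias went stale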