fix: TRT-LLM MHA CUDA illegal address with EAGLE v2 + DP attention #21649
Conversation
Code Review
This pull request introduces a `batch_size` field to the `TRTLLMMHAMetadata` class and ensures it is populated during metadata initialization to prevent issues with inflated batch sizes from DP padding. The reviewer pointed out that this initialization is missing in the `init_forward_metadata_capture_cuda_graph` function, which could still lead to illegal address errors when running with CUDA graphs.
```python
metadata = TRTLLMMHAMetadata()
seqlens_in_batch = forward_batch.seq_lens
batch_size = forward_batch.batch_size
metadata.batch_size = batch_size
```
This correctly stores the batch size for the non-CUDA-graph path. However, the same logic appears to be missing for the CUDA graph path in `init_forward_metadata_capture_cuda_graph`.
In that function, `metadata.batch_size` is not set, so it will default to 0. When `forward_extend` is called in a CUDA graph context with DP attention, this will likely cause the same `CUDA_ERROR_ILLEGAL_ADDRESS` this PR aims to fix.
To ensure the fix is complete, please initialize `metadata.batch_size` in `init_forward_metadata_capture_cuda_graph` as well. For example, you could add `metadata.batch_size = bs` at the beginning of the function, as sketched below.
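A minimal sketch of that suggestion, assuming a simplified metadata class and capture hook (the signature and fields here are illustrative, not the actual sglang code):

```python
from dataclasses import dataclass


@dataclass
class TRTLLMMHAMetadata:
    batch_size: int = 0  # defaults to 0 when a code path forgets to set it


def init_forward_metadata_capture_cuda_graph(bs: int) -> TRTLLMMHAMetadata:
    metadata = TRTLLMMHAMetadata()
    metadata.batch_size = bs  # mirror the non-CUDA-graph path
    # ... build page_table, cache_seqlens, cu_seqlens_q/k for bs entries ...
    return metadata
```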
…ddress
When DP attention is enabled with EAGLE v2 speculative decoding,
`prepare_mlp_sync_batch` inflates `forward_batch.batch_size` to match
the max across DP ranks for MLP synchronization. However,
`init_forward_metadata` has already computed metadata tensors
(page_table, cache_seqlens, cu_seqlens_q/k) for the original,
smaller batch_size.
The TRT-LLM FMHA kernel in `forward_extend` was using the inflated
`forward_batch.batch_size`, causing it to read past the metadata tensor
boundaries. This triggers `CUDA_ERROR_ILLEGAL_ADDRESS` in
`fmhaKernels.cuh` when the kernel accesses invalid page table entries
via TMA descriptors (configured with OOB_FILL_NONE).
The fix stores `batch_size` in `TRTLLMMHAMetadata` at init time and
uses `self.forward_metadata.batch_size` in the kernel call, which
is the correct pre-padding value.
Reproduction:
- Model: Qwen3.5-397B-A17B-FP8 on 4x B200 NVL4 (GB300)
- Config: --tp 4 --dp-size=4 --enable-dp-attention
--speculative-algorithm=EAGLE --attention-backend=trtllm_mha
--enable-multimodal
- Trigger: 500+ concurrent multimodal (MMMU-Pro) requests
- Crash: CUDA_ERROR_ILLEGAL_ADDRESS at fmhaKernels.cuh:304
in _draft_extend_for_decode -> forward_extend
Verified: 1730/1730 MMMU-Pro questions completed without crash
after fix (previously crashed at ~388-405).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
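A minimal sketch of the shape of the fix described in this commit message (class and method names follow the message; the bodies are illustrative, not the actual sglang code):

```python
from dataclasses import dataclass


@dataclass
class TRTLLMMHAMetadata:
    batch_size: int = 0


class TRTLLMMHABackend:
    forward_metadata: TRTLLMMHAMetadata

    def init_forward_metadata(self, forward_batch) -> None:
        metadata = TRTLLMMHAMetadata()
        # Captured before prepare_mlp_sync_batch inflates it for DP padding.
        metadata.batch_size = forward_batch.batch_size
        self.forward_metadata = metadata

    def forward_extend(self, forward_batch) -> int:
        # Use the pre-padding value stored at init time, not the possibly
        # inflated forward_batch.batch_size.
        return self.forward_metadata.batch_size
```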
force-pushed from 67e8502 to e26b7cd (Compare)
/rerun-stage stage-c-test-4-gpu-b200
✅ Triggered
Qiaolin-Yu left a comment
My intuition is that when padding is introduced by DP attention, some information in the forward batch becomes inconsistent with the metadata. But in this case, shouldn’t the information from the forward batch be the correct one, since it reflects the padded state? Why does this code choose to use the metadata when the two are inconsistent?
The previous commit stored `batch_size` in `TRTLLMMHAMetadata` and used it in `forward_extend`, but only set it in `init_forward_metadata` (the non-CUDA-graph path). `init_forward_metadata_capture_cuda_graph` left it at the default 0, causing `CUDA_ERROR_INVALID_VALUE` during EAGLE v1 draft extend graph capture with the trtllm_mha backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IIUC this is exactly the issue that causes the IMA: the batch size increases, and the trtllm_mha kernel uses it to access the data. The CI failure https://github.com/sgl-project/sglang/actions/runs/23727613783/job/69114314668 is caused by a behavior difference between spec v1 and v2: spec v2 does not capture draft CUDA graph state for the trtllm_mha backend, while the test uses v1. Let me retrigger the test to see whether it works now.
/rerun-stage stage-c-test-4-gpu-b200
✅ Triggered
What I mean is that there may indeed be an inconsistency here: some metadata has been padded while some has not. However, I think the correct fix would be to pad the non-padded attributes for DP attention, rather than using the non-padded version.
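A hedged sketch of that alternative, using a hypothetical `pad_metadata` helper that is not part of this PR: pad the per-request metadata tensors up to the DP-padded batch size instead of shrinking the batch size the kernel sees.

```python
import torch
import torch.nn.functional as F


def pad_metadata(cache_seqlens: torch.Tensor, padded_bs: int) -> torch.Tensor:
    # Right-pad a per-request metadata tensor with zeros up to padded_bs.
    pad = padded_bs - cache_seqlens.shape[0]
    return F.pad(cache_seqlens, (0, pad)) if pad > 0 else cache_seqlens


# Metadata built for 10 requests; DP padding inflates the batch to 12.
padded = pad_metadata(torch.ones(10, dtype=torch.int32), 12)
assert padded.shape[0] == 12  # entries 10 and 11 are zero-length dummies
```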
|
Other attention backends derive the batch size from metadata tensor shapes, while the trtllm-mha backend reads `forward_batch.batch_size`. So I think we can use `cu_seqlens_q.shape[0] - 1` instead.
Use `cu_seqlens_q.shape[0] - 1` to get the real batch size in `forward_extend`, consistent with how other attention backends work. This removes the need for a separate `batch_size` field on `TRTLLMMHAMetadata`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
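A small sketch of that derivation with toy values, assuming `cu_seqlens_q` holds cumulative query lengths with a leading zero, so its length is the batch size plus one:

```python
import torch

# Three requests with 3, 4, and 5 query tokens -> cumulative [0, 3, 7, 12].
cu_seqlens_q = torch.tensor([0, 3, 7, 12], dtype=torch.int32)

# The batch size follows from the tensor shape, independent of any DP
# padding applied to forward_batch.batch_size.
batch_size = cu_seqlens_q.shape[0] - 1
assert batch_size == 3
```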
force-pushed from 5324457 to 89fdfee (Compare)
/rerun-stage stage-c-test-4-gpu-b200
✅ Triggered
/tag-and-rerun-ci
Summary
- Fixes `CUDA_ERROR_ILLEGAL_ADDRESS` in the TRT-LLM FMHA kernel during EAGLE v2 speculative decoding with DP attention and multimodal inputs
- Stores `batch_size` in `TRTLLMMHAMetadata` at init time and uses it in `forward_extend`, instead of `forward_batch.batch_size`, which may be inflated by DP padding

Root Cause
When DP attention is enabled with EAGLE v2, `prepare_mlp_sync_batch` (forward_batch_info.py:891) inflates `forward_batch.batch_size` to match the max across DP ranks for MLP synchronization. For example, DP0 has 10 requests but `batch_size` gets inflated to 12 to match DP1/2/3.

However, `init_forward_metadata` has already computed metadata tensors (`page_table`, `cache_seqlens`, `cu_seqlens_q/k`) for the original batch_size of 10. The TRT-LLM FMHA kernel in `forward_extend` (line 874) was passing the inflated `forward_batch.batch_size=12` while the metadata tensors only had 10 entries. The kernel iterates over 12 requests, reads indices 10 and 11 past the tensor boundaries, and hits unmapped GPU memory.

The TMA descriptors in `fmhaKernels.cuh` are configured with `CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE`, so out-of-bounds access causes a hard `CUDA_ERROR_ILLEGAL_ADDRESS` rather than being clamped.

Other attention backends (FlashInfer native, Triton) are immune because they derive batch_size from metadata tensor shapes rather than using `forward_batch.batch_size` as an explicit parameter.
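An illustrative sketch of the mismatch with toy values (not sglang code):

```python
import torch

real_bs, padded_bs = 10, 12  # metadata built for 10, kernel told 12
cache_seqlens = torch.zeros(real_bs, dtype=torch.int32)

for i in range(padded_bs):
    if i >= cache_seqlens.shape[0]:
        # In Python this read would be an IndexError; in the FMHA kernel it
        # goes through a TMA descriptor with OOB_FILL_NONE and lands on
        # unmapped memory -> CUDA_ERROR_ILLEGAL_ADDRESS.
        print(f"request {i} indexes past the metadata boundary")
```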
Reproduction
- Config: `--tp 4 --dp-size=4 --enable-dp-attention --speculative-algorithm=EAGLE --speculative-num-steps=3 --speculative-eagle-topk=1 --speculative-num-draft-tokens=4 --attention-backend=trtllm_mha --enable-multimodal --mamba-scheduler-strategy=extra_buffer --page-size=64`
- Crash: `CUDA_ERROR_ILLEGAL_ADDRESS` at `fmhaKernels.cuh:304` in `_draft_extend_for_decode` → `forward_extend`, consistently at ~388-405/500 questions

Stack trace (with CUDA_LAUNCH_BLOCKING=1)
Verification
After fix: 1730/1730 MMMU-Pro questions completed without crash (accuracy 78.55%), on the exact config that previously crashed at ~388-405.
Server log from crash reproduction attached
Test plan
- `CUDA_LAUNCH_BLOCKING=1`

🤖 Generated with Claude Code