Fix NSA FA3 shape mismatch under DP Attention + EAGLE#24235
junliu-mde wants to merge 2 commits into sgl-project:main
Conversation
Code Review
This pull request introduces deferred attention metadata initialization to handle Data Parallel (DP) attention padding, specifically for Native Sparse Attention (NSA) and FlashAttention-3 (FA3) backends. By tracking the original batch dimensions before padding, the system ensures that attention kernels operate on the correct token subsets while preserving the padded shapes needed for downstream MLP synchronization. Feedback suggests refining the backend type check in ModelRunner to be more comprehensive and consistent with other parts of the codebase.
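The idea of tracking pre-padding batch dimensions can be sketched roughly as follows. This is an illustrative toy, not the actual SGLang API: `AttnMetadata`, `init_metadata_deferred`, and `dp_pad_multiple` are hypothetical names.

```python
# Sketch: record the original batch size before DP attention padding so
# attention kernels can slice to the real tokens, while the padded shape
# is kept for downstream MLP synchronization. All names are illustrative.
from dataclasses import dataclass

@dataclass
class AttnMetadata:
    padded_batch_size: int   # shape after DP attention padding
    real_batch_size: int     # original number of tokens before padding

def init_metadata_deferred(batch_size: int, dp_pad_multiple: int) -> AttnMetadata:
    # Round up to the multiple required by the DP all-gather/all-reduce.
    padded = ((batch_size + dp_pad_multiple - 1) // dp_pad_multiple) * dp_pad_multiple
    return AttnMetadata(padded_batch_size=padded, real_batch_size=batch_size)

meta = init_metadata_deferred(batch_size=5, dp_pad_multiple=4)
print(meta.real_batch_size)    # 5 — what the attention kernel should see
print(meta.padded_batch_size)  # 8 — what MLP sync should see
```

The point of deferring is that padding happens after the batch is built, so the metadata must capture both sizes rather than only the padded one.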
        if forward_batch.input_ids is not None
        else 0
    )
    if attn_backend.__class__.__name__ != "NativeSparseAttnBackend":
The check attn_backend.__class__.__name__ != "NativeSparseAttnBackend" is a bit brittle as it only checks for one specific class name. A similar check in python/sglang/srt/speculative/eagle_worker.py uses a helper function _is_native_sparse_attn_backend which checks for both NativeSparseAttnBackend and NativeSparseAttnMultiStepBackend. To improve robustness and maintainability, consider making this check more comprehensive by checking against a set of class names. This would make the code more resilient to future changes, such as if a wrapper class is introduced.
-if attn_backend.__class__.__name__ != "NativeSparseAttnBackend":
+if attn_backend.__class__.__name__ not in ("NativeSparseAttnBackend", "NativeSparseAttnMultiStepBackend"):
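A self-contained sketch of the suggested pattern: membership in a frozen set of class names instead of a single string equality. The `NativeSparseAttnMultiStepBackend` class below is a stand-in for the real backend class, defined only so the example runs.

```python
# Checking against a set of class names is more robust than one equality
# check, and cheap to extend if a wrapper backend is added later.
NSA_BACKEND_NAMES = frozenset(
    {"NativeSparseAttnBackend", "NativeSparseAttnMultiStepBackend"}
)

def is_nsa_backend(attn_backend) -> bool:
    return attn_backend.__class__.__name__ in NSA_BACKEND_NAMES

class NativeSparseAttnMultiStepBackend:  # stand-in for the real class
    pass

print(is_nsa_backend(NativeSparseAttnMultiStepBackend()))  # True
print(is_nsa_backend(object()))                            # False
```

This mirrors the `_is_native_sparse_attn_backend` helper mentioned from `eagle_worker.py`, so both call sites stay consistent.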
Keep NSA imports at module scope and remove ambiguous or unused local variables so the touched files pass ruff.
Align NSA metadata padding with DP attention token padding and guard empty DP ranks before FA3 and DeepGEMM metadata paths.
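The empty-rank guard described in the commit message can be sketched as below. Under DP attention a rank may receive zero real tokens after the batch is split; names here (`compute_fa3_metadata`) are hypothetical, not the actual SGLang functions.

```python
# Sketch: skip metadata computation on an empty DP rank. Running the
# FA3/DeepGEMM metadata paths on a zero-token slice can produce
# degenerate shapes, so return early instead. Illustrative only.
from typing import Optional

def compute_fa3_metadata(num_tokens: int) -> Optional[dict]:
    if num_tokens == 0:
        return None  # empty DP rank: nothing to precompute
    return {"num_tokens": num_tokens}

print(compute_fa3_metadata(0))    # None
print(compute_fa3_metadata(128))  # {'num_tokens': 128}
```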
Force-pushed 3545ca3 to f2e386d
The scheduler_metadata buffer precomputed in `_compute_scheduler_metadata` (introduced by PR sgl-project#21104 to avoid per-layer `prepare_varlen_num_blocks`) can become inconsistent with the `num_splits` the C++ `mha_fwd` kernel derives from the live `cache_seqlens` once decode advances. The mismatch triggers an out-of-bounds read in the FA3 split-KV combine kernel and surfaces as a CUDA illegal-memory-access at `flash_fwd_combine_launch_template.h:52`.

Reproduces with Qwen3-0.6B + `--enable-dp-attention --dp 8 --tp 8 --chunked-prefill-size 131072` on H200 after ~65 decode steps. Single-GPU and TP-only paths are unaffected.

Skip the precompute when DP attention is on and let the C++ kernel recompute its own metadata per layer. PR sgl-project#21104's optimization is preserved on every other path. PR sgl-project#24235 had previously addressed a narrower variant on NSA + EAGLE.

Co-authored-by: Cursor <cursoragent@cursor.com>
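The shape of the fix described above can be sketched as a single conditional. `maybe_precompute_scheduler_metadata` and `precompute` are hypothetical names; the real change lives inside the FA3 backend.

```python
# Sketch: when DP attention is enabled, return None so the kernel
# recomputes its own metadata per layer from the live cache_seqlens,
# keeping num_splits consistent on every decode step. On all other
# paths, keep the PR #21104 precompute fast path. Illustrative only.
from typing import Optional

def precompute(cache_seqlens: list) -> dict:
    # Stand-in for the real _compute_scheduler_metadata buffer build.
    return {"seqlens": list(cache_seqlens)}

def maybe_precompute_scheduler_metadata(
    enable_dp_attention: bool, cache_seqlens: list
) -> Optional[dict]:
    if enable_dp_attention:
        return None  # kernel derives metadata itself; no stale buffer
    return precompute(cache_seqlens)

print(maybe_precompute_scheduler_metadata(True, [3, 7]))   # None
print(maybe_precompute_scheduler_metadata(False, [3, 7]))  # {'seqlens': [3, 7]}
```

The trade-off is re-deriving metadata per layer on the DP path, which the commit message accepts in exchange for correctness once decode advances.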
Summary
Fixes #24233
Test plan
Launch flags: `--tp 8 --dp 4 --enable-dp-attention --cuda-graph-max-bs 2 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --page-size 1`.

Before the fix, `python3 test-dpa-repro.py http://localhost:30000 1v3` reproduces `RuntimeError: batch_size must be equal to batch_size_k`.

After the fix, the `test-dpa-repro.py ... 1v3` client returns `Total: 4/4 ok, 0 failed` after warmup. The local repro script exits non-zero on the fixed path because it encodes "crash confirmed" as success.

After the fix: 0 `batch_size must be equal`, 0 FA3 metadata/page table mismatch, 0 scheduler exception, 0 HTTP 500, and 0 traceback/assertion.