
Fix NSA FA3 shape mismatch under DP Attention + EAGLE #24235

Open
junliu-mde wants to merge 2 commits into sgl-project:main from junliu-mde:fix/nsa-fa3-dpa-eagle-clean

Conversation


@junliu-mde (Contributor) commented on May 1, 2026

Summary

  • Defer NSA draft attention metadata initialization on DP-attention EAGLE eager paths until after MLP-sync padding is applied.
  • Track the real pre-padding batch/token sizes and initialize NSA metadata from a real-batch view, while keeping the padded tensors for downstream DP sync.
  • Make FA3 decode consume only the real metadata rows and restore a padded output tail for shape compatibility (a sketch of this pattern follows the list).
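
For orientation, here is a minimal, hypothetical sketch of the defer-then-slice pattern described above; the names (`run_dp_eagle_decode`, `num_real_rows`, `dp_pad_to`) are illustrative and do not match the exact sglang APIs touched by this PR.

```python
import torch


def run_dp_eagle_decode(attn_backend, forward_batch, dp_pad_to: int):
    """Hypothetical sketch: defer NSA metadata init until after DP padding."""
    # 1. Record the real (pre-padding) batch size before MLP-sync padding.
    real_bs = forward_batch.batch_size

    # 2. Pad tensors so every DP rank presents the same shape for MLP sync.
    pad = dp_pad_to - real_bs
    if pad > 0:
        forward_batch.input_ids = torch.cat(
            [forward_batch.input_ids, forward_batch.input_ids.new_zeros(pad)]
        )

    # 3. Only now initialize NSA/FA3 metadata, and only from the real-batch
    #    view, so the kernel's q and k batch sizes agree.
    attn_backend.init_forward_metadata(forward_batch, num_real_rows=real_bs)

    # 4. Decode consumes the real rows; restore a padded tail afterwards so
    #    downstream DP sync still sees the padded shape.
    out_real = attn_backend.forward_decode(forward_batch)  # [real_bs, ...]
    out = out_real.new_zeros((dp_pad_to, *out_real.shape[1:]))
    out[:real_bs] = out_real
    return out
```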

Fixes #24233

Test plan

  • Aligned with #24233 (NSA FA3 crash with DP Attention on padded speculative batches):
    • Start latest main with NSA + DP Attention + EAGLE/NextN (`--tp 8 --dp 4 --enable-dp-attention --cuda-graph-max-bs 2 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --page-size 1`).
    • Vanilla latest main + `python3 test-dpa-repro.py http://localhost:30000 1v3` reproduces `RuntimeError: batch_size must be equal to batch_size_k`.
    • Patched latest main + the same `test-dpa-repro.py ... 1v3` client returns `Total: 4/4 ok, 0 failed` after warmup. The local repro script exits non-zero on the fixed path because it encodes "crash confirmed" as success (a hypothetical stand-in for this client is sketched after the list).
    • Verified that the patched backend logs contain 0 occurrences of `batch_size must be equal`, 0 FA3 metadata/page-table mismatches, 0 scheduler exceptions, 0 HTTP 500s, and 0 tracebacks/assertions.
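
The repro script is local to the author and not part of this PR; as a rough, hypothetical stand-in, a minimal client along the following lines could exercise the same server path. It assumes sglang's standard `/generate` HTTP endpoint; the real script's arguments (such as `1v3`) and its inverted exit code are not modeled here.

```python
# Hypothetical stand-in for test-dpa-repro.py: fire a few generate
# requests at the server and report how many complete without an error.
import sys

import requests


def main(base_url: str, n_requests: int = 4) -> int:
    ok = 0
    for i in range(n_requests):
        resp = requests.post(
            f"{base_url}/generate",
            json={"text": f"ping {i}", "sampling_params": {"max_new_tokens": 8}},
            timeout=60,
        )
        ok += resp.status_code == 200
    print(f"Total: {ok}/{n_requests} ok, {n_requests - ok} failed")
    return 0 if ok == n_requests else 1


if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "http://localhost:30000"))
```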


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces deferred attention metadata initialization to handle Data Parallel (DP) attention padding, specifically for Native Sparse Attention (NSA) and FlashAttention-3 (FA3) backends. By tracking the original batch dimensions before padding, the system ensures that attention kernels operate on the correct token subsets while preserving the padded shapes needed for downstream MLP synchronization. Feedback suggests refining the backend type check in ModelRunner to be more comprehensive and consistent with other parts of the codebase.

The flagged line in the `ModelRunner` diff context:

```python
if forward_batch.input_ids is not None
else 0
)
if attn_backend.__class__.__name__ != "NativeSparseAttnBackend":
```


Severity: medium

The check `attn_backend.__class__.__name__ != "NativeSparseAttnBackend"` is a bit brittle, as it only checks for one specific class name. A similar check in `python/sglang/srt/speculative/eagle_worker.py` uses a helper function `_is_native_sparse_attn_backend`, which checks for both `NativeSparseAttnBackend` and `NativeSparseAttnMultiStepBackend`. To improve robustness and maintainability, consider making this check more comprehensive by checking against a set of class names. This would make the code more resilient to future changes, such as the introduction of a wrapper class.

Suggested change:

```diff
-if attn_backend.__class__.__name__ != "NativeSparseAttnBackend":
+if attn_backend.__class__.__name__ not in ("NativeSparseAttnBackend", "NativeSparseAttnMultiStepBackend"):
```
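
The helper the reviewer points to could look roughly like the following; this is a sketch inferred from the comment above, not the actual body of `_is_native_sparse_attn_backend` in `eagle_worker.py`.

```python
# Sketch of the helper-style check referenced by the review; the exact
# implementation in eagle_worker.py may differ.
_NSA_BACKEND_CLASS_NAMES = frozenset(
    {"NativeSparseAttnBackend", "NativeSparseAttnMultiStepBackend"}
)


def _is_native_sparse_attn_backend(attn_backend) -> bool:
    # Comparing class names avoids importing the NSA backend module on
    # builds where it is unavailable.
    return attn_backend.__class__.__name__ in _NSA_BACKEND_CLASS_NAMES


# Usage at the call site flagged above:
# if not _is_native_sparse_attn_backend(attn_backend):
#     ...
```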

@junliu-mde (Author) replied:

Is it necessary?

junliu-mde added 2 commits on May 7, 2026:

  • Keep NSA imports at module scope and remove ambiguous or unused local variables so the touched files pass ruff.
  • Align NSA metadata padding with DP attention token padding and guard empty DP ranks before FA3 and DeepGEMM metadata paths (a small illustration of the empty-rank guard follows).
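
As a loose illustration of the empty-rank guard described in the second commit; the field and method names are placeholders, not the exact sglang code:

```python
# Hypothetical empty-DP-rank guard, per the second commit's description.
def init_metadata_for_dp_rank(backend, forward_batch):
    # A DP rank can end up with zero real rows once padding is excluded.
    if forward_batch.batch_size == 0:
        # Skip FA3/DeepGEMM metadata construction, which assumes at least
        # one real row, and let the caller emit a padding-only output.
        return None
    return backend.init_forward_metadata(forward_batch)
```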
@junliu-mde force-pushed the fix/nsa-fa3-dpa-eagle-clean branch from 3545ca3 to f2e386d on May 7, 2026, 12:29
YAMY1234 added a commit to YAMY1234/sglang that referenced this pull request May 7, 2026
The scheduler_metadata buffer precomputed in `_compute_scheduler_metadata`
(introduced by PR sgl-project#21104 to avoid per-layer `prepare_varlen_num_blocks`)
can become inconsistent with the `num_splits` the C++ `mha_fwd` kernel
derives from the live `cache_seqlens` once decode advances. The mismatch
triggers an out-of-bounds read in the FA3 split-KV combine kernel and
surfaces as a CUDA illegal-memory-access at
`flash_fwd_combine_launch_template.h:52`.

Reproduces with Qwen3-0.6B + `--enable-dp-attention --dp 8 --tp 8
--chunked-prefill-size 131072` on H200 after ~65 decode steps. Single-GPU
and TP-only paths are unaffected.

Skip the precompute when DP attention is on and let the C++ kernel
recompute its own metadata per layer. PR sgl-project#21104's optimization is
preserved on every other path. PR sgl-project#24235 had previously addressed a
narrower variant on NSA + EAGLE.

Co-authored-by: Cursor <cursoragent@cursor.com>
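
A minimal sketch of the guard this commit message describes, assuming a `server_args.enable_dp_attention` flag mirroring the `--enable-dp-attention` CLI option; the attribute names are otherwise illustrative:

```python
def maybe_precompute_scheduler_metadata(self, batch):
    if self.server_args.enable_dp_attention:
        # Under DP attention the precomputed buffer can drift from the
        # num_splits that the C++ mha_fwd kernel derives from the live
        # cache_seqlens, so let the kernel recompute per layer instead.
        self.scheduler_metadata = None
        return
    # All other paths keep the PR #21104 optimization: compute once here
    # instead of calling prepare_varlen_num_blocks per layer.
    self.scheduler_metadata = self._compute_scheduler_metadata(batch)
```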


Successfully merging this pull request may close issue #24233: NSA FA3 crash with DP Attention on padded speculative batches.
