[Refactor] Deduplicate NSA utils.py into cp_utils.py for context parallel #22914
Merged
…llel

Remove duplicated context-parallel utility functions from `layers/attention/nsa/utils.py` and consolidate them into `layers/utils/cp_utils.py`. This eliminates ~270 lines of duplicate code while preserving all functionality.

Key changes:
- Unify `NSAContextParallelMetadata` into `ContextParallelMetadata`
- Merge the `nsa_cp_metadata` field into `attn_cp_metadata` on `ForwardBatch`
- Extend `cp_utils` functions with round-robin split and symmetric memory
- Rename the NSA-specific `can_cp_split` to `can_nsa_cp_split`
- Replace `prepare_input_dp_with_cp_dsa` with `prepare_context_parallel_metadata`
- Update all callers (`deepseek_v2`, `deepseek_nextn`, `nsa_indexer`, etc.)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
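The key changes above add round-robin split support to the shared `cp_utils` helpers alongside the existing in-sequence split. The following is a minimal sketch, not the sglang implementation. It assumes that in-seq split means the zigzag layout common in context-parallel attention (the sequence is cut into `2 * cp_size` chunks and rank `r` owns chunks `r` and `2 * cp_size - 1 - r`), and that round-robin split assigns tokens to ranks cyclically. All function names here are hypothetical, and treating each chunk's end offset as its extend-only KV length is also an assumption.

```python
from typing import List, Tuple


def chunk_bounds(seq_len: int, num_chunks: int) -> List[int]:
    """Boundaries of num_chunks near-equal chunks; earlier chunks absorb the remainder."""
    base, rem = divmod(seq_len, num_chunks)
    bounds = [0]
    for i in range(num_chunks):
        bounds.append(bounds[-1] + base + (1 if i < rem else 0))
    return bounds


def in_seq_split(seq_len: int, cp_size: int, cp_rank: int) -> Tuple[range, range, int, int]:
    """Hypothetical zigzag split: rank r owns chunk r ("prev") and chunk 2*cp_size-1-r ("next")."""
    bounds = chunk_bounds(seq_len, 2 * cp_size)
    j = 2 * cp_size - 1 - cp_rank
    prev = range(bounds[cp_rank], bounds[cp_rank + 1])
    nxt = range(bounds[j], bounds[j + 1])
    # Under causal masking a chunk attends to every key up to its own end,
    # so the chunk end doubles as its (extend-only) KV length.
    return prev, nxt, prev.stop, nxt.stop


def round_robin_split(seq_len: int, cp_size: int, cp_rank: int) -> List[int]:
    """Hypothetical round-robin split: token t goes to rank t % cp_size."""
    return list(range(cp_rank, seq_len, cp_size))


if __name__ == "__main__":
    for rank in range(4):
        prev, nxt, kv_prev, kv_next = in_seq_split(16, cp_size=4, cp_rank=rank)
        print(rank, list(prev), list(nxt), kv_prev, kv_next)
    print(round_robin_split(10, cp_size=4, cp_rank=1))
```

With 16 tokens and `cp_size=4`, every rank's `kv_prev + kv_next` comes out to 18. Pairing an early chunk with a late one balances causal-attention work across ranks in this layout.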
Author (Collaborator): `/rerun-test test_deepseek_v32_cp_single_node.py`

Contributor: ✅
Fridge003 commented on Apr 16, 2026
Author (Collaborator): `/rerun-test test_deepseek_v32_cp_single_node.py`

Contributor: ✅
`prepare_context_parallel_metadata` now sums `prefix_len` into `kv_len_prev`/`kv_len_next`, so `_get_topk_ragged_with_cp` must not add `(seq_lens_cpu - extend_seq_lens_cpu)` again. Otherwise, `get_index_k_continuous` reads past the block table and triggers `cudaErrorIllegalAddress` once a request hits the radix prefix cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
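The double-counting failure described in this commit can be reproduced with plain integers. This is a hypothetical model, not the sglang code: `metadata_kv_len_prev` stands in for what `prepare_context_parallel_metadata` stores, `indexer_kv_end` stands in for the K-range end that `_get_topk_ragged_with_cp` derives, and the block table is assumed to cover exactly `seq_len` tokens for the request.

```python
from dataclasses import dataclass


@dataclass
class Req:
    seq_len: int     # cached radix prefix + tokens extended in this step
    extend_len: int  # tokens extended in this step

    @property
    def prefix_len(self) -> int:
        # Same quantity as seq_lens_cpu - extend_seq_lens_cpu for a single request.
        return self.seq_len - self.extend_len


def metadata_kv_len_prev(extend_only_end: int, req: Req, bake_prefix: bool) -> int:
    """Hypothetical stand-in for the kv_len_prev stored in CP metadata."""
    return extend_only_end + (req.prefix_len if bake_prefix else 0)


def indexer_kv_end(kv_len_prev: int, req: Req, readd_prefix: bool) -> int:
    """Hypothetical stand-in for the K-range end used by the NSA indexer."""
    return kv_len_prev + (req.prefix_len if readd_prefix else 0)


def in_bounds(end: int, req: Req) -> bool:
    # Reading past seq_len would walk off the request's block table.
    return 0 <= end <= req.seq_len


if __name__ == "__main__":
    req = Req(seq_len=128, extend_len=32)  # 96-token prefix-cache hit
    extend_only_end = 8                    # end of this rank's chunk within the extend
    for bake, readd in [(True, True), (True, False), (False, True)]:
        end = indexer_kv_end(metadata_kv_len_prev(extend_only_end, req, bake), req, readd)
        print(f"bake={bake} readd={readd} end={end} ok={in_bounds(end, req)}")
```

With a 96-token prefix hit, baking the prefix in and re-adding it yields an end offset of 200 for a 128-token request, which is out of bounds. Either single addition yields 104.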
kpham-sgl approved these changes on Apr 17, 2026
`prepare_context_parallel_metadata` was folding `prefix_len` into the int fields, guarded by `len(seqs_len) == 1`. When the scheduler packs multiple requests into one CP-extend (which happens under `max_running_requests=32` + `speculative_attention_mode='prefill'`), that guard falls back to `prefix_len=0` and the cached-prefix offset is silently dropped. Under the companion 93ccfc7 fix that removed the in-function prefix-add, `_get_topk_ragged_with_cp` then indexes into an extend-only K range, and the indexer's `ke_offset` gets truncated on every prefix-cache hit, tanking GSM8K to ~0.52.

Restore the pre-refactor contract: metadata stores extend-only offsets, and `_get_topk_ragged_with_cp` re-adds the batch-0 cached prefix via `(seq_lens_cpu - extend_seq_lens_cpu)`, matching the batch-0 scope of `block_tables[0]`.

Measured TestDeepseekV32CPInSeqSplit 0.970 and TestDeepseekV32CPRoundRobinSplit 0.975 on gsm8k/200 (H200, dp=2 cp=4 / dp=1 cp=8, EAGLE spec).

`kv_len_prev_tensor` / `kv_len_next_tensor` (1-D, shape `(1,)`) stay as-is for the non-NSA Qwen FA `cache_seqlens` path.
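The packed CP-extend case can be modeled the same way. In the sketch below, `guarded_prefix_len` is a hypothetical mirror of the `len(seqs_len) == 1` guard described in this commit, and `batch0_prefix_len` mirrors the restored approach of recomputing the batch-0 prefix from `seq_lens_cpu - extend_seq_lens_cpu`. Plain lists stand in for the CPU tensors; neither function is the actual sglang code.

```python
from typing import Sequence


def guarded_prefix_len(seq_lens: Sequence[int], extend_lens: Sequence[int]) -> int:
    """Hypothetical: the prefix is only folded in when the CP-extend holds one request."""
    if len(seq_lens) == 1:
        return seq_lens[0] - extend_lens[0]
    return 0  # packed CP-extend: the cached-prefix offset is silently dropped


def batch0_prefix_len(seq_lens_cpu: Sequence[int], extend_seq_lens_cpu: Sequence[int]) -> int:
    """Hypothetical: recompute request 0's prefix, matching the batch-0 scope of block_tables[0]."""
    return seq_lens_cpu[0] - extend_seq_lens_cpu[0]


if __name__ == "__main__":
    single = ([128], [32])
    packed = ([128, 64], [32, 64])  # request 0 hit a 96-token prefix; request 1 did not
    for name, (seq, ext) in [("single", single), ("packed", packed)]:
        print(name, guarded_prefix_len(seq, ext), batch0_prefix_len(seq, ext))
```

For the packed batch, the guard reports 0 while the batch-0 recomputation reports 96. The restored contract relies on the latter offset.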
Fridge003 commented on Apr 18, 2026
…non-NSA CP
Per review: `_get_topk_ragged_with_cp` only runs on the NSA model path, so
the previous commit's unconditional revert to extend-only int fields broke
non-NSA CP (e.g. qwen3-moe), where FlashAttention consumes kv_len_prev
directly as cache_seqlens and needs the prefix baked in.
Split the two contracts:
- NSA CP (is_nsa_enable_prefill_cp()): keep kv_len_prev/next as extend-only;
`_get_topk_ragged_with_cp` still re-adds the cached-prefix offset from
(seq_lens_cpu - extend_seq_lens_cpu), which also handles the multi-request
CP-extend packing case where the `len(seqs_len) == 1` guard falls back to
prefix_len=0.
- Non-NSA CP: restore the pre-refactor behavior that bakes prefix_len into
kv_len_prev/next, so flash_attn_with_kvcache sees the full cache_seqlens.
Measured on H200 (gsm8k, 200 examples):
- TestDeepseekV32CPInSeqSplit 0.970 (NSA path)
- TestQwen330B 0.970 (non-NSA FA path, threshold 0.85)
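The two contracts can be summarized as a producer/consumer pair. This sketch is hypothetical: `nsa_prefill_cp` stands in for `is_nsa_enable_prefill_cp()`, `nsa_consumer` models the prefix re-add in `_get_topk_ragged_with_cp`, and `fa_consumer` models passing `kv_len_prev` as `cache_seqlens` to `flash_attn_with_kvcache`. What it shows is that each path adds the cached prefix exactly once, and that mixing the contracts either double-counts or drops it.

```python
def produce_kv_len_prev(extend_only_end: int, prefix_len: int, nsa_prefill_cp: bool) -> int:
    """Hypothetical producer: what CP metadata stores as kv_len_prev."""
    if nsa_prefill_cp:
        return extend_only_end           # NSA CP: extend-only offset
    return extend_only_end + prefix_len  # non-NSA CP: cached prefix baked in


def nsa_consumer(kv_len_prev: int, seq_len: int, extend_len: int) -> int:
    # NSA indexer path: re-add the cached prefix (seq_lens_cpu - extend_seq_lens_cpu).
    return kv_len_prev + (seq_len - extend_len)


def fa_consumer(kv_len_prev: int) -> int:
    # Non-NSA FlashAttention path: kv_len_prev is used directly as cache_seqlens.
    return kv_len_prev


if __name__ == "__main__":
    seq_len, extend_len, extend_only_end = 128, 32, 8
    prefix_len = seq_len - extend_len
    nsa = nsa_consumer(produce_kv_len_prev(extend_only_end, prefix_len, True), seq_len, extend_len)
    fa = fa_consumer(produce_kv_len_prev(extend_only_end, prefix_len, False))
    print(nsa, fa)  # prints 104 104: both paths see the same absolute KV length
```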
Author (Collaborator): `/rerun-stage stage-c-test-8-gpu-h200`

Author (Collaborator): `/rerun-stage stage-c-test-4-gpu-h100`

Contributor: ✅ Triggered

Contributor: ✅ Triggered

Author (Collaborator): `/rerun-stage stage-c-test-deepep-8-gpu-h200`

Contributor: ✅ Triggered
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026:
…llel (sgl-project#22914)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request on Apr 27, 2026:
…llel (sgl-project#22914)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Remove duplicated context-parallel utility functions from `layers/attention/nsa/utils.py`, consolidating them into `layers/utils/cp_utils.py`
- Unify `NSAContextParallelMetadata` into `ContextParallelMetadata` (identical fields)
- Merge the `nsa_cp_metadata` field into `attn_cp_metadata` on `ForwardBatch` (the two fields were never set at the same time)
- Extend `cp_utils.py` functions with round-robin split support and symmetric memory allocation
- Rename `can_cp_split` → `can_nsa_cp_split` to avoid a name collision with the generic version
- Replace `prepare_input_dp_with_cp_dsa` with `prepare_context_parallel_metadata` (which has better `prefix_len` handling)
- Update all callers: `deepseek_v2.py`, `deepseek_nextn.py`, `nsa_indexer.py`, `forward_batch_info.py`, `schedule_batch.py`, `ascend_backend.py`

Test plan
- `test/registered/cp/test_deepseek_v32_cp_single_node.py` on H200

🤖 Generated with Claude Code