
Support batch size > 1 when enable CP #23269

Open

Shunkangz wants to merge 5 commits into sgl-project:main from Shunkangz:cp_multi_batch

Conversation

@Shunkangz (Contributor) commented Apr 20, 2026

Motivation

Enable batch size > 1 with context parallelism (CP).

Modifications

The main modification is to the context_parallel_metadata for attention, which now has to describe every request in the batch rather than a single sequence.
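
A minimal, hypothetical sketch of the shape this per-batch metadata might take (the class, field, and function names below are illustrative, not the actual sglang API): with bs > 1, each request's extend tokens are sharded across CP ranks, so the metadata has to track per-request local lengths rather than a single scalar.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical structure for illustration; not the actual sglang code.
    @dataclass
    class ContextParallelMetadata:
        cp_rank: int
        cp_size: int
        local_extend_lens: List[int]  # tokens this CP rank owns, per request

    def build_cp_metadata(
        extend_lens: List[int], cp_rank: int, cp_size: int
    ) -> ContextParallelMetadata:
        # Even split per request; a real implementation must also handle the
        # zigzag layout discussed later in this thread.
        local = [
            l // cp_size + (1 if cp_rank < l % cp_size else 0)
            for l in extend_lens
        ]
        return ContextParallelMetadata(cp_rank, cp_size, local)

    # Example: a batch of two requests split across 4 CP ranks.
    meta = build_cp_metadata(extend_lens=[10, 7], cp_rank=1, cp_size=4)
    print(meta.local_extend_lens)  # [3, 2]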

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@Shunkangz (Contributor, Author) commented:

/tag-run-ci-label

@Shunkangz (Contributor, Author) commented:

/tag-and-rerun-ci

@Shunkangz (Contributor, Author) commented:

/rerun-failed-ci

@kpham-sgl (Collaborator) left a review comment:

@Shunkangz Thank you for the contribution; I left some reviews. There will be another round of more thorough review. We can discuss in the Slack channel as well.
A couple of things I want to call out:

  • During development of MLA CP, I found two notable bugs in the current MHA CP implementation. The first is

        # MLA/MHA CP: prepare_mlp_sync_batch pads extend tokens up to
        # lcm(attn_tp_size, attn_cp_size), so cache_seqlens_cp can exceed
        # seq_lens_cpu.max(). Widen page_table by the pad delta to keep
        # FA3's causal reads in-bounds; widened columns index KV slot 0
        # (req_to_token is zero-init) and outputs for padding queries are
        # discarded downstream.
        if (
            self.attn_cp_size > 1
            and forward_batch.global_num_tokens_cpu is not None
            and forward_batch.extend_num_tokens is not None
            and forward_batch.extend_seq_lens_cpu is not None
        ):
            padded_extend = int(forward_batch.extend_num_tokens)
            real_extend = int(sum(forward_batch.extend_seq_lens_cpu))
            pad_delta = padded_extend - real_extend
            if pad_delta > 0:
                metadata.max_seq_len_k += pad_delta

    and the second is

        # Derive prefix offset from unpadded CPU tensors. Both `seqs_len` and
        # `extend_lens` are unpadded by the caller. Using the padded `kv_len`
        # here would undercount `prefix_len` by the padding amount and shift
        # the FA causal horizon.

    (a worked numeric sketch of the first issue follows this list)

  • We need an MHA CP zigzag-mode test for attn_cp_size == 4 and bs > 1.
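
A worked numeric sketch of the pad-delta issue above, using made-up sizes (all values are illustrative, and the sketch assumes the batch extend count is rounded up to the next multiple of the lcm, per the quoted comment):

    import math

    # Illustrative values only; the real sizes come from the server config.
    attn_tp_size, attn_cp_size = 2, 4
    extend_seq_lens_cpu = [5, 3, 3]  # unpadded per-request extend lengths

    # prepare_mlp_sync_batch pads the total extend-token count up to a
    # multiple of lcm(attn_tp_size, attn_cp_size) (assumed rounding rule).
    granularity = math.lcm(attn_tp_size, attn_cp_size)                   # 4
    real_extend = sum(extend_seq_lens_cpu)                               # 11
    padded_extend = math.ceil(real_extend / granularity) * granularity   # 12
    pad_delta = padded_extend - real_extend                              # 1

    # The quoted fix widens the attention metadata by this delta so FA3's
    # causal reads for the padding queries stay in-bounds:
    #     metadata.max_seq_len_k += pad_delta

    # The second issue, sketched: prefix offsets must come from the
    # *unpadded* per-request tensors, never from the padded batch total.
    seqs_len = [9, 6, 7]  # unpadded total KV length per request
    prefix_lens = [s - e for s, e in zip(seqs_len, extend_seq_lens_cpu)]  # [4, 3, 4]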

@Shunkangz (Contributor, Author) commented:

Thank you for pointing this out. In this PR, I only want to support bs > 1 with GQA models. I left the MLA CP part as the original implementation. I believe MLA CP should be refactored and aligned with our existing logic, such as the args, the layer communicator, and so on.

@kpham-sgl (Collaborator) commented:

Ah, sorry, I should have been clearer here:

  1. is an issue about the padding in prepare_mlp_sync_batch causing cache_seqlens_cp to go over seq_lens_cpu.max(). This bug also occurs for MHA CP and will likely impact this PR.
  2. is an issue about the padded kv_len messing up the metadata computation. I think we already agree on this: Support batch size > 1 when enable CP #23269 (comment)

@Shunkangz (Contributor, Author) commented:

For 1, let's discuss the details over Slack. For 2, I think the existing TestContextParallelMetadata already covers this. Can you confirm?

@kpham-sgl (Collaborator) commented May 6, 2026:

Yes, let's discuss 1 further in Slack tomorrow. For 2, sorry, what test is this?

@Shunkangz (Contributor, Author) commented:

  • We need an MHA CP zigzag-mode test for attn_cp_size == 4 and bs > 1.

I mean that TestContextParallelMetadata might already cover this.
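
For reference, a purely hypothetical sketch of the kind of assertion such a metadata test could make for the padded-kv_len issue (the actual contents of TestContextParallelMetadata are not shown in this thread; all names and values below are illustrative):

    import math
    import unittest

    class TestPrefixFromUnpaddedLens(unittest.TestCase):
        # Illustrative only, NOT the real TestContextParallelMetadata: it just
        # demonstrates the invariant under discussion, i.e. prefix offsets are
        # derived from unpadded per-request lengths even when the batch-level
        # extend count is padded up to a multiple of the lcm.
        def test_prefix_ignores_batch_padding(self):
            seqs_len = [9, 6, 7]     # unpadded total KV length per request
            extend_lens = [5, 3, 3]  # unpadded extend length per request

            # Batch-level padding with hypothetical parallel sizes.
            granularity = math.lcm(2, 4)
            real_extend = sum(extend_lens)
            padded_extend = math.ceil(real_extend / granularity) * granularity
            self.assertGreater(padded_extend, real_extend)  # padding occurred

            # Prefix offsets must come from the unpadded tensors.
            prefix_lens = [s - e for s, e in zip(seqs_len, extend_lens)]
            self.assertEqual(prefix_lens, [4, 3, 4])

    if __name__ == "__main__":
        unittest.main()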

@Shunkangz (Contributor, Author) commented:

/rerun-failed-ci

@Shunkangz (Contributor, Author) commented:

/rerun-failed-ci
