Checklist
Motivation
In real workloads, request sequence lengths on the prefill nodes (4 nodes, tp1, dp32) are not uniform. Consider the following two cases:
- Case 1:
Each dp rank has extend_seq_len = [100, 2000], extend_prefix_len = [0, 0].
If we split the batch into [100] and [2000], the computation and communication times of the two micro-batches differ significantly, which hurts overlap efficiency.
- Case 2:
- dp0: batch_size = 1, extend_seq_len = [4000], extend_prefix_len = [0].
- dp1-dp31: batch_size = 2, extend_seq_len = [2000, 2000], extend_prefix_len = [0, 0].
In this scenario, dp0 has only one request and cannot be split into two micro-batches by request, so either overlap cannot be enabled, or dp0 must construct an idle batch to enable it.
In both cases, the full potential of overlap cannot be achieved.
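The imbalance in both cases can be reproduced with a small sketch. It assumes micro-batches are formed by cutting the batch at a request boundary; `split_by_request` is a hypothetical helper written for illustration, not a function from sglang.

```python
from typing import List, Optional, Tuple


def split_by_request(
    extend_seq_lens: List[int],
) -> Optional[Tuple[List[int], List[int]]]:
    """Cut a batch into two micro-batches at a request boundary.

    Hypothetical helper: picks the boundary that minimizes the token-count
    difference between the two halves. Returns None when the batch has
    fewer than two requests and therefore cannot be cut this way.
    """
    if len(extend_seq_lens) < 2:
        return None
    total = sum(extend_seq_lens)
    best_idx, best_gap = 1, None
    running = 0
    for idx in range(1, len(extend_seq_lens)):
        running += extend_seq_lens[idx - 1]
        gap = abs(total - 2 * running)  # |tokens in first half - tokens in second half|
        if best_gap is None or gap < best_gap:
            best_idx, best_gap = idx, gap
    return extend_seq_lens[:best_idx], extend_seq_lens[best_idx:]


case1 = split_by_request([100, 2000])           # 100 tokens vs 2000 tokens
case2_dp0 = split_by_request([4000])            # no boundary to cut at
case2_others = split_by_request([2000, 2000])   # balanced
```

Case 1 yields a 20x difference in tokens between the two micro-batches, and dp0 in Case 2 has no valid cut at all, while dp1-dp31 split evenly.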
To address these issues, we propose two-chunk overlap, which also falls within the scope of two-batch overlap.
In two-chunk overlap, there is a latent dependency between the two micro-batches in the MLA computation.
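One plausible reading of two-chunk overlap is sketched below. It assumes the batch is cut at a token boundary rather than a request boundary, so a single request may span both chunks. `MicroBatch` and `split_two_chunks` are hypothetical names for illustration and are not taken from the sglang code base. Under this assumption, the tail of a split request sees the head as additional prefix, which is presumably where the dependency between the two micro-batches comes from.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MicroBatch:
    # Hypothetical container mirroring the field names used in this issue.
    extend_seq_lens: List[int] = field(default_factory=list)
    extend_prefix_lens: List[int] = field(default_factory=list)


def split_two_chunks(
    extend_seq_lens: List[int], extend_prefix_lens: List[int]
) -> Tuple[MicroBatch, MicroBatch]:
    """Split a batch into two chunks with (nearly) equal token counts.

    Assumption: a request that straddles the cut is divided, and the tokens
    placed in the first chunk are counted as prefix for the second chunk.
    """
    total = sum(extend_seq_lens)
    remaining = (total + 1) // 2  # token budget of the first chunk
    first, second = MicroBatch(), MicroBatch()
    for seq_len, prefix_len in zip(extend_seq_lens, extend_prefix_lens):
        take = min(seq_len, remaining)
        if take > 0:
            first.extend_seq_lens.append(take)
            first.extend_prefix_lens.append(prefix_len)
            remaining -= take
        rest = seq_len - take
        if rest > 0:
            second.extend_seq_lens.append(rest)
            # The head of this request lives in the first chunk, so the
            # second chunk must attend to it as extra prefix.
            second.extend_prefix_lens.append(prefix_len + take)
    return first, second


case1_a, case1_b = split_two_chunks([100, 2000], [0, 0])
case2_a, case2_b = split_two_chunks([4000], [0])
```

With this sketch, Case 1 becomes two chunks of 1050 tokens each, and dp0 in Case 2 becomes two chunks of 2000 tokens each, matching the per-chunk load of dp1-dp31.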

We implemented two-chunk overlap in an earlier version of sglang, which differs significantly from the current version.
Two-chunk overlap also supports idle batches, so it still works when some dp ranks are idle, as in the following case:
- dp0: idle batch
- dp1-dp31: extend_seq_len = [2000]
Related resources
No response
Case 1: we split it into: (figure not recovered)
Case 2: we split it into: (figure not recovered)