[Feature] two-chunk overlap for DeepSeekV3/R1 #6328

@House-West

Description

Checklist

Motivation

In practice, request sequence lengths on a prefill node (4 nodes, tp1, dp32) are not uniform. Consider the following two cases:

  • Case 1:
    each dp rank has extend_seq_len = [100, 2000], extend_prefix_len = [0, 0].
    If we split it by request into [100] and [2000], the computation and communication times of the two micro-batches differ significantly, which hurts overlap efficiency.
    
  • Case 2:
    • dp0: batch_size = 1, extend_seq_len = [4000], extend_prefix_len = [0].
    • dp1-dp31: batch_size = 2, extend_seq_len = [2000, 2000], extend_prefix_len = [0, 0].
      In this scenario, either overlap cannot be enabled, or dp0 must construct an idle batch to enable it.
      Either way, the full potential of overlap cannot be achieved.
      

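To address these issues, we propose two-chunk overlap, which falls within the scope of two-batch overlap. Instead of splitting only at request boundaries, a sequence can be split at a token boundary, so that the two micro-batches carry roughly the same number of tokens.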

  • Case 1:
    we split it into:

    • extend_seq_len0 = [100, 950], extend_prefix_len0 = [0, 0]
    • extend_seq_len1 = [1050], extend_prefix_len1 = [950]
  • Case 2:
    we split the dp0 batch into:

    • extend_seq_len0 = [2000], extend_prefix_len0 = [0]
    • extend_seq_len1 = [2000], extend_prefix_len1 = [2000]
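
The splits above can be reproduced with a small sketch. This is only an illustration under the assumption that the split point is the midpoint of the total token count on a dp rank; `split_two_chunks` is a hypothetical helper name, not an SGLang API, and the actual implementation may choose the split point differently.

```python
from typing import List, Tuple

# (extend_seq_lens, extend_prefix_lens) for one micro-batch
Chunk = Tuple[List[int], List[int]]


def split_two_chunks(
    extend_seq_lens: List[int], extend_prefix_lens: List[int]
) -> Tuple[Chunk, Chunk]:
    """Split one dp rank's extend batch into two token-balanced chunks.

    Hypothetical helper: assumes the split point is the midpoint of the
    total number of extend tokens. A sequence that straddles the split
    point is cut in two; its second part sees the first part as prefix.
    """
    if len(extend_seq_lens) != len(extend_prefix_lens):
        raise ValueError("seq_lens and prefix_lens must have the same length")

    budget = sum(extend_seq_lens) // 2  # tokens still to place in chunk 0
    seq0: List[int] = []
    pre0: List[int] = []
    seq1: List[int] = []
    pre1: List[int] = []

    for seq_len, prefix_len in zip(extend_seq_lens, extend_prefix_lens):
        take = min(seq_len, budget)  # part of this sequence going to chunk 0
        budget -= take
        if take > 0:
            seq0.append(take)
            pre0.append(prefix_len)
        if seq_len - take > 0:
            seq1.append(seq_len - take)
            # The remainder treats the tokens placed in chunk 0 as prefix.
            pre1.append(prefix_len + take)

    return (seq0, pre0), (seq1, pre1)


case1 = split_two_chunks([100, 2000], [0, 0])
case2_dp0 = split_two_chunks([4000], [0])
case2_other = split_two_chunks([2000, 2000], [0, 0])
print(case1)       # (([100, 950], [0, 0]), ([1050], [950]))
print(case2_dp0)   # (([2000], [0]), ([2000], [2000]))
print(case2_other) # (([2000], [0]), ([2000], [0]))
```

With this rule every dp rank produces two micro-batches of about equal token count, whether it holds one long request or several shorter ones. An empty input yields two empty chunks.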

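In two-chunk overlap, there is a latent dependency between the two micro-batches in the MLA computation: when a sequence is split across the two chunks, the second chunk has a non-zero extend_prefix_len, so its attention reads the KV produced by the first chunk.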

Image
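
The dependency can be illustrated with a toy example. The sketch below uses plain causal softmax attention in pure Python rather than the real MLA kernel, and every name in it (`attend`, `causal_extend`, the token values) is made up for illustration. It shows that the second chunk matches the unsplit result only if it attends over the KV written by the first chunk.

```python
import math
from typing import List, Tuple

Vec = List[float]
Token = Tuple[Vec, Vec, Vec]  # (query, key, value) of one token


def attend(q: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """Softmax attention of a single query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / norm
        for d in range(len(values[0]))
    ]


def causal_extend(tokens: List[Token], kv_cache: List[Tuple[Vec, Vec]]) -> List[Vec]:
    """Process new tokens causally, appending their K/V to kv_cache in place."""
    outputs = []
    for q, k, v in tokens:
        kv_cache.append((k, v))
        keys = [kv[0] for kv in kv_cache]
        values = [kv[1] for kv in kv_cache]
        outputs.append(attend(q, keys, values))
    return outputs


# One 4-token sequence with 2-dimensional q/k/v (arbitrary toy values).
tokens: List[Token] = [
    ([1.0, 0.0], [0.5, 0.1], [1.0, 2.0]),
    ([0.0, 1.0], [0.2, 0.7], [3.0, 4.0]),
    ([1.0, 1.0], [0.9, 0.3], [5.0, 6.0]),
    ([0.5, 0.5], [0.4, 0.8], [7.0, 8.0]),
]

# Reference: the whole sequence processed in one piece.
full = causal_extend(tokens, [])

# Two chunks sharing one cache: chunk 1 sees chunk 0's KV as its prefix.
cache: List[Tuple[Vec, Vec]] = []
chunked = causal_extend(tokens[:2], cache) + causal_extend(tokens[2:], cache)

# Chunk 1 run without chunk 0's KV: its outputs no longer match.
broken = causal_extend(tokens[2:], [])
```

Here `chunked` equals `full`, while `broken` differs from `full[2:]`, which is why the second micro-batch's attention cannot start before the first micro-batch has written the shared prefix KV.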

We implemented two-chunk overlap on an earlier version of SGLang, which differs significantly from the current version.

We also support idle batches in two-chunk overlap, which covers the case where some dp ranks are idle, for example:

  • dp0: idle batch
  • dp1-dp31: extend_seq_len = [2000]

Related resources

No response
