[Feature] Support sequence parallel on dense model & mlp layer #10519

@JiaruiChang5268

Description

1. Motivation

The primary motivations for introducing SP are:

  1. Reducing redundant LayerNorm computations

    • In pure TP, LayerNorm (LN) is replicated across all tensor-parallel ranks.
    • With SP, LN can be partitioned along the sequence dimension, avoiding duplicated LN computation and improving efficiency.
  2. Enabling fused operator design

    • SP lays the groundwork for more effective kernel fusion strategies (e.g., combining LN, residual add, bias, activation).
    • This leads to better device utilization and lower kernel launch overhead.
  3. Better alignment with multi-dimensional parallelism

    • SP works as a natural complement to TP/EP (Expert Parallelism), especially in cases where EP+TP requires SP for correctness.
    • It provides a cleaner communication pattern (All-Reduce → RS/AG) that simplifies later optimization.

2. Technical Plan

2.1 Key Idea

  • Split the sequence dimension across N devices.
  • Each device processes only its local tokens.
  • Communication patterns are adapted from all-reduce to reduce-scatter + all-gather, as sketched below.
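
A minimal sketch of this substitution (assuming torch.distributed is already initialized, x holds the pre-reduction partial activations of shape [seq, hidden], and seq is divisible by the world size; both paths yield the same full tensor, but the SP path runs the per-token work on 1/N of the sequence):

```python
import torch
import torch.distributed as dist

def tp_layernorm_path(x: torch.Tensor, ln: torch.nn.LayerNorm) -> torch.Tensor:
    # Pure TP: reduce the full [seq, hidden] partial sums, then every rank
    # redundantly runs LayerNorm over the whole sequence.
    dist.all_reduce(x)
    return ln(x)

def sp_layernorm_path(x: torch.Tensor, ln: torch.nn.LayerNorm) -> torch.Tensor:
    # SP: reduce-scatter leaves each rank with seq/world_size tokens,
    # LayerNorm runs on the local shard only, and all-gather restores the
    # full sequence where a downstream TP matmul needs it.
    world = dist.get_world_size()
    shard = x.new_empty(x.shape[0] // world, x.shape[1])
    dist.reduce_scatter_tensor(shard, x)
    shard = ln(shard)                        # 1/world of the LN work per rank
    full = torch.empty_like(x)
    dist.all_gather_into_tensor(full, shard)
    return full                              # mathematically matches the TP path
```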

2.2 Expert Parallelism (Existing SP plan)

  • DeepEP benefits from smaller per-device buffers and better overlap of communication and computation.
  • Communication patterns are adapted from all-reduce to reduce-scatter + all-to-all.
  • Before the MLP inside an MoE block, only scattered data is required. Tokens are dispatched via all-to-all to the target experts; after expert computation, another all-to-all brings the results back, as sketched below.
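A hypothetical sketch of that dispatch/combine exchange (the function and split-size bookkeeping are illustrative, not DeepEP's actual API; tokens are assumed to be pre-sorted into contiguous blocks by destination rank):

```python
import torch
import torch.distributed as dist

def dispatch_combine(tokens, in_splits, out_splits, expert_fn, group=None):
    # tokens: [num_local_tokens, hidden], grouped so that tokens destined
    # for rank i form one contiguous block, in rank order.
    # in_splits[i]:  number of local tokens routed to rank i.
    # out_splits[i]: number of tokens rank i routes to this rank.
    recv = tokens.new_empty(sum(out_splits), tokens.shape[-1])
    dist.all_to_all_single(recv, tokens,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)
    out = expert_fn(recv)                  # local experts process received tokens
    back = torch.empty_like(tokens)
    dist.all_to_all_single(back, out,      # mirrored exchange returns results
                           output_split_sizes=in_splits,
                           input_split_sizes=out_splits, group=group)
    return back
```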

2.3 Plan to support MLP Layers

  • MLPs are token-wise independent.
  • Under SP, MLPs operate on local token shards without modification.
  • Communication is simplified via RS/AG.
  • For all transformer layers except the first and the last, the data remains scattered. Inside dense linear layers, the required tensor aggregation is performed via all-gather (before row-parallel matmuls) and reduce-scatter (after them), as sketched below.
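A sketch of a dense MLP under SP along these lines (the weight shapes and the exact gather placement are assumptions following the Megatron-LM-style pattern, not the final sglang implementation; TP and SP are assumed to share one process group):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sp_mlp(x_shard, w_gate, w_up, w_down):
    # x_shard: [seq/world, hidden] local sequence shard (SCATTERED).
    # w_gate, w_up: column-parallel shards [hidden, inter/world];
    # w_down: row-parallel shard [inter/world, hidden].
    world = dist.get_world_size()
    # all-gather: the TP-sharded matmuls need every token for their
    # local weight shard, so materialize the full sequence first.
    x = x_shard.new_empty(x_shard.shape[0] * world, x_shard.shape[1])
    dist.all_gather_into_tensor(x, x_shard)
    h = F.silu(x @ w_gate) * (x @ w_up)   # [seq, inter/world]
    partial = h @ w_down                  # [seq, hidden] partial sum
    # SP: reduce-scatter instead of all-reduce; the output stays
    # SCATTERED for the next layer.
    out = torch.empty_like(x_shard)
    dist.reduce_scatter_tensor(out, partial)
    return out
```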

3. Related Changes

To support Sequence Parallelism (SP) in dense layers, several files will require modifications.
Below is an initial list of candidate files; detailed design/implementation notes can be filled in during development:

Communication utilities (e.g., communicator.py)

  • Implement ScatterMode and LayerScatterModes to enforce: all intermediate layers remain SCATTERED; only the first and last layers are TP_ATTN_FULL.
  • Replace the TP all-reduces with reduce-scatter / all-gather (RS/AG) where modes change; a sketch of the modes follows.
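
A hypothetical sketch of those modes (the enum members come from this issue; the fields and the helper are assumptions):

```python
from enum import Enum, auto
from dataclasses import dataclass

class ScatterMode(Enum):
    SCATTERED = auto()     # activations sharded along the sequence dimension
    TP_ATTN_FULL = auto()  # full sequence replicated across TP ranks

@dataclass
class LayerScatterModes:
    layer_input: ScatterMode
    layer_output: ScatterMode

    @classmethod
    def for_layer(cls, layer_id: int, num_layers: int) -> "LayerScatterModes":
        # Only the first layer consumes a full input and only the last layer
        # emits a full output; all intermediate layers stay SCATTERED.
        return cls(
            layer_input=ScatterMode.TP_ATTN_FULL if layer_id == 0
                        else ScatterMode.SCATTERED,
            layer_output=ScatterMode.TP_ATTN_FULL if layer_id == num_layers - 1
                         else ScatterMode.SCATTERED,
        )
```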

Linear layer implementations (e.g., linear.py)

  • Add logic for optional input all-gather (allgather_input) before row-parallel matmuls.
  • Switch post-matmul reductions:
    • Without SP → fall back to legacy all-reduce.
    • With SP → replace with reduce-scatter along the sequence/batch dimension (see the sketch below).
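
A sketch of that switch (not sglang's actual linear.py; `use_sp` mirrors the issue's wording and `allgather_input` is the parameter named above):

```python
import torch
import torch.distributed as dist

def row_parallel_forward(x, weight, use_sp: bool, allgather_input: bool, group=None):
    # x: [tokens_local, in_features/world] input shard for this TP rank;
    # weight: [out_features, in_features/world] row-parallel weight shard.
    if allgather_input:
        # Optional: gather the sequence shards so the matmul sees all tokens;
        # whether this is needed depends on the layer's scatter mode.
        world = dist.get_world_size(group)
        full = x.new_empty(x.shape[0] * world, x.shape[1])
        dist.all_gather_into_tensor(full, x, group=group)
        x = full
    partial = x @ weight.t()  # partial sum over the sharded in_features
    if use_sp:
        # SP path: reduce-scatter along the sequence dim, output stays sharded.
        world = dist.get_world_size(group)
        out = partial.new_empty(partial.shape[0] // world, partial.shape[1])
        dist.reduce_scatter_tensor(out, partial, group=group)
        return out
    dist.all_reduce(partial, group=group)  # legacy TP path: replicated output
    return partial
```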

Model definition files

  • Integrate LayerScatterModes and sp arguments into the model logic to enforce the correct scatter/full modes across layers, as sketched below.
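
Hypothetical wiring inside a model's __init__ (DecoderLayer is an illustrative name), reusing the LayerScatterModes sketch above:

```python
import torch.nn as nn

# inside the model's __init__:
self.layers = nn.ModuleList(
    DecoderLayer(
        config,
        layer_id=i,
        scatter_modes=LayerScatterModes.for_layer(i, config.num_hidden_layers),
    )
    for i in range(config.num_hidden_layers)
)
```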

4. Roadmap

Phase 1

  • Implement basic support for SP in dense layers.

Phase 2

  • Update model definitions to adopt SP logic.
  • Provide initial performance benchmarks.

