### Checklist

### Motivation

## 1. Motivation
The primary motivations for introducing SP are:
- **Reducing redundant LayerNorm computations**
  - In pure TP, LayerNorm (LN) is replicated across all tensor-parallel ranks.
  - With SP, LN can be partitioned along the sequence dimension, avoiding duplicated LN computation and improving efficiency.
- **Enabling fused operator design**
  - SP lays the groundwork for more effective kernel fusion strategies (e.g., combining LN, residual add, bias, and activation).
  - This leads to better device utilization and lower kernel launch overhead.
- **Better alignment with multi-dimensional parallelism**
  - SP is a natural complement to TP and EP (Expert Parallelism), especially in cases where EP+TP requires SP for correctness.
  - It provides a cleaner communication pattern (all-reduce → reduce-scatter/all-gather) that simplifies later optimization.
## 2. Technical Plan

### 2.1 Key Idea
- Split the sequence dimension across N devices.
- Each device processes only its local tokens.
- Communication patterns are adapted from all-reduce to reduce-scatter + all-gather.
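The equivalence behind this rewrite — an all-reduce decomposes into a reduce-scatter followed by an all-gather — can be illustrated with a toy single-process sketch (plain Python lists standing in for per-rank buffers; no real communication):

```python
# Toy illustration (no real devices): an all-reduce over N ranks equals
# a reduce-scatter followed by an all-gather, which is the transformation
# SP relies on to keep intermediate activations sequence-sharded.

def all_reduce(shards):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*shards)]
    return [total[:] for _ in shards]

def reduce_scatter(shards):
    """Each rank keeps only its 1/N slice of the summed vector."""
    n = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    chunk = len(total) // n
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]

def all_gather(slices):
    """Concatenate every rank's slice back into the full vector."""
    full = [x for s in slices for x in s]
    return [full[:] for _ in slices]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]  # partial results on 2 ranks
assert all_gather(reduce_scatter(ranks)) == all_reduce(ranks)
```

Under SP, the all-gather half can often be deferred or fused, so interior layers only pay for the reduce-scatter and operate on their local sequence slice.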
### 2.2 Expert Parallelism (Existing SP Plan)
- DeepEP benefits from smaller per-device buffers and better overlap of communication and computation.
- Communication patterns are adapted from all-reduce to scatter + all-to-all.
- Before the MLP inside an MoE block, only scattered data is required: tokens are dispatched via all-to-all to their target experts, and after expert computation another all-to-all returns the results.
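The dispatch/combine pattern can be sketched as a toy two-rank example (a single-process illustration with hypothetical bucketing by expert id, not DeepEP's actual API):

```python
# Toy 2-rank MoE dispatch: each rank buckets its local tokens by the
# rank owning the target expert, exchanges buckets via all-to-all,
# applies the local "expert", then reverses the exchange to combine.

def all_to_all(send):
    """send[src][dst] -> recv[dst][src]; a transposed bucket exchange."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

n_ranks = 2
local_tokens = [[(0, 1.0), (1, 2.0)],   # rank 0: (expert_id, value)
                [(1, 3.0), (0, 4.0)]]   # rank 1

# Dispatch: bucket tokens by the rank that owns the target expert.
send = [[[t for t in toks if t[0] % n_ranks == dst] for dst in range(n_ranks)]
        for toks in local_tokens]
recv = all_to_all(send)

# Each rank runs its expert (here a stand-in: multiply by 10).
out = [[[(e, v * 10) for (e, v) in bucket] for bucket in r] for r in recv]

# Combine: the reverse all-to-all returns results to their source ranks.
combined = all_to_all(out)
```

Because each rank only ever materializes its own token buckets, per-device buffers shrink as the source text notes, and the two all-to-alls can be overlapped with expert computation.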
### 2.3 Plan to Support MLP Layers
- MLPs are token-wise independent.
- Under SP, MLPs operate on local token shards without modification.
- Communication simplified via RS/AG.
- For all transformer layers except the first and the last, the data remains scattered. Inside dense linear layers, the required tensor aggregation is performed via all-gather (before row-parallel matmuls) and reduce-scatter (after them).
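Token-wise independence is what makes this safe: running an MLP on sequence shards and concatenating matches running it on the full sequence. A minimal sanity check (a toy per-token function standing in for the MLP):

```python
# Minimal check of token-wise independence: an MLP applied per shard
# and concatenated equals the MLP applied to the full sequence, so SP
# requires no change to the MLP computation itself.

def mlp(tokens):
    # stand-in per-token MLP: y = w2 * relu(w1 * x)
    w1, w2 = 2.0, 0.5
    return [w2 * max(0.0, w1 * x) for x in tokens]

seq = [1.0, -2.0, 3.0, 4.0]
shard0, shard1 = seq[:2], seq[2:]   # sequence split across 2 ranks
assert mlp(shard0) + mlp(shard1) == mlp(seq)
```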
## 3. Related Changes
To support Sequence Parallelism (SP) dense layers, several files will require modifications.
Below is an initial list of candidate files; detailed design/implementation notes can be filled in during development:
**Communication utilities (e.g., `communicator.py`)**
- Implement `ScatterMode` and `LayerScatterModes` to enforce that all intermediate layers remain `SCATTERED`, while only the first and last layers are `TP_ATTN_FULL`.
- Replace tensor-parallel all-reduces with reduce-scatter / all-gather (RS/AG) where modes change.
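A hypothetical sketch of what this mode bookkeeping could look like (the names `ScatterMode`, `SCATTERED`, and `TP_ATTN_FULL` come from the list above; the real implementation may differ):

```python
# Hypothetical sketch of per-layer scatter-mode selection; not the
# actual SGLang classes. Intermediate layers stay SCATTERED, while the
# first and last layers use TP_ATTN_FULL.
from dataclasses import dataclass
from enum import Enum, auto

class ScatterMode(Enum):
    SCATTERED = auto()      # activations sharded along the sequence dim
    TP_ATTN_FULL = auto()   # full sequence replicated within the TP group

@dataclass
class LayerScatterModes:
    layer_idx: int
    num_layers: int

    @property
    def mode(self) -> ScatterMode:
        if self.layer_idx in (0, self.num_layers - 1):
            return ScatterMode.TP_ATTN_FULL
        return ScatterMode.SCATTERED

modes = [LayerScatterModes(i, 4).mode for i in range(4)]
```

Centralizing the mode decision in one place lets the communicator pick RS/AG versus all-reduce per layer boundary instead of scattering that logic across model code.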
**Linear layer implementations (e.g., `linear.py`)**
- Add logic for an optional input all-gather (`allgather_input`) before row-parallel matmuls.
- Switch post-matmul reductions:
  - Without SP → fall back to the legacy all-reduce.
  - With SP → replace it with a reduce-scatter along the sequence/batch dimension.
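The reduction switch can be checked numerically with a single-process sketch (NumPy arrays standing in for per-rank shards; an illustration of the math, not the actual `linear.py` code):

```python
# Toy sketch of the post-matmul switch: a row-parallel matmul yields
# partial sums on each rank. Without SP they are all-reduced to a full
# output; with SP they are reduce-scattered along the sequence dim, so
# each rank keeps only its sequence slice of the same result.
import numpy as np

rng = np.random.default_rng(0)
seq, hidden, tp = 4, 6, 2
x = rng.normal(size=(seq, hidden))
w = rng.normal(size=(hidden, hidden))

# Row-parallel split: each rank holds a slice of the input features and
# the matching rows of the weight, producing a partial output.
partials = [x[:, r * 3:(r + 1) * 3] @ w[r * 3:(r + 1) * 3, :] for r in range(tp)]

all_reduced = sum(partials)                           # legacy path: full output on every rank
shards = [sum(partials)[r * 2:(r + 1) * 2, :]          # SP path: each rank keeps a seq slice
          for r in range(tp)]

assert np.allclose(np.concatenate(shards, axis=0), all_reduced)
assert np.allclose(all_reduced, x @ w)
```

The reduce-scatter moves the same bytes per rank as only its share of the all-reduce, which is where the communication savings come from.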
**Model definition files**
- Integrate `LayerScatterModes` and `sp` arguments into the model logic to enforce correct scatter/full modes across layers.
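One possible shape of that wiring (purely illustrative; `pick_reduction` and its arguments are assumptions for this sketch, not SGLang's API):

```python
# Hypothetical per-layer wiring: boundary layers (first/last) keep full
# activations and use all-reduce; interior layers under SP keep sequence
# shards and use reduce-scatter. Names here are illustrative only.

def pick_reduction(sp_enabled: bool, is_boundary_layer: bool) -> str:
    if sp_enabled and not is_boundary_layer:
        return "reduce_scatter"
    return "all_reduce"

num_layers = 4
plan = [pick_reduction(True, i in (0, num_layers - 1)) for i in range(num_layers)]
```

With `sp` disabled, every layer would fall back to `"all_reduce"`, matching the legacy TP behavior described above.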
## 4. Roadmap

**Phase 1**
- Implement basic support for SP in dense layers.

**Phase 2**
- Update model definitions to adopt SP logic.
- Provide initial performance benchmarks.
### Bug Fixes
### Related resources

_No response_