[Feature] Support sequence parallel on dense model & mlp layer #10519

@JiaruiChang5268

Description

1. Motivation

The primary motivations for introducing SP are:

  1. Reducing redundant LayerNorm computations

    • In pure TP, LayerNorm (LN) is replicated across all tensor-parallel ranks.
    • With SP, LN can be partitioned along the sequence dimension, avoiding duplicated LN computation and improving efficiency.
  2. Enabling fused operator design

    • SP lays the groundwork for more effective kernel fusion strategies (e.g., combining LN, residual add, bias, activation).
    • This leads to better device utilization and lower kernel launch overhead.
  3. Better alignment with multi-dimensional parallelism

    • SP works as a natural complement to TP/EP (Expert Parallelism), especially in cases where EP+TP requires SP for correctness.
    • It provides a cleaner communication pattern (All-Reduce → RS/AG) that simplifies later optimization.

2. Technical Plan

2.1 Key Idea

  • Split the sequence dimension across N devices.
  • Each device processes only its local tokens.
  • Communication patterns are adapted from all-reduce to reduce-scatter + all-gather, as sketched below.
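
A minimal sketch of this substitution (assuming torch.distributed is already initialized, x holds the pre-reduction partial activations of shape [seq, hidden], and seq is divisible by the world size; both paths yield the same full tensor, but the SP path runs the per-token work on 1/N of the sequence):

```python
import torch
import torch.distributed as dist

def tp_layernorm_path(x: torch.Tensor, ln: torch.nn.LayerNorm) -> torch.Tensor:
    # Pure TP: reduce the full [seq, hidden] partial sums, then every rank
    # redundantly runs LayerNorm over the whole sequence.
    dist.all_reduce(x)
    return ln(x)

def sp_layernorm_path(x: torch.Tensor, ln: torch.nn.LayerNorm) -> torch.Tensor:
    # SP: reduce-scatter leaves each rank with seq/world_size tokens,
    # LayerNorm runs on the local shard only, and all-gather restores the
    # full sequence where a downstream TP matmul needs it.
    world = dist.get_world_size()
    shard = x.new_empty(x.shape[0] // world, x.shape[1])
    dist.reduce_scatter_tensor(shard, x)
    shard = ln(shard)                        # 1/world of the LN work per rank
    full = torch.empty_like(x)
    dist.all_gather_into_tensor(full, shard)
    return full                              # mathematically matches the TP path
```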

2.2 Expert Parallelism (Existing SP plan)

  • DeepEP benefits from smaller per-device buffers and better overlap of communication and computation.
  • Communication patterns are adapted from all-reduce to reduce-scatter + all-to-all.
  • Before the MLP inside an MoE block, only scattered data is required. Tokens are dispatched via all-to-all to the target experts; after expert computation, another all-to-all brings the results back, as sketched below.
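A hypothetical sketch of that dispatch/combine exchange (the function and split-size bookkeeping are illustrative, not DeepEP's actual API; tokens are assumed to be pre-sorted into contiguous blocks by destination rank):

```python
import torch
import torch.distributed as dist

def dispatch_combine(tokens, in_splits, out_splits, expert_fn, group=None):
    # tokens: [num_local_tokens, hidden], grouped so that tokens destined
    # for rank i form one contiguous block, in rank order.
    # in_splits[i]:  number of local tokens routed to rank i.
    # out_splits[i]: number of tokens rank i routes to this rank.
    recv = tokens.new_empty(sum(out_splits), tokens.shape[-1])
    dist.all_to_all_single(recv, tokens,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)
    out = expert_fn(recv)                  # local experts process received tokens
    back = torch.empty_like(tokens)
    dist.all_to_all_single(back, out,      # mirrored exchange returns results
                           output_split_sizes=in_splits,
                           input_split_sizes=out_splits, group=group)
    return back
```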

2.3 Plan to support MLP Layers

  • MLPs are token-wise independent.
  • Under SP, MLPs operate on local token shards without modification.
  • Communication is simplified via RS/AG.
  • For all transformer layers except the first and the last, the data remains scattered. Inside dense linear layers, the required tensor aggregation is performed via all-gather (before row-parallel matmuls) and reduce-scatter (after them), as sketched below.
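A sketch of a dense MLP under SP along these lines (the weight shapes and the exact gather placement are assumptions following the Megatron-LM-style pattern, not the final sglang implementation; TP and SP are assumed to share one process group):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sp_mlp(x_shard, w_gate, w_up, w_down):
    # x_shard: [seq/world, hidden] local sequence shard (SCATTERED).
    # w_gate, w_up: column-parallel shards [hidden, inter/world];
    # w_down: row-parallel shard [inter/world, hidden].
    world = dist.get_world_size()
    # all-gather: the TP-sharded matmuls need every token for their
    # local weight shard, so materialize the full sequence first.
    x = x_shard.new_empty(x_shard.shape[0] * world, x_shard.shape[1])
    dist.all_gather_into_tensor(x, x_shard)
    h = F.silu(x @ w_gate) * (x @ w_up)   # [seq, inter/world]
    partial = h @ w_down                  # [seq, hidden] partial sum
    # SP: reduce-scatter instead of all-reduce; the output stays
    # SCATTERED for the next layer.
    out = torch.empty_like(x_shard)
    dist.reduce_scatter_tensor(out, partial)
    return out
```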

3. Related Changes

To support Sequence Parallelism (SP) in dense layers, several files will require modifications.
Below is an initial list of candidate files; detailed design/implementation notes can be filled in during development:

Communication utilities (e.g., communicator.py)

  • Implement ScatterMode and LayerScatterModes to enforce: all intermediate layers remain SCATTERED; only the first and last layers are TP_ATTN_FULL.
  • Replace the TP all-reduces with reduce-scatter / all-gather (RS/AG) where modes change; a sketch of the modes follows.
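
A hypothetical sketch of those modes (the enum members come from this issue; the fields and the helper are assumptions):

```python
from enum import Enum, auto
from dataclasses import dataclass

class ScatterMode(Enum):
    SCATTERED = auto()     # activations sharded along the sequence dimension
    TP_ATTN_FULL = auto()  # full sequence replicated across TP ranks

@dataclass
class LayerScatterModes:
    layer_input: ScatterMode
    layer_output: ScatterMode

    @classmethod
    def for_layer(cls, layer_id: int, num_layers: int) -> "LayerScatterModes":
        # Only the first layer consumes a full input and only the last layer
        # emits a full output; all intermediate layers stay SCATTERED.
        return cls(
            layer_input=ScatterMode.TP_ATTN_FULL if layer_id == 0
                        else ScatterMode.SCATTERED,
            layer_output=ScatterMode.TP_ATTN_FULL if layer_id == num_layers - 1
                         else ScatterMode.SCATTERED,
        )
```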

Linear layer implementations (e.g., linear.py)

  • Add logic for optional input all-gather (allgather_input) before row-parallel matmuls.
  • Switch post-matmul reductions:
    • Without SP → fall back to legacy all-reduce.
    • With SP → replace with reduce-scatter along the sequence/batch dimension (see the sketch below).
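
A sketch of that switch (not sglang's actual linear.py; `use_sp` mirrors the issue's wording and `allgather_input` is the parameter named above):

```python
import torch
import torch.distributed as dist

def row_parallel_forward(x, weight, use_sp: bool, allgather_input: bool, group=None):
    # x: [tokens_local, in_features/world] input shard for this TP rank;
    # weight: [out_features, in_features/world] row-parallel weight shard.
    if allgather_input:
        # Optional: gather the sequence shards so the matmul sees all tokens;
        # whether this is needed depends on the layer's scatter mode.
        world = dist.get_world_size(group)
        full = x.new_empty(x.shape[0] * world, x.shape[1])
        dist.all_gather_into_tensor(full, x, group=group)
        x = full
    partial = x @ weight.t()  # partial sum over the sharded in_features
    if use_sp:
        # SP path: reduce-scatter along the sequence dim, output stays sharded.
        world = dist.get_world_size(group)
        out = partial.new_empty(partial.shape[0] // world, partial.shape[1])
        dist.reduce_scatter_tensor(out, partial, group=group)
        return out
    dist.all_reduce(partial, group=group)  # legacy TP path: replicated output
    return partial
```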

Model definition files

  • Integrate LayerScatterModes and sp arguments into the model logic to enforce the correct scatter/full modes across layers, as sketched below.
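
Hypothetical wiring inside a model's __init__ (DecoderLayer is an illustrative name), reusing the LayerScatterModes sketch above:

```python
import torch.nn as nn

# inside the model's __init__:
self.layers = nn.ModuleList(
    DecoderLayer(
        config,
        layer_id=i,
        scatter_modes=LayerScatterModes.for_layer(i, config.num_hidden_layers),
    )
    for i in range(config.num_hidden_layers)
)
```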

4. Roadmap

Phase 1

  • Implement basic support for SP in dense layers.

Phase 2

  • Update model definitions to adopt SP logic.
  • Provide initial performance benchmarks.

