Expert Parallel Fast Path and Router Dtype Restoration (from solar-open 102B)

Hi, I was reading the [Solar Open-102B](https://www.arxiv.org/abs/2601.07022) paper. The authors recently used torchtitan on B200 GPUs to train an MoE with roughly 12B active and 100B total parameters (a12w100B). They mention two tricks that improve MoE training:

<img width="2652" height="844" alt="Image" src="https://github.com/user-attachments/assets/e197f721-e108-4a05-a933-38442e3d9720" />


**Expert Parallel Fast Path** Basically, this removes token padding on the non-EP path.
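
To make the "no padding" idea concrete, here is a minimal sketch of how I read it. It is not torchtitan's code and not the paper's implementation. The function name `moe_forward_unpadded`, the top-1 routing, and the square per-expert weights are all assumptions for illustration. Tokens are sorted by expert and each expert processes its exact contiguous slice, so no group is padded to a fixed size.

```python
import torch

def moe_forward_unpadded(x, expert_ids, expert_weights):
    """Hypothetical helper (not a torchtitan API).
    x: (T, D) tokens; expert_ids: (T,) int64 expert index per token (top-1);
    expert_weights: (E, D, D) one weight matrix per expert."""
    num_experts = expert_weights.shape[0]
    # Sort tokens so that each expert's tokens are contiguous.
    order = torch.argsort(expert_ids)
    counts = torch.bincount(expert_ids, minlength=num_experts).tolist()
    x_sorted = x[order]
    out_sorted = torch.empty_like(x_sorted)
    start = 0
    for e, n in enumerate(counts):
        if n:
            # Each expert sees exactly n tokens; nothing is padded.
            out_sorted[start:start + n] = x_sorted[start:start + n] @ expert_weights[e]
        start += n
    # Scatter results back to the original token order.
    out = torch.empty_like(out_sorted)
    out[order] = out_sorted
    return out

torch.manual_seed(0)
x = torch.randn(10, 4)                     # 10 tokens, hidden size 4
expert_ids = torch.randint(0, 3, (10,))    # 3 experts, top-1 routing
expert_weights = torch.randn(3, 4, 4)
out = moe_forward_unpadded(x, expert_ids, expert_weights)
```

A real fast path would presumably feed the unpadded groups to a grouped GEMM kernel rather than a Python loop; the loop here only shows the data layout.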

**Router Dtype Restoration** (I did not fully follow this point, so I quote the original text)
> We identify a subtle implementation issue where the router’s sigmoid operation correctly casts to FP32 for numerical stability, but does not restore the original dtype afterward, causing subsequent matrix operations to execute in FP32.
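
My tentative reading of the quote, as a minimal PyTorch sketch. The function names and shapes are hypothetical and this is not the actual torchtitan router. If the FP32 sigmoid output is not cast back, PyTorch type promotion makes downstream tensors FP32 as well. Casting back to the input dtype keeps the rest of the computation in BF16.

```python
import torch

def scores_without_restore(logits):
    # Upcast for a numerically stable sigmoid; the result stays FP32.
    return torch.sigmoid(logits.to(torch.float32))

def scores_with_restore(logits):
    # Same FP32 sigmoid, then cast back to the incoming dtype.
    return torch.sigmoid(logits.to(torch.float32)).to(logits.dtype)

torch.manual_seed(0)
tokens = torch.randn(4, 16, dtype=torch.bfloat16)  # hypothetical BF16 activations
logits = torch.randn(4, 1, dtype=torch.bfloat16)   # hypothetical per-token router logit

# BF16 * FP32 promotes to FP32, so everything downstream runs in FP32.
weighted_leaky = tokens * scores_without_restore(logits)

# With the dtype restored, the product stays in BF16.
weighted_restored = tokens * scores_with_restore(logits)

print(weighted_leaky.dtype, weighted_restored.dtype)
```

The sketch uses an elementwise multiply to show the promotion. How the FP32 scores reach later matmuls in the real code path is something I have not verified.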


I wonder what your thoughts are on this, and whether we want to adopt them.
