Expert Parallel Fast Path and Router Dtype Restoration (from solar-open 102B) #2225

@rakkit

Hi, I was reading the Solar Open-102B report. They recently used torchtitan on B200 GPUs to train a roughly 100B-parameter MoE with 12B active parameters, and they mention two tricks that improved MoE training:

Expert Parallel Fast Path: basically, removing token padding on the non-EP path.
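To make the padding cost concrete, here is a small sketch under assumptions that are mine, not the report's. It assumes each expert's token group is padded up to a multiple of an alignment value before a grouped matmul. The helper `padded_rows`, the alignment `ALIGN = 8`, and the token counts are all hypothetical. It only counts how many extra rows padding would add, which is the work a no-padding fast path would avoid.

```python
# Hypothetical illustration: count the rows added by aligning each expert's
# token group to a multiple of ALIGN. Not torchtitan's actual implementation.

def padded_rows(tokens_per_expert, align):
    """Round each expert's token count up to the next multiple of `align`."""
    return [((n + align - 1) // align) * align for n in tokens_per_expert]

ALIGN = 8                          # assumed alignment, for illustration only
tokens_per_expert = [5, 0, 13, 8]  # made-up routing result for 4 experts

padded = padded_rows(tokens_per_expert, ALIGN)
real_total = sum(tokens_per_expert)
padded_total = sum(padded)
wasted = padded_total - real_total

print(f"real rows: {real_total}, padded rows: {padded_total}, wasted: {wasted}")
```

If the non-EP path does not need aligned groups, skipping this step saves both the padding copy and the compute spent on the extra rows.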

Router Dtype Restoration: I did not really get this point, so I quote the original text:

We identify a subtle implementation issue where the router’s sigmoid operation correctly casts to FP32 for numerical stability, but does not restore the original
dtype afterward, causing subsequent matrix operations to execute in FP32.
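My reading of the quoted text, as a minimal PyTorch sketch: if the router scores stay in FP32 after the sigmoid, type promotion makes any later op that mixes them with BF16 activations produce FP32 as well. The function names `route_without_restore` and `route_with_restore` and the tensor shapes are hypothetical. This is not the torchtitan or Solar Open code, just an illustration of the promotion behavior.

```python
import torch

def route_without_restore(x, gate_weight):
    # Sigmoid is computed in FP32 for stability; the result stays FP32.
    logits = x @ gate_weight.t()
    return torch.sigmoid(logits.to(torch.float32))

def route_with_restore(x, gate_weight):
    # Same FP32 sigmoid, but the scores are cast back to the input dtype.
    logits = x @ gate_weight.t()
    return torch.sigmoid(logits.to(torch.float32)).to(x.dtype)

torch.manual_seed(0)
x = torch.randn(4, 16, dtype=torch.bfloat16)            # token activations
gate_weight = torch.randn(8, 16, dtype=torch.bfloat16)  # router for 8 experts
expert_out = torch.randn(4, 16, dtype=torch.bfloat16)   # stand-in expert output

# Take each token's top-1 score, shape (4, 1), to weight the expert output.
top_fp32 = route_without_restore(x, gate_weight).max(dim=-1, keepdim=True).values
top_bf16 = route_with_restore(x, gate_weight).max(dim=-1, keepdim=True).values

weighted_fp32 = expert_out * top_fp32  # promoted to float32
weighted_bf16 = expert_out * top_bf16  # stays bfloat16

print(weighted_fp32.dtype, weighted_bf16.dtype)
```

Without the cast back, everything downstream of the router scores silently runs at FP32 width, which costs memory and throughput even though only the sigmoid needed the extra precision.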

I wonder what your thoughts are on this, and whether we want to adopt them.
