Hi, I was reading about Solar Open-102B, whose authors recently used torchtitan on B200 to train a 100B-class MoE (about 12B active parameters). They mentioned two tricks to improve MoE training:
Expert Parallel Fast Path: basically, removing token padding on the non-EP path.
Router Dtype Restoration (I did not really get this point, so I quote the original text):
We identify a subtle implementation issue where the router’s sigmoid operation correctly casts to FP32 for numerical stability, but does not restore the original
dtype afterward, causing subsequent matrix operations to execute in FP32.
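If I read the quote right, the issue would look roughly like the sketch below. This is a toy PyTorch example, not torchtitan's actual router: `route`, `gate_weight`, `expert_out`, and the `restore_dtype` flag are names I made up for illustration, and I use an elementwise multiply to show the dtype promotion rather than the real downstream ops.

```python
import torch
import torch.nn.functional as F


def route(x, gate_weight, top_k, restore_dtype=True):
    """Toy sigmoid router (hypothetical); returns top-k scores and expert indices."""
    # Router logits are computed in the activation dtype (bf16 here).
    logits = F.linear(x, gate_weight)
    # Sigmoid is done in FP32 for numerical stability.
    scores = torch.sigmoid(logits.to(torch.float32))
    if restore_dtype:
        # The fix as I understand it: cast back to the original dtype.
        scores = scores.to(x.dtype)
    top_scores, top_idx = torch.topk(scores, k=top_k, dim=-1)
    return top_scores, top_idx


torch.manual_seed(0)
x = torch.randn(8, 16, dtype=torch.bfloat16)            # 8 tokens, hidden dim 16
gate_weight = torch.randn(4, 16, dtype=torch.bfloat16)  # 4 experts
expert_out = torch.randn(8, 2, 16, dtype=torch.bfloat16)  # stand-in expert outputs

for restore in (False, True):
    top_scores, _ = route(x, gate_weight, top_k=2, restore_dtype=restore)
    # Type promotion: bf16 * fp32 yields fp32, so later ops inherit FP32
    # whenever the scores are left in FP32.
    weighted = expert_out * top_scores.unsqueeze(-1)
    print(f"restore_dtype={restore}: scores={top_scores.dtype}, weighted={weighted.dtype}")
```

Without the cast back, everything that consumes the scores gets promoted to FP32, which is presumably where the extra cost comes from.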
I wonder what your thoughts are on this, and whether we want to adopt them.