- [x] Support normal DeepEP buffer @liz-badada #4232
- [x] Support DeepEP with async transfer @fzyzcjy #4610
- [x] Support low-latency DeepEP buffer
  - [x] Single-node TP @liz-badada #4767
    - MaskedDeepGeMM is implemented by @laixinn @sleepcoo
    - Improved by @yuleil #5277
  - [x] Multi-node TP @liz-badada #5068
- [x] Support PD disaggregation @ch-wan #5435
- [ ] Integrate pplx-kernels @ruizhang1230 #7272
- [ ] Optimize permutation overhead
  - [x] Implement Triton kernels @xutizhou #4643
  - [ ] Fuse permutation with GroupedGeMM
- [x] Extend parallelism paradigm
  - [x] Extend DeepEP to a general TP paradigm @ch-wan @tarinkk #4770
    - Fixed by @fzyzcjy #4883
  - [x] Support `tp_size < ep_size`
    - `tp_size=1` @fzyzcjy #4836
- [x] Overlap two batches @fzyzcjy #4068
- [x] Integrate continuous DeepGeMM @sleepcoo @xutizhou #5626
- [x] Record expert distribution @yuhsuan-t #4435
  - Improved by @fzyzcjy #4957
- [ ] Overlap communication with shared experts' computation @liz-badada #5829
- [x] Integrate EPLB @fzyzcjy #5295

Others

- The DeepSeek team is going to release a permutation kernel shortly. We may need to check their update: https://github.com/deepseek-ai/DeepGEMM/issues/57#issuecomment-2720514270