Tensor parallelism for Mixture of Experts #2074
Conversation
Let's add some functional unit tests to ensure the new code paths are triggered in our tests. This will ensure things are at least functionally working in the future. It would be great to have basic correctness unit tests as well, but we can discuss that offline.
@jeffra I have added some tests in
Hi, I tested the tensor parallelism for MoE. The loss curve is still higher, and it then goes to NaN after about 600 million tokens.

Note: This PR is used in conjunction with a companion PR on the Megatron-Deepspeed repo.
This PR adds tensor parallelism for the non-expert parameters. Combined with ZeRO-2, this allows us to scale to roughly 2x larger base models than ZeRO-2 alone. When tensor parallelism is enabled only for non-experts, each gate sees duplicate copies of the same tokens. It is important to drop these duplicates before they reach the experts; otherwise we run into convergence issues. In the current implementation, we drop tokens right before the AlltoAll and gather them right after the AlltoAll. These calls are made in sharded_moe.py.

Update: This PR now supports tensor parallelism for experts as well.
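To make the drop/gather placement concrete, here is a minimal PyTorch sketch of the pattern described above. It is not the code from sharded_moe.py; the helper names and the `tp_group` / `ep_group` handles are assumptions for illustration, and it presumes the token count divides evenly across the tensor-parallel ranks.

```python
# Minimal sketch (not the PR's exact implementation): drop the replicated tokens
# across the tensor-parallel group before the expert all-to-all, then gather them
# back right after it. `tp_group`, `ep_group`, and the function names are
# hypothetical placeholders used only for this illustration.
import torch
import torch.distributed as dist


def drop_duplicate_tokens(tokens: torch.Tensor, tp_group) -> torch.Tensor:
    """Keep only this rank's slice of the tokens, since every tensor-parallel
    rank holds an identical copy of the gate input."""
    tp_world_size = dist.get_world_size(group=tp_group)
    if tp_world_size == 1:
        return tokens
    tp_rank = dist.get_rank(group=tp_group)
    # Split along the token dimension and keep only the local chunk.
    return tokens.chunk(tp_world_size, dim=0)[tp_rank].contiguous()


def gather_tokens(tokens: torch.Tensor, tp_group) -> torch.Tensor:
    """Reconstruct the full (duplicated) token tensor on every tensor-parallel rank."""
    tp_world_size = dist.get_world_size(group=tp_group)
    if tp_world_size == 1:
        return tokens
    gathered = [torch.empty_like(tokens) for _ in range(tp_world_size)]
    dist.all_gather(gathered, tokens, group=tp_group)
    return torch.cat(gathered, dim=0)


def dispatch_to_experts(dispatched_input: torch.Tensor, tp_group, ep_group) -> torch.Tensor:
    """Drop duplicates -> all-to-all across expert-parallel ranks -> gather back."""
    local_input = drop_duplicate_tokens(dispatched_input, tp_group)
    output = torch.empty_like(local_input)
    dist.all_to_all_single(output, local_input, group=ep_group)
    return gather_tokens(output, tp_group)
```

The key point the sketch captures is that the drop happens immediately before the AlltoAll and the gather immediately after it, so each expert sees each token exactly once while the surrounding tensor-parallel layers still operate on the full replicated activations.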