Describe the bug
It appears that with ZeRO 2, MoE layer expert parameters are replicated across all ranks and share a single tensor storage (at different offsets).
Since each MoE expert is saved to its own file on every rank, the default torch.save() behavior serializes the entire underlying storage, even though a given expert only covers a fraction of it.
Thus we end up with the following file structure per layer N:
layer_N_expert_E_mp_rank_00_model_states.pt
...
layer_N_expert_2_mp_rank_00_model_states.pt
layer_N_expert_1_mp_rank_00_model_states.pt
layer_N_expert_0_mp_rank_00_model_states.pt
Each file contains the storage for all E experts of layer N, while the tensors inside it only use a fraction of that storage.
When loaded with torch.load(), the tensors from each file end up with their own independent storage.
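The following is a minimal, standalone sketch of that behavior in plain PyTorch (hypothetical sizes and file names, not DeepSpeed code): saving a view serializes the whole shared storage, and loading it back produces a tensor with its own full-size copy.

```python
import os
import torch

# A large flat buffer standing in for the shared parameter storage.
flat = torch.zeros(1_000_000)

# A small view into it, analogous to one expert's parameters sharing
# the flat buffer's storage at some offset.
expert_view = flat[:1000]

torch.save({"weight": expert_view}, "expert_view.pt")

# The file carries the full 1M-element storage, not just the 1000
# elements the view covers, so it is roughly 4 MB instead of ~4 KB.
print(os.path.getsize("expert_view.pt"))

# After loading, the tensor has its own full-size storage, independent
# of anything else that used to share it.
loaded = torch.load("expert_view.pt")
print(loaded["weight"].numel())           # 1000
print(loaded["weight"].storage().size())  # 1000000
```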
It seems like calling .clone().detach() on the MoE expert tensors before saving them to individual files should address the issue (a sketch follows below).
But I was wondering about the reason for writing each expert to an individual file on all ranks. E.g. if we stored a single file per expert layer for all ranks, the storage would be reused by the loaded tensors and that single file would not be bloated.
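As a minimal sketch of that fix, assuming per-expert state dicts are flat dicts saved with torch.save() (save_compact is a hypothetical helper name, not an existing DeepSpeed API):

```python
import torch

def save_compact(state_dict, path):
    """Hypothetical helper: clone each tensor so torch.save() serializes
    only the elements the tensor actually covers, rather than its full
    shared storage."""
    compact = {
        key: value.clone().detach() if torch.is_tensor(value) else value
        for key, value in state_dict.items()
    }
    torch.save(compact, path)

# Hypothetical usage for a single expert's state dict:
# save_compact(expert_state_dict, "layer_N_expert_0_mp_rank_00_model_states.pt")
```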
I'll provide a test shortly, willing to contribute the fix.
To Reproduce
Steps to reproduce the behavior: save an MoE model checkpoint with ZeRO 2 and compare the on-disk size of each layer_N_expert_E_mp_rank_00_model_states.pt file against the size of that expert's parameters. A reproduction test will follow (see above).
Expected behavior
Checkpoint files should be compact, with no redundant storage.
System info:
- OS: Amazon Linux 4.14.252-195.483.amzn2.x86_64
- GPU count and types: 8x A100 on a single machine
- Interconnects (if applicable): single machine
- Python version: 3.8
- Any other relevant info about your setup: DeepSpeed 0.6.0, 0.7.4
Launcher context
DeepSpeed runner/launcher
Docker context
Custom docker image
Additional context
Using MoE layers, ZeRO 2