
[BUG] Bloated MoE expert checkpoint files with ZeRO 2 #2389

@clumsy

Description

Describe the bug
It appears that with ZeRO 2, MoE layer expert parameters are replicated across all ranks as views into a single shared tensor storage (at different offsets).
Since each MoE layer expert is saved to its own file on every rank, the default torch.save() behavior serializes the entire underlying storage, even though a given expert only covers a fraction of it.

Thus we end up with the following file structure per layer N:

layer_N_expert_E_mp_rank_00_model_states.pt
...
layer_N_expert_2_mp_rank_00_model_states.pt
layer_N_expert_1_mp_rank_00_model_states.pt
layer_N_expert_0_mp_rank_00_model_states.pt

Each file contains the storage for all E experts of layer N, even though the tensors in that file only use a fraction of it.
When torch.load()-ed, each one ends up as a tensor with its own independent copy of that full storage.
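
For illustration, here is a minimal standalone sketch of the torch.save() behavior described above (toy tensor sizes and file names for illustration only, not the actual DeepSpeed checkpointing code):

```python
import os
import torch

# Simulate the ZeRO-2 layout: one flat buffer shared by all experts of a layer,
# each expert's parameters being a narrow view at a different offset.
flat = torch.empty(1_000_000)               # ~4 MB of float32 shared storage
expert_0 = flat.narrow(0, 0, 1_000)         # view covering only 1,000 elements

# torch.save() serializes the tensor's entire underlying storage, not just the
# view, so this file is ~4 MB even though expert_0 only needs ~4 KB.
torch.save(expert_0, "expert_0_view.pt")
print(os.path.getsize("expert_0_view.pt"))  # roughly 4 MB

# On load, the tensor comes back with its own full-size copy of that storage.
loaded = torch.load("expert_0_view.pt")
print(loaded.storage().size())              # 1_000_000, not 1_000
```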

It seems like using .clone().detach() when saving MoE expert tensors to individual files should address the issue.
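
A sketch of that fix on the same toy setup (again just an illustration, not the actual DeepSpeed saving path):

```python
import os
import torch

flat = torch.empty(1_000_000)
expert_0 = flat.narrow(0, 0, 1_000)

# clone() gives the expert its own 1,000-element storage (detach() drops any
# autograd history), so the serialized file only contains this expert's data.
torch.save(expert_0.clone().detach(), "expert_0_clone.pt")
print(os.path.getsize("expert_0_clone.pt"))  # ~4 KB instead of ~4 MB
```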

That said, I was wondering about the reason for writing each expert to an individual file on every rank. If, for example, we stored a single file per expert layer for all ranks, the shared storage would be reused by the loaded tensors and that single file would not be bloated (see the sketch below).
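
A toy sketch of that alternative, assuming we rely on torch.save() writing a storage shared by several tensors only once within a single call (hypothetical file name):

```python
import os
import torch

# Flat buffer sized exactly for 8 experts of 1,000 parameters each.
flat = torch.empty(8 * 1_000)
experts = {f"expert_{e}": flat.narrow(0, e * 1_000, 1_000) for e in range(8)}

# Saved together, the shared storage is written once, so this file is ~32 KB
# total, instead of 8 separate per-expert files of ~32 KB each.
torch.save(experts, "layer_N_all_experts.pt")
print(os.path.getsize("layer_N_all_experts.pt"))

# After loading, the expert tensors share a single storage again.
loaded = torch.load("layer_N_all_experts.pt")
print(loaded["expert_0"].storage().data_ptr() == loaded["expert_7"].storage().data_ptr())  # True
```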

I'll provide a test shortly and am willing to contribute the fix.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
Checkpoint files should be compact, with no redundant storage.

ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: Amazon Linux 4.14.252-195.483.amzn2.x86_64
  • GPU count and types: 8x A100 on a single machine
  • Interconnects (if applicable): single machine
  • Python version: 3.8
  • Any other relevant info about your setup: DeepSpeed 0.6.0, 0.7.4

Launcher context
DeepSpeed runner/launcher

Docker context
Custom docker image

Additional context
Using MoE layers, ZeRO 2
