Describe the bug
It appears that with ZeRO 2, MoE layer expert parameters are replicated across all ranks and share a single tensor storage (at different offsets).
Since each MoE expert is saved to its own file on every rank, the default torch.save() behavior serializes the entire underlying storage, even though a given expert only covers a fraction of it.
Thus we end up with the following file structure per layer N:
layer_N_expert_E_mp_rank_00_model_states.pt
...
layer_N_expert_2_mp_rank_00_model_states.pt
layer_N_expert_1_mp_rank_00_model_states.pt
layer_N_expert_0_mp_rank_00_model_states.pt
Each file contains the storage for all E experts of layer N, while the tensors inside it only use a fraction of that storage.
When loaded with torch.load(), the tensors from each file end up with their own independent storage.
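The following is a minimal, standalone sketch of that behavior in plain PyTorch (hypothetical sizes and file names, not DeepSpeed code): saving a view serializes the whole shared storage, and loading it back produces a tensor with its own full-size copy.

```python
import os
import torch

# A large flat buffer standing in for the shared parameter storage.
flat = torch.zeros(1_000_000)

# A small view into it, analogous to one expert's parameters sharing
# the flat buffer's storage at some offset.
expert_view = flat[:1000]

torch.save({"weight": expert_view}, "expert_view.pt")

# The file carries the full 1M-element storage, not just the 1000
# elements the view covers, so it is roughly 4 MB instead of ~4 KB.
print(os.path.getsize("expert_view.pt"))

# After loading, the tensor has its own full-size storage, independent
# of anything else that used to share it.
loaded = torch.load("expert_view.pt")
print(loaded["weight"].numel())           # 1000
print(loaded["weight"].storage().size())  # 1000000
```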
It seems like calling .clone().detach() on the MoE expert tensors before saving them to individual files should address the issue (a sketch follows below).
But I was wondering about the reason for writing each expert to an individual file on all ranks. E.g. if we stored a single file per expert layer for all ranks, the storage would be reused by the loaded tensors and that single file would not be bloated.
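As a minimal sketch of that fix, assuming per-expert state dicts are flat dicts saved with torch.save() (save_compact is a hypothetical helper name, not an existing DeepSpeed API):

```python
import torch

def save_compact(state_dict, path):
    """Hypothetical helper: clone each tensor so torch.save() serializes
    only the elements the tensor actually covers, rather than its full
    shared storage."""
    compact = {
        key: value.clone().detach() if torch.is_tensor(value) else value
        for key, value in state_dict.items()
    }
    torch.save(compact, path)

# Hypothetical usage for a single expert's state dict:
# save_compact(expert_state_dict, "layer_N_expert_0_mp_rank_00_model_states.pt")
```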
I'll provide a test shortly, willing to contribute the fix.
To Reproduce
Steps to reproduce the behavior: save an MoE model checkpoint with ZeRO 2 and compare the on-disk size of each layer_N_expert_E_mp_rank_00_model_states.pt file against the size of that expert's parameters. A reproduction test will follow (see above).
Expected behavior
Checkpoint files should be compact, with no redundant storage.
System info:
- OS: Amazon Linux 4.14.252-195.483.amzn2.x86_64
- GPU count and types: 8x A100 on a single machine
- Interconnects (if applicable): single machine
- Python version: 3.8
- Any other relevant info about your setup: DeepSpeed 0.6.0, 0.7.4
Launcher context
DeepSpeed runner/launcher
Docker context
Custom docker image
Additional context
Using MoE layers, ZeRO 2