🚀 Feature
Make the consolidated state dict partition-independent, i.e. independent of how the state was sharded across ranks when it was saved.
Motivation
This would make it possible to change the number of hosts, and therefore the number of ranks, when restarting a job.
Pitch
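state_dict() should flatten the per-rank data into a single rank-agnostic structure, and load_state_dict() should shard that structure out again for the current number of ranks, instead of storing and restoring data per rank.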
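A minimal sketch of the idea, in plain Python and not tied to any existing API. The names `consolidate` and `reshard`, the per-rank layout (a dict keyed by a global parameter index), and the round-robin partitioning are all assumptions made for illustration; the real partitioning scheme may differ.

```python
from typing import Any, Dict, List

# Hypothetical layout: each rank holds {global_param_index: per-parameter state}.
RankState = Dict[int, Dict[str, Any]]


def consolidate(per_rank_states: List[RankState]) -> RankState:
    """Merge per-rank shards into one dict keyed by global parameter index."""
    merged: RankState = {}
    for rank_state in per_rank_states:
        for param_index, param_state in rank_state.items():
            if param_index in merged:
                raise ValueError(
                    f"parameter {param_index} is owned by more than one rank"
                )
            merged[param_index] = param_state
    # Sort so the result does not depend on the order in which ranks were visited.
    return dict(sorted(merged.items()))


def reshard(consolidated: RankState, world_size: int) -> List[RankState]:
    """Split a consolidated dict across world_size ranks.

    Round-robin assignment is an assumption made for this sketch.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    shards: List[RankState] = [{} for _ in range(world_size)]
    for position, (param_index, param_state) in enumerate(consolidated.items()):
        shards[position % world_size][param_index] = param_state
    return shards


# State saved by a job with 2 ranks, restored by a job with 3 ranks.
saved = [
    {0: {"step": 10}, 2: {"step": 10}},
    {1: {"step": 10}, 3: {"step": 10}},
]
flat = consolidate(saved)
restored = reshard(flat, world_size=3)
```

Because the consolidated dict carries no rank information, the same checkpoint can be re-sharded for any world size, and consolidating the re-sharded state gives back the same dict.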
Alternatives
Keep the current behavior, which requires the same number of ranks before and after a restart.
Additional context
This captures elements of a discussion with the DeepSpeed team at Microsoft.