OSS: Make the checkpoints partition-agnostic #164

@blefaudeux

🚀 Feature

Change the consolidated optimizer state dict so that it no longer depends on how the state was partitioned across ranks.

Motivation

This would make it possible to change the number of hosts when restarting a job from a checkpoint.

Pitch

`state_dict()` and `load_state_dict()` need to flatten the state when saving and re-shard it when loading, instead of storing data per rank.
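One way to picture the flatten-then-reshard idea is the minimal sketch below. It is not the OSS implementation: the helper names `flatten_shards` and `reshard`, the per-rank layout (a dict keyed by a global parameter index), and the round-robin assignment are all assumptions made for illustration, and plain Python values stand in for tensors.

```python
from typing import Any, Dict, List

# Hypothetical layout: each rank's shard maps a global parameter index to that
# parameter's optimizer state. This is an assumption for illustration, not the
# actual checkpoint format.
RankShard = Dict[int, Dict[str, Any]]


def flatten_shards(per_rank: List[RankShard]) -> Dict[int, Dict[str, Any]]:
    """Merge per-rank shards into one dict keyed by global parameter index."""
    flat: Dict[int, Dict[str, Any]] = {}
    for rank, shard in enumerate(per_rank):
        for param_index, state in shard.items():
            if param_index in flat:
                raise ValueError(
                    f"parameter {param_index} appears again on rank {rank}"
                )
            flat[param_index] = state
    return flat


def reshard(flat: Dict[int, Dict[str, Any]], world_size: int) -> List[RankShard]:
    """Split a flat state across `world_size` ranks.

    Round-robin by sorted parameter index is an assumed policy; a real
    partitioner could balance by parameter size instead.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    shards: List[RankShard] = [{} for _ in range(world_size)]
    for position, param_index in enumerate(sorted(flat)):
        shards[position % world_size][param_index] = flat[param_index]
    return shards


# Checkpoint written by a 2-rank job, restored by a 3-rank job.
saved = [
    {0: {"step": 10}, 2: {"step": 10}},  # rank 0
    {1: {"step": 10}, 3: {"step": 10}},  # rank 1
]
flat = flatten_shards(saved)
restored = reshard(flat, world_size=3)
```

Because the saved form is keyed by parameter rather than by rank, flattening the restored shards yields the same flat state again, regardless of the new rank count.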

Alternatives

Keep the current behavior, which requires the same number of ranks before and after a restart.

Additional context

This captures elements of a discussion with the DeepSpeed team at Microsoft.

Labels: enhancement (New feature or request)
