Description
Motivation:
Currently, when using the Transformers library together with DeepSpeed to train large language models, checkpoints (e.g. `bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt`) are automatically saved along with the `rng_state`, which can lead to significant disk space usage. When multiple GPUs are employed for training, this can quickly become a storage bottleneck, especially on storage shared by a team. Sometimes we only want to keep the model weight files (e.g. `pytorch_model-00001-of-00002.bin`), since they are sufficient to load the model again.
Feature Request:
I propose adding a configurable option to decide whether to store the checkpoint and rng_state during training. This will give users the flexibility to choose when to save checkpoints and reduce the disk space required.
Proposed Solution:
- Add a new parameter, such as `save_checkpoint_enabled`, to the DeepSpeed configuration file. Users can set this parameter to `True` or `False` to control whether checkpoints and `rng_state` should be saved during training.
- Modify the `trainer.py` script in the Transformers library to include a condition for `self.save_checkpoint_enabled` in the `_save_checkpoint` function. Here's a code snippet illustrating the change:

  ```python
  if self.is_deepspeed_enabled and self.save_checkpoint_enabled:
      # Save the checkpoint
      ...
  ```
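For illustration, a DeepSpeed config fragment with the proposed flag might look like the following. Note that `save_checkpoint_enabled` does not exist in DeepSpeed today; it is the hypothetical key this request proposes, shown alongside real config sections:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "save_checkpoint_enabled": false
}
```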
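The two steps above can be sketched as a minimal, self-contained mock of the guard logic. This is not the actual `Trainer` implementation; `save_checkpoint_enabled` is the hypothetical flag proposed here, and the save methods are stand-ins for the real DeepSpeed and rng-state saving calls:

```python
class TrainerSketch:
    """Minimal mock of the proposed guard in Trainer._save_checkpoint."""

    def __init__(self, is_deepspeed_enabled, save_checkpoint_enabled):
        self.is_deepspeed_enabled = is_deepspeed_enabled
        # Hypothetical flag read from the DeepSpeed config file.
        self.save_checkpoint_enabled = save_checkpoint_enabled
        self.saved_files = []  # records what would be written to disk

    def _save_deepspeed_checkpoint(self, output_dir):
        # Stand-in for the DeepSpeed engine checkpoint, which writes
        # optimizer/engine states such as
        # bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
        self.saved_files.append(f"{output_dir}/global_step/optim_states.pt")

    def _save_rng_state(self, output_dir):
        self.saved_files.append(f"{output_dir}/rng_state.pth")

    def _save_checkpoint(self, output_dir):
        if self.is_deepspeed_enabled and not self.save_checkpoint_enabled:
            # Skip the DeepSpeed engine states and rng_state entirely;
            # the weight files (pytorch_model-*.bin) are saved elsewhere.
            return
        if self.is_deepspeed_enabled:
            self._save_deepspeed_checkpoint(output_dir)
        self._save_rng_state(output_dir)


trainer = TrainerSketch(is_deepspeed_enabled=True, save_checkpoint_enabled=False)
trainer._save_checkpoint("ckpt-100")
print(trainer.saved_files)  # → []
```

With the flag set to `False`, nothing is written besides the weight files, which is exactly the disk-saving behavior requested above.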
This change will allow users to save disk space by not storing checkpoints when not needed, and it can help alleviate the storage challenges associated with large-scale language model training.
I have already submitted this issue to the DeepSpeed repository (deepspeedai/DeepSpeed#4403), as this feature may require collaboration between both libraries.