Add an option to decide whether to store the checkpoint and rng_state. #26706

@timturing

Motivation:
Currently, when using the Transformers library together with DeepSpeed to train large language models (LLMs), DeepSpeed checkpoints (e.g. bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt) are automatically saved along with the rng_state, which can consume significant disk space. When multiple GPUs are used for training, this can quickly become a storage bottleneck, especially on storage shared by a team. Sometimes we only want to keep the .bin weight files (e.g. pytorch_model-00001-of-00002.bin), since they are enough to load the model again.

Feature Request:
I propose adding a configurable option to decide whether to store the checkpoint and rng_state during training. This will give users the flexibility to choose when to save checkpoints and reduce the disk space required.

Proposed Solution:

  1. Add a new parameter, such as save_checkpoint_enabled, to the DeepSpeed configuration file. Users can set this parameter to True or False to control whether checkpoints and rng_state should be saved during training.

  2. Modify the trainer.py script in the Transformers library to include a condition for self.save_checkpoint_enabled in the _save_checkpoint function. Here's a code snippet illustrating the change:

    if self.is_deepspeed_enabled and self.save_checkpoint_enabled:
        # Save the DeepSpeed checkpoint only when the new option is enabled
        ...

This change will allow users to save disk space by not storing checkpoints when not needed, and it can help alleviate the storage challenges associated with large-scale language model training.
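
To make the intended behavior concrete, here is a minimal, self-contained sketch of how such a flag could gate what gets written at save time. It does not use the real Trainer or DeepSpeed code paths: `save_checkpoint_enabled` is the hypothetical option proposed above, and `CheckpointOptions`, `from_config`, `artifacts_to_save`, and the artifact labels are illustrative names invented for this example only.

```python
from dataclasses import dataclass


@dataclass
class CheckpointOptions:
    # Hypothetical flag from this proposal; it is not an existing
    # Transformers or DeepSpeed option.
    save_checkpoint_enabled: bool = True

    @classmethod
    def from_config(cls, config: dict) -> "CheckpointOptions":
        # Default to True so configs without the key keep today's behavior.
        return cls(bool(config.get("save_checkpoint_enabled", True)))


def artifacts_to_save(is_deepspeed_enabled: bool, opts: CheckpointOptions) -> list:
    """Return illustrative labels for what a save step would write."""
    # The model weights are always kept, since they are enough to reload the model.
    artifacts = ["model_weights"]
    if opts.save_checkpoint_enabled:
        # Optimizer states are only produced by DeepSpeed when it is enabled.
        if is_deepspeed_enabled:
            artifacts.append("deepspeed_optim_states")
        artifacts.append("rng_state")
    return artifacts


if __name__ == "__main__":
    # Assumed config fragment carrying the proposed key.
    opts = CheckpointOptions.from_config({"save_checkpoint_enabled": False})
    print(artifacts_to_save(True, opts))
```

With the flag set to false, only the model weights remain in the list. Defaulting the flag to true is a deliberate choice in this sketch so that existing setups would not change behavior unless the user opts out. Note that skipping the optimizer states and rng_state would mean training cannot be resumed exactly from such a save.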

I have also raised this issue in the DeepSpeed repository (deepspeedai/DeepSpeed#4403 (comment)), as this feature may require collaboration between both libraries.
