Description
Motivation:
Currently, when using the Transformers library together with DeepSpeed to train large language models, checkpoints (e.g. `bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt`) are automatically saved along with the `rng_state`, which can lead to significant disk space usage. When multiple GPUs are employed for training, this can quickly become a storage bottleneck, especially on storage shared by a team. Sometimes we only want to keep the model weight files (e.g. `pytorch_model-00001-of-00002.bin`), since they are sufficient to load the model again.
Feature Request:
I propose adding a configurable option to decide whether to store the checkpoint and rng_state during training. This will give users the flexibility to choose when to save checkpoints and reduce the disk space required.
Proposed Solution:
- Add a new parameter, such as `save_checkpoint_enabled`, to the DeepSpeed configuration file. Users can set this parameter to `True` or `False` to control whether checkpoints and `rng_state` should be saved during training.
- Modify the `trainer.py` script in the Transformers library to include a condition for `self.save_checkpoint_enabled` in the `_save_checkpoint` function. Here's a code snippet illustrating the change:

  ```python
  if self.is_deepspeed_enabled and self.save_checkpoint_enabled:
      # Save the checkpoint
      ...
  ```
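For illustration, a DeepSpeed config fragment with the proposed flag might look like the following. Note that `save_checkpoint_enabled` does not exist in DeepSpeed today; it is the hypothetical key this request proposes, shown alongside real config sections:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "save_checkpoint_enabled": false
}
```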
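The two steps above can be sketched as a minimal, self-contained mock of the guard logic. This is not the actual `Trainer` implementation; `save_checkpoint_enabled` is the hypothetical flag proposed here, and the save methods are stand-ins for the real DeepSpeed and rng-state saving calls:

```python
class TrainerSketch:
    """Minimal mock of the proposed guard in Trainer._save_checkpoint."""

    def __init__(self, is_deepspeed_enabled, save_checkpoint_enabled):
        self.is_deepspeed_enabled = is_deepspeed_enabled
        # Hypothetical flag read from the DeepSpeed config file.
        self.save_checkpoint_enabled = save_checkpoint_enabled
        self.saved_files = []  # records what would be written to disk

    def _save_deepspeed_checkpoint(self, output_dir):
        # Stand-in for the DeepSpeed engine checkpoint, which writes
        # optimizer/engine states such as
        # bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
        self.saved_files.append(f"{output_dir}/global_step/optim_states.pt")

    def _save_rng_state(self, output_dir):
        self.saved_files.append(f"{output_dir}/rng_state.pth")

    def _save_checkpoint(self, output_dir):
        if self.is_deepspeed_enabled and not self.save_checkpoint_enabled:
            # Skip the DeepSpeed engine states and rng_state entirely;
            # the weight files (pytorch_model-*.bin) are saved elsewhere.
            return
        if self.is_deepspeed_enabled:
            self._save_deepspeed_checkpoint(output_dir)
        self._save_rng_state(output_dir)


trainer = TrainerSketch(is_deepspeed_enabled=True, save_checkpoint_enabled=False)
trainer._save_checkpoint("ckpt-100")
print(trainer.saved_files)  # → []
```

With the flag set to `False`, nothing is written besides the weight files, which is exactly the disk-saving behavior requested above.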
This change will allow users to save disk space by not storing checkpoints when not needed, and it can help alleviate the storage challenges associated with large-scale language model training.
I have already submitted this issue to the DeepSpeed repository (deepspeedai/DeepSpeed#4403), as this feature may require collaboration between both libraries.