deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler #25863
Conversation
BenjaminBossan
left a comment
Thanks a lot for tackling the issue you described and for conducting (and showing) the experiments that prove it works. Personally, I lack the deepspeed experience needed to see the bigger picture, so I cannot provide a full review, only some small comments.
muellerzr
left a comment
Looks good to me, thanks! Let's definitely keep an eye out for pickle problems, and be prepared to move that to a util if needed
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)

* Add support for deepspeed optimizer and HF scheduler
* fix bug
* fix the import
* fix issue with deepspeed scheduler saving for hf optim + hf scheduler scenario
* fix loading of hf scheduler when loading deepspeed checkpoint
* fix import of `DeepSpeedSchedulerWrapper`
* add tests
* add the comment and skip the failing tests
* address comment
What does this PR do?

This PR fixes resuming from a DeepSpeed checkpoint and adds support for combining a DeepSpeed optimizer with an HF LRScheduler. It should be merged after "Add support for deepspeed optimizer and custom scheduler" (accelerate#1909).

Below we run the 4 combinations of optimizer and scheduler for the `run_glue.py` transformers example.

Initial setup:
a. HF Optimizer + HF Scheduler Case:
i. ds config `ds_config_z3_hf_optim_hf_scheduler.json`:

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```

ii. command to run:
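The exact command was not captured above. A hypothetical sketch of a typical `deepspeed` launch of the transformers `run_glue.py` example against the config above (the model, task, and hyperparameter values here are illustrative assumptions, not the ones used in these experiments):

```shell
# Illustrative launch only; model/task/hyperparameters are assumptions.
deepspeed examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/mrpc_out \
  --deepspeed ds_config_z3_hf_optim_hf_scheduler.json
```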
Kill the process after epoch 1, then run the above command with `--resume_from_checkpoint` as below:

iii. Plots of loss and learning rate:

b. DS Optimizer + DS Scheduler Case:
i. ds config `ds_config_z3_ds_optim_ds_scheduler.json`

Rest of the steps are the same as above. Plots:

c. HF Optimizer + DS Scheduler Case:
i. ds config `ds_config_z3_hf_optim_ds_scheduler.json`

Rest of the steps are the same as above. Plots:

d. DS Optimizer + HF Scheduler Case:
i. ds config `ds_config_z3_ds_optim_hf_scheduler.json`

Rest of the steps are the same as above. Plots:
