only main process should call _save on deepspeed zero3#25959
only main process should call _save on deepspeed zero3#25959amyeroberts merged 1 commit intohuggingface:mainfrom zjjMaiMai:fix_save_deepspeed_3
Conversation
src/transformers/trainer.py
Outdated
There was a problem hiding this comment.
This does not feel like the right solution, as should_save will check if we are the main process or not, and we should only ever be saving once. You can have self.model_wrapped.save_checkpoint under that self.args.should_save I believe, but I'm not sure this is the right "fix". Definitely cc @pacman100 here
There was a problem hiding this comment.
self.model_wrapped.save_checkpoint needs to be called on each process, because each process only have part of model weight under deepspeed zero3.
There was a problem hiding this comment.
Got it, thanks for the clarification!
pacman100
left a comment
There was a problem hiding this comment.
Hello, Thank you for the fix! However, it needs correction as explained in the comment,
src/transformers/trainer.py
Outdated
There was a problem hiding this comment.
This would remove the legitimate model checkpoint when stage3_gather_16bit_weights_on_model_save=True. remove_dummy_checkpoint should only be called in the exception block.
There was a problem hiding this comment.
thanks for your explanation, it has been fixed.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
|
@pacman100 any things i need to do? |
pacman100
left a comment
There was a problem hiding this comment.
Thank you for iterating, LGTM!
|
@zjjMaiMai One of the hub tests are failing, complaining that the base_model is empty when pushing to the hub. Could you try running this test locally to see whether it's a result of the changes in this PR? |
|
|
@zjjMaiMai Could you try and rebase on main? This should resolve the failing tests. |
|
All green! @amyeroberts |
…5959) only main process should call _save when deepspeed zero3
Background
trainer._savecall on all process after #25817. will raiseFileExistsErrorwhen model save.What does this PR do?
this pr fix it,
trainer._savewill call on main process only.