The Trainer does not save the tokenizer or `config.json` when training under DeepSpeed ZeRO-3 with `stage3_gather_16bit_weights_on_model_save=False`.

Line 2776 raises a `ValueError`, so line 2778 (`self._save`) never runs, and the tokenizer and other files are never written. Is this the expected behavior?
transformers/src/transformers/trainer.py
Lines 2771 to 2784 in d4bd33c
```python
elif self.is_deepspeed_enabled:
    # this takes care of everything as long as we aren't under zero3
    if version.parse(accelerate_version) <= version.parse("0.20.3"):
        raise ValueError("Install Accelerate from main branch")
    try:
        state_dict = self.accelerator.get_state_dict(self.deepspeed)
        if self.args.should_save:
            self._save(output_dir, state_dict=state_dict)
    except ValueError:
        logger.warning(
            " stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use"
            " zero_to_fp32.py to recover weights"
        )
        self.model_wrapped.save_checkpoint(output_dir)
```
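To make the control-flow problem concrete, here is a minimal standalone sketch (all names are hypothetical stand-ins, not the real transformers internals): because `self._save` is called *inside* the `try` block, a `ValueError` raised while gathering the state dict skips it entirely, and only the raw DeepSpeed checkpoint gets written.

```python
# Hypothetical sketch of the excerpt's control flow, not the actual Trainer code.
saved = []

def get_state_dict(gather_16bit_weights):
    # Stands in for accelerator.get_state_dict, which raises when
    # stage3_gather_16bit_weights_on_model_save is False under ZeRO-3.
    if not gather_16bit_weights:
        raise ValueError("weights were not gathered on save")
    return {"weights": "..."}

def _save(state_dict):
    # Stands in for Trainer._save, which also writes the tokenizer and config.json.
    saved.extend(["model", "tokenizer", "config.json"])

def save_model(gather_16bit_weights):
    try:
        state_dict = get_state_dict(gather_16bit_weights)
        _save(state_dict)  # never reached when get_state_dict raises
    except ValueError:
        # Only the DeepSpeed checkpoint is written; tokenizer/config are skipped.
        saved.append("deepspeed_checkpoint")

save_model(gather_16bit_weights=False)
print(saved)  # → ['deepspeed_checkpoint']
```

Moving the tokenizer/config saving out of the `try` block (or repeating it in the `except` branch) would avoid losing those files when the gather fails.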
Originally posted by @zjjMaiMai in #24728 (comment)