🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨#24028
Conversation
```python
for _ in train_dataloader:
    break
```
Is this still necessary with the RNG reloading?
Edit: ah looks like we are not doing that yet.
Yep not yet, so for now it's needed :)
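The snippet under discussion iterates (and discards) batches purely to advance the sampler's RNG state when resuming mid-epoch. A minimal pure-Python sketch of that idea (hypothetical helper name; Accelerate's actual resume logic differs):

```python
def skip_first_batches(dataloader, num_batches):
    """Yield batches from `dataloader`, skipping the first `num_batches`.

    Consuming the early batches advances the sampler/RNG to the same
    state it would have had mid-epoch, so training resumes on the
    correct data without re-seeing already-trained batches.
    """
    for i, batch in enumerate(dataloader):
        if i < num_batches:
            continue  # consume to advance RNG state, but don't train on it
        yield batch

batches = [[0, 1], [2, 3], [4, 5]]
remaining = list(skip_first_batches(batches, 2))  # only the third batch is left
```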
The documentation is not available anymore as the PR was closed or merged.
Thank you @muellerzr for all the cool work on using Accelerate for preparing dataloaders, using accelerator.gather_for_metrics and accelerator.pad_across_processes for further simplification. 🚀
Left a couple of comments regarding a few training arguments being ignored.
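For readers unfamiliar with `pad_across_processes`: before gathering, per-process results of different lengths must be padded to a common shape so they can be concatenated. A list-based pure-Python sketch of that semantics (hypothetical helper; the real Accelerate API operates on tensors across processes):

```python
def pad_across_processes(per_process_results, pad_value=-100):
    """Pad each process's 1-D result list to the max length seen.

    Sketch of why padding is needed before a gather: the gathered
    results must form a rectangular array.
    """
    max_len = max(len(r) for r in per_process_results)
    return [r + [pad_value] * (max_len - len(r)) for r in per_process_results]

padded = pad_across_processes([[1, 2, 3], [4, 5]])  # second list padded to length 3
```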
```diff
-        seed = self.args.data_seed
-        generator.manual_seed(seed)
+        seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed
```
If the user specifies `self.args.data_seed` or `self.args.seed`, it is now ignored. Would this hinder reproducible experimentation?
The tests haven't shown it to, because (from what Sylvain and I were able to find) our dataloaders don't use the same random-seed mechanism these generators do. Keeping the generator in also causes tests to fail, showing it's now detrimental with our new DataLoader setup.
Also, it's used when iterating (`__iter__`) through the dataset here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L665-L667. Because we don't use a torch `DistributedSampler`, I don't believe it applies to us here.
If later down the road someone points to a reproducible issue with this, I'll review it again, but in my deep dive last week I didn't find one.
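The seed-fallback logic in the diff above is straightforward; a pure-Python sketch of it, together with a deterministic index shuffle to illustrate why a resolved seed gives reproducible data order (helper names hypothetical):

```python
import random

def resolve_data_seed(data_seed, seed):
    # Mirrors the fallback in the diff: prefer `data_seed`,
    # otherwise fall back to the global training `seed`.
    return data_seed if data_seed is not None else seed

def shuffled_indices(n, seed):
    # Deterministic shuffle of dataset indices given a resolved seed:
    # the same seed always yields the same data order.
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

order = shuffled_indices(10, resolve_data_seed(None, 42))  # falls back to seed=42
```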
```diff
             if loss is not None:
-                losses = self._nested_gather(loss.repeat(batch_size))
-                losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
+                losses = self.accelerator.gather_for_metrics((loss.repeat(batch_size)))
```
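The diff above swaps the manual `_nested_gather` and host-side concatenation bookkeeping for `accelerator.gather_for_metrics`. A rough pure-Python sketch of that call's semantics (not Accelerate's implementation): results from all processes are concatenated, then truncated to the dataset length so samples duplicated to make batches divisible across processes don't skew metrics.

```python
def gather_for_metrics(per_process_results, dataset_length):
    """Concatenate per-process results and drop trailing duplicates.

    When the dataset size isn't divisible by the number of processes,
    some samples are repeated to fill the last batch; truncating the
    gathered results to `dataset_length` removes those repeats.
    """
    gathered = []
    for result in per_process_results:
        gathered.extend(result)
    return gathered[:dataset_length]

# 2 processes, dataset of 3 samples: the first sample was duplicated
# onto process 1 so both processes received 2 items.
losses = gather_for_metrics([[0.1, 0.2], [0.3, 0.1]], 3)
```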
pacman100 left a comment
Thank you for your responses to the comments and for iterating on the PR. Super cool!
Great PR! Currently this is breaking my custom `collate_fn` in the dataloader; still trying to understand why that is. First assumption: might it be due to multiprocessing?
@franz101 please open an issue with a reproducer of what you are trying to do so we can help :)
What does this PR do?
This PR:
DataLoaderin all basic distributed fashions (replacingpl.Loaderfor TPU coming in a follow-up PR) to replace it withaccelerator.preparetests/trainer/test_trainer.py::TrainerIntegrationTest::test_sampler_seed, deemed to no longer be necessary to reset the seed, as Accelerate's dataloader setup doesn't need any extra help when iterating/loading back in the seed, regardless of the torch versiontests/trainer/test_trainer.py::TrainerIntegrationTest::test_training_finite_iterable_dataset, as with Accelerate's new sampler forIterableDatasetwe'll actually catch if it'sNoneand raise an error, a new test will be made + clear error message on theAccelerateside, with a test added toTrainerafterwards.DataLoadersall havetotal_batch_sizerather thanbatch_sizetest_train_and_eval_dataloadersandtest_data_is_not_parallelized_when_model_is_parallelFixes # (issue)
Before submitting

- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger @pacman100