Try to fix training Loss inconsistent after resume from old checkpoint #25872
amyeroberts merged 10 commits into huggingface:main from dumpmemory:patch-1
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

cc @muellerzr

Hi @dumpmemory thanks! Can you do

I will check it again.

Done
@dumpmemory what does the following show:
muellerzr
left a comment
Thanks! Looks good to me and my tests all pass locally
@amyeroberts feel free to merge if it looks good to you

I'm OK with this PR 😁. Thanks for your support.
amyeroberts
left a comment
Thanks for working on fixing this! Overall the change looks OK; however, the logic should be simplified.
@muellerzr what was the reason for removing this logic originally?
Originally we had thought Accelerate handled this, but it turns out it does not

@amyeroberts, please help me check the current version.

@amyeroberts can the current version be merged? Is there anything else I need to change? Please just tell me.

@dumpmemory please have a bit of patience, our team works across multiple timezones and has many other PRs and responsibilities to get to aside from this one. We'll get to this when we can, please don't spam :) Thanks
amyeroberts
left a comment
Thanks for iterating - code is looking good!
Just a comment on the utility function - we want functions to be as atomic as possible. Once updated we'll be good to merge.
src/transformers/trainer_pt_utils.py
Outdated
def check_dataloader_randomsampler(dataloader):
    if hasattr(dataloader, "sampler") and isinstance(dataloader.sampler, RandomSampler):
        return dataloader.sampler, True
    if hasattr(dataloader, "batch_sampler"):
        return check_dataloader_randomsampler(dataloader.batch_sampler)
    return dataloader.sampler, False
This should just return the sampler, and then the user can choose what they do with the output e.g. check if it's a random sampler. This ensures the function is as versatile as possible and can be used / extended without issue.
Suggested change:

-def check_dataloader_randomsampler(dataloader):
-    if hasattr(dataloader, "sampler") and isinstance(dataloader.sampler, RandomSampler):
-        return dataloader.sampler, True
-    if hasattr(dataloader, "batch_sampler"):
-        return check_dataloader_randomsampler(dataloader.batch_sampler)
-    return dataloader.sampler, False
+def get_dataloader_sampler(dataloader):
+    if hasattr(dataloader, "sampler"):
+        return dataloader.sampler
+    if hasattr(dataloader, "batch_sampler"):
+        return get_dataloader_sampler(dataloader.batch_sampler)
As I found in #25862, checking hasattr(dataloader, "sampler") might not be enough: after the accelerate.prepare function, dataloader.sampler changes from a random sampler to torch.utils.data.sampler.SequentialSampler. I will modify the code to just return the sampler.
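For illustration, here is a minimal, self-contained sketch of that behaviour (a hypothetical get_dataloader_sampler variant, not the exact code in this PR): when a dataloader is rebuilt around a custom batch sampler, as happens after accelerate.prepare, the top-level .sampler attribute can end up as a default SequentialSampler, and the real RandomSampler is only reachable through batch_sampler, so the lookup has to recurse:

```python
# Minimal sketch of the problem described above (hypothetical helper, not the
# exact code in this PR): the real RandomSampler may only be reachable through
# `batch_sampler`, while the top-level `.sampler` is a SequentialSampler.
import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset


def get_dataloader_sampler(dataloader):
    # Check the batch sampler first: after re-wrapping, `.sampler` on the
    # dataloader itself may just be a default SequentialSampler placeholder.
    if getattr(dataloader, "batch_sampler", None) is not None:
        return get_dataloader_sampler(dataloader.batch_sampler)
    if hasattr(dataloader, "sampler"):
        return dataloader.sampler


dataset = TensorDataset(torch.arange(16))
random_sampler = RandomSampler(dataset)

# Simulate a wrapped dataloader: the RandomSampler only lives inside the
# custom batch_sampler, and DataLoader fills `.sampler` with a default.
wrapped = DataLoader(
    dataset, batch_sampler=BatchSampler(random_sampler, batch_size=4, drop_last=False)
)

print(type(wrapped.sampler).__name__)                  # SequentialSampler
print(type(get_dataloader_sampler(wrapped)).__name__)  # RandomSampler
```

The example only uses plain PyTorch to simulate the wrapping; checking batch_sampler before sampler is one way to make the lookup robust to that placeholder sampler.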
amyeroberts
left a comment
Thanks for iterating on this. Just one last comment on the structure of get_dataloader_sampler
@amyeroberts How about the current version? I have checked the sampler in the final if statement.
amyeroberts
left a comment
@dumpmemory Could you explain in some more detail why the suggested implementation of get_dataloader_sampler isn't the one being used? For the current diff, it's not clear why some of the additional logic e.g. checking isinstance is added.
thanks @amyeroberts

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Thanks for your reviews. I think it is ready now. Thanks for your kind help.
amyeroberts
left a comment
Thanks for iterating!
@dumpmemory There's a current failing test (which I believe is unrelated to your PR). Could you rebase on main to include any recent updates on this branch and trigger a re-run of the CI?

OK, I will do that
Try to fix training Loss inconsistent after resume from old checkpoint (huggingface#25872)

* fix loss inconsistent after resume huggingface#25340
* fix typo
* clean code
* reformatted code
* adjust code according to comments
* adjust check_dataloader_randomsampler location
* return sampler only
* handle sampler is None
* Update src/transformers/trainer_pt_utils.py - thanks @amyeroberts

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Fixes #25340 (issue)
From my side, it might relate to the RandomSampler. I just re-copied the logic from 4.29.2.
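To make the intuition behind that restored logic concrete, here is a simplified, self-contained illustration (the helper name epoch_orders is mine, and this is not the actual 4.29.2 Trainer code): each epoch's shuffle draws from the random state, so a resumed run has to "replay" the sampler for the epochs that were already trained, otherwise every later epoch sees a different data order than the uninterrupted run and the loss curve no longer matches:

```python
# Simplified illustration (not the actual Trainer code): each epoch's shuffle
# consumes random state, so a resumed run must "replay" the sampler for the
# epochs it already trained to reproduce the data order of a full run.
import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))


def epoch_orders(num_epochs, epochs_to_fast_forward=0):
    torch.manual_seed(42)  # the seed a full run and a resumed run would share
    sampler = RandomSampler(dataset)
    # "Fast-forward" over epochs trained before the checkpoint: we iterate the
    # sampler only to advance its random state, not to use the data.
    for _ in range(epochs_to_fast_forward):
        _ = list(sampler)
    return [list(sampler) for _ in range(num_epochs)]


full_run = epoch_orders(3)                           # epochs 0, 1, 2 in one go
resumed = epoch_orders(1, epochs_to_fast_forward=2)  # resume at epoch 2

print(full_run[2] == resumed[0])  # True: epoch 2 sees the same order again
```

In the Trainer itself, the number of already-completed epochs comes from the checkpoint's saved trainer state rather than being hard-coded, and this data replay can be turned off with the ignore_data_skip training argument.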
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.