Make TF pt-tf equivalence test more aggressive #15839
ydshieh merged 19 commits into huggingface:master from
Conversation
gante
left a comment
Thorough tests <3, and +1 for leaving the plan as comments; it makes the review easier.
Overall, this looks good. If I understand correctly, it makes the following major changes:
- Much smaller tolerances for differences between PT and TF outputs.
  Yes
- Verifies that all output keys are the same across both models.
  Yes
- Cross-loading is done in memory instead of saving and loading a checkpoint.
  Yes (it was done this way previously)
Is that correct? Are there any other important parts that I missed?
It's correct, and no, you didn't miss anything. One remark: we test 2 cases, labels passed to the model and labels not passed to the model (previously, we only tested the case without passing labels).
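As a sketch of what these checks amount to, here is a hypothetical numpy-only helper (not the actual code in `test_modeling_tf_common.py`, which operates on framework tensors; the function name and structure are illustrative assumptions):

```python
import numpy as np

def check_pt_tf_outputs(pt_outputs, tf_outputs, atol=1e-5):
    """Hypothetical helper mirroring the checks above: identical output keys,
    then element-wise closeness within a tight absolute tolerance."""
    assert set(pt_outputs) == set(tf_outputs), (
        f"output keys differ: {set(pt_outputs) ^ set(tf_outputs)}"
    )
    max_diff = 0.0
    for name, pt_value in pt_outputs.items():
        diff = float(np.max(np.abs(pt_value - tf_outputs[name])))
        assert diff <= atol, f"{name}: max difference {diff:.2e} exceeds {atol}"
        max_diff = max(max_diff, diff)
    return max_diff

# The real test exercises this twice: once with labels passed to the models
# (so loss-related outputs appear) and once without.
```

In the real test, the two dictionaries come from a PyTorch model and its TF counterpart sharing the same weights, cross-loaded in memory.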
tests/test_modeling_tf_common.py
Will we need some kind of relative tolerance here? 1e-5 is a small allowable difference for potentially large values!
Probably, yes. Currently I am confident that 1e-5 works with our current testing configs/models: the weights are initialized with a std of 0.02, so with this setting the output values won't get too large. (Lysandre told me that once we go for GPU testing and fp16 precision testing, we might need to deal with larger errors.)
Yes, I would love to know if this passes on GPU, as it's typically tougher on small differences.
The point regarding fp16 was specifically about bf16: models trained with bf16 typically have much larger logits, so an absolute difference is not ideal and a relative difference should be used instead. Here we do the init ourselves, so that concern doesn't apply.
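The absolute-vs-relative tolerance point can be illustrated with a small numpy sketch (the magnitudes below are illustrative, not taken from the test suite):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a * (1 + 1e-6)           # same values with a 1e-6 *relative* error
# At small magnitudes the absolute difference (~1e-6) fits under atol=1e-5:
assert np.allclose(a, b, atol=1e-5, rtol=0)

big_a = a * 1e5              # bf16-trained models can emit much larger logits
big_b = big_a * (1 + 1e-6)   # same relative error, but absolute diff is ~0.1-0.3
assert not np.allclose(big_a, big_b, atol=1e-5, rtol=0)

# A relative tolerance handles both scales uniformly:
assert np.allclose(a, b, atol=0, rtol=1e-5)
assert np.allclose(big_a, big_b, atol=0, rtol=1e-5)
```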
LysandreJik
left a comment
Thank you for your time spent making this better! Two questions:
- Does it run on GPU?
- How long does it take to run? Our test suite already takes a significant time to run, so aiming for common tests that run as fast as possible is important.
There's quite a bit of model-specific logic, which I'm not particularly enthusiastic about (pre-training models + ConvNext), but I understand why it's rigorous to handle it like that here.
tests/test_modeling_tf_common.py
For now, I haven't tested it on GPU. I can run it on the office's GPU machine this week (a good chance to learn how to connect to those machines!).
Let me measure the timing of this test on the current master and on this PR; I will report it.
(Yeah, once we fix all the inconsistencies, we can remove all these exceptional conditions.)
sgugger
left a comment
Thanks for writing those new tests!
Good news! Testing on a single GPU with the small tolerance passes. I will address a few style review suggestions. After that, I think it's ready to merge..?
It's weird that it took more time when you expected it to take less, no? Can you try running the test suite with
Force-pushed from 31fdfdc to 32fb03d
After a thorough verification and a few fixes, this PR is ready (again) from my side. I would love @sgugger (and @LysandreJik when he is back) to check it again, and @gante & @Rocketknight1 if they want (there is no particular TF-related change compared to the last commit). The following summary might save some review time.
I ran this new version of the test with the small tolerance. To be super sure, I also ran it on CPU/GPU 100 times (very aggressive 🔥 🔥)! All models passed this test, except for:
Regarding the running time, I will measure it and post the results in the next comment.
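The repeated-run measurement can be sketched as follows (a toy simulation: `run_equivalence_once` is a hypothetical stand-in for one real PT/TF forward-pass pair, which the actual test performs with framework models):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_equivalence_once(rng):
    """Toy stand-in for one PT/TF forward-pass pair: two computations on the
    same input that differ only by simulated float rounding noise."""
    x = rng.standard_normal(16).astype(np.float32)
    pt_out = x.astype(np.float64)
    tf_out = (x + rng.uniform(-1e-7, 1e-7, size=x.shape)).astype(np.float64)
    return float(np.max(np.abs(pt_out - tf_out)))

# Repeat many times and keep the worst observed difference, mirroring the
# 100 CPU/GPU runs per model described above.
worst = max(run_equivalence_once(rng) for _ in range(100))
assert worst <= 1e-5
```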
Force-pushed from 42ebc89 to e9555ea
sgugger
left a comment
Still good to me! Left two nits.
tests/test_modeling_tf_common.py
Should this be cleaned up then?
This block should be activated in the future:
1. We want to clean up (in next steps) the large negative attention-mask values like `-1e9`; currently, this block will fail because different models use different such values.
2. Once 1. is done, we should enable this block to make sure there is no regression and that new models work in the same manner as the existing ones.
3. (If you prefer, I can remove this block in this PR and add it back once we are ready for it.)
(I probably said it the wrong way previously: this block is intended to be kept (and enabled) rather than removed, sorry.)
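For background on point 1.: additive attention masks use a large negative constant so masked positions get essentially zero weight after softmax. A small numpy illustration (the scores and constant here are illustrative, not taken from any particular model):

```python
import numpy as np

# Additive attention masks add a large negative constant (e.g. -1e9) so that
# masked positions vanish after softmax. Models historically used different
# constants, which is why the cross-model comparison currently fails.
scores = np.array([2.0, 1.0, 3.0])
mask = np.array([0.0, -1e9, 0.0])      # position 1 is masked out
masked = scores + mask
probs = np.exp(masked - masked.max())  # numerically stable softmax
probs /= probs.sum()
assert probs[1] < 1e-8                 # masked position is effectively zeroed
```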
If the plan is to bring it back, by all means leave it! But make it clear in the comment :-)
Yes, sir! Done as follows.
(I also changed the remaining tfo/pto to tf_output/pt_output.)
I think I made a mistake that
After using a Tesla T4 GPU (the same as for the CI GPU testing), I confirmed that this aggressive PT/TF equivalence test passes with the small tolerance. I ran this test 1000 times (on GPU) for each TF model. I think it's ready if @LysandreJik is happy with the slightly(?) increased running time. (I saved all the differences; I can share the values if you would like to see them!)
LysandreJik
left a comment
Sounds good to me, thank you for working on it @ydshieh!
Force-pushed from a6c7157 to 2e43334
What does this PR do?
Make TF pt-tf equivalence test more aggressive.
After the series of fixes done so far, I think it is a good time to include this more aggressive testing on the `master` branch. (Otherwise, newly added models might have undetected issues. For example, the recent `TFConvNextModel` would have had its `hidden_states` not transposed to match the PyTorch version; I tested it on my local branch and informed the author to fix it.) There are still 3 categories of PT/TF inconsistency to address, but they are less urgent in my opinion. See below.
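For context on the `TFConvNextModel` example: TF vision models typically keep feature maps in NHWC layout while PyTorch uses NCHW, so matching `hidden_states` requires a transpose. A minimal sketch (the shapes below are illustrative, not the model's actual dimensions):

```python
import numpy as np

# Hypothetical feature map: TF layout is (batch, height, width, channels);
# PyTorch layout is (batch, channels, height, width).
tf_hidden = np.zeros((2, 7, 7, 96))
pt_like = np.transpose(tf_hidden, (0, 3, 1, 2))  # NHWC -> NCHW
assert pt_like.shape == (2, 96, 7, 7)
```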
Currently, the test makes a few exceptions to not test these 3 cases (in order to get a green test); I added `TODO` comments in the code.
TF: @Rocketknight1 @gante
Test: @LysandreJik @sgugger
TODO in separate PRs