Make TF pt-tf equivalence test more aggressive #15839
ydshieh merged 19 commits into huggingface:master from
Conversation
gante
left a comment
Thorough tests <3, and +1 for leaving the plan as comments; it makes the review easier.
Overall, this looks good. If I understand correctly, it makes the following major changes:
- Much smaller tolerances for differences between PT and TF outputs.
  Yes
- Verifies that all output keys are the same across both models.
  Yes
- Cross-loading is done in memory instead of saving and loading a checkpoint.
  Yes (it was done this way previously)
Is that correct? Are there any other important parts that I missed?
It's correct, and no, you didn't miss anything. One remark: we test 2 cases, labels passed to the model and labels not passed to the model (previously, we only tested the case without passing labels).
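As a sketch of what these checks amount to, here is a hypothetical numpy-only helper (not the actual code in `test_modeling_tf_common.py`, which operates on framework tensors; the function name and structure are illustrative assumptions):

```python
import numpy as np

def check_pt_tf_outputs(pt_outputs, tf_outputs, atol=1e-5):
    """Hypothetical helper mirroring the checks above: identical output keys,
    then element-wise closeness within a tight absolute tolerance."""
    assert set(pt_outputs) == set(tf_outputs), (
        f"output keys differ: {set(pt_outputs) ^ set(tf_outputs)}"
    )
    max_diff = 0.0
    for name, pt_value in pt_outputs.items():
        diff = float(np.max(np.abs(pt_value - tf_outputs[name])))
        assert diff <= atol, f"{name}: max difference {diff:.2e} exceeds {atol}"
        max_diff = max(max_diff, diff)
    return max_diff

# The real test exercises this twice: once with labels passed to the models
# (so loss-related outputs appear) and once without.
```

In the real test, the two dictionaries come from a PyTorch model and its TF counterpart sharing the same weights, cross-loaded in memory.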
tests/test_modeling_tf_common.py
Will we need some kind of relative tolerance here? 1e-5 is a small allowable difference for potentially large values!
Probably, yes. Currently I am confident that 1e-5 works with our current testing configs/models: the weights are initialized with a std of 0.02, so with this setting the output values won't get too large. (Lysandre told me that once we go for GPU testing and fp16 precision testing, we might need to deal with larger errors.)
Yes, I would love to know if this passes on GPU, as it's typically tougher on small differences.
The point regarding fp16 was specifically about bf16: models trained with bf16 typically have much larger logits, so an absolute difference is not ideal and a relative difference should be used instead. Here we do the init ourselves, so that concern doesn't apply.
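The absolute-vs-relative tolerance point can be illustrated with a small numpy sketch (the magnitudes below are illustrative, not taken from the test suite):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a * (1 + 1e-6)           # same values with a 1e-6 *relative* error
# At small magnitudes the absolute difference (~1e-6) fits under atol=1e-5:
assert np.allclose(a, b, atol=1e-5, rtol=0)

big_a = a * 1e5              # bf16-trained models can emit much larger logits
big_b = big_a * (1 + 1e-6)   # same relative error, but absolute diff is ~0.1-0.3
assert not np.allclose(big_a, big_b, atol=1e-5, rtol=0)

# A relative tolerance handles both scales uniformly:
assert np.allclose(a, b, atol=0, rtol=1e-5)
assert np.allclose(big_a, big_b, atol=0, rtol=1e-5)
```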
LysandreJik
left a comment
Thank you for your time spent making this better! Two questions:
- Does it run on GPU?
- How long does it take to run? Our test suite already takes a significant time to run, so aiming for common tests that run as fast as possible is important.
There's quite a bit of model-specific logic, which I'm not particularly enthusiastic about (pre-training models + ConvNext), but I understand why it's rigorous to handle it like that here.
tests/test_modeling_tf_common.py
For now, I haven't tested it on GPU. I can run it on the office's GPU machine this week (a good chance to learn how to connect to those machines!).
Let me measure the timing of this test on the current master and on this PR; I will report it.
(Yeah, once we fix all the inconsistencies, we can remove all these exceptional conditions.)
sgugger
left a comment
Thanks for writing those new tests!
Good news! Testing on a single GPU with the small tolerance passes. I will address a few style review suggestions. After that, I think it's ready to merge..?
It's weird that it took more time when you expected it to take less, no? Can you try running the test suite with
Force-pushed from 31fdfdc to 32fb03d
After a thorough verification and a few fixes, this PR is ready (again) from my side. I would love @sgugger (and @LysandreJik when he is back) to check it again, and @gante & @Rocketknight1 if they want (there is no particular TF-related change compared to the last commit). The following summary might save some review time.
I ran this new version of the test with the small tolerance. To be super sure, I also ran it on CPU/GPU 100 times (very aggressive 🔥 🔥)! All models passed this test, except for:
Regarding the running time, I will measure it and post the results in the next comment.
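The repeated-run measurement can be sketched as follows (a toy simulation: `run_equivalence_once` is a hypothetical stand-in for one real PT/TF forward-pass pair, which the actual test performs with framework models):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_equivalence_once(rng):
    """Toy stand-in for one PT/TF forward-pass pair: two computations on the
    same input that differ only by simulated float rounding noise."""
    x = rng.standard_normal(16).astype(np.float32)
    pt_out = x.astype(np.float64)
    tf_out = (x + rng.uniform(-1e-7, 1e-7, size=x.shape)).astype(np.float64)
    return float(np.max(np.abs(pt_out - tf_out)))

# Repeat many times and keep the worst observed difference, mirroring the
# 100 CPU/GPU runs per model described above.
worst = max(run_equivalence_once(rng) for _ in range(100))
assert worst <= 1e-5
```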
Force-pushed from 42ebc89 to e9555ea
sgugger
left a comment
Still good to me! Left two nits.
tests/test_modeling_tf_common.py
Should this be cleaned up then?
This block should be activated in the future:
1. We want to clean up (in next steps) the large negative attention-mask values like `-1e9`; currently, this block will fail because different models use different such values.
2. Once 1. is done, we should enable this block to make sure there is no regression and that new models work in the same manner as the existing ones.
3. (If you prefer, I can remove this block in this PR and add it back once we are ready for it.)
(I probably said it the wrong way previously: this block is intended to be kept (and enabled) rather than removed, sorry.)
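For background on point 1.: additive attention masks use a large negative constant so masked positions get essentially zero weight after softmax. A small numpy illustration (the scores and constant here are illustrative, not taken from any particular model):

```python
import numpy as np

# Additive attention masks add a large negative constant (e.g. -1e9) so that
# masked positions vanish after softmax. Models historically used different
# constants, which is why the cross-model comparison currently fails.
scores = np.array([2.0, 1.0, 3.0])
mask = np.array([0.0, -1e9, 0.0])      # position 1 is masked out
masked = scores + mask
probs = np.exp(masked - masked.max())  # numerically stable softmax
probs /= probs.sum()
assert probs[1] < 1e-8                 # masked position is effectively zeroed
```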
If the plan is to bring it back, by all means leave it! But make it clear in the comment :-)
Yes, sir! Done as follows.
(I also changed the remaining tfo/pto to tf_output/pt_output.)
I think I made a mistake that
After using a Tesla T4 GPU (the same as for the CI GPU testing), I confirmed that this aggressive PT/TF equivalence test passes with the small tolerance. I ran this test 1000 times (on GPU) for each TF model. I think it's ready if @LysandreJik is happy with the slightly(?) increased running time. (I saved all the differences; I can share the values if you would like to see them!)
LysandreJik
left a comment
Sounds good to me, thank you for working on it @ydshieh!
Force-pushed from a6c7157 to 2e43334
What does this PR do?
Make TF pt-tf equivalence test more aggressive.
After the series of fixes done so far, I think it is a good time to include this more aggressive testing on the `master` branch. (Otherwise, newly added models might have undetected issues. For example, the recent `TFConvNextModel` would have had its `hidden_states` not transposed to match the PyTorch version; I tested it on my local branch and informed the author to fix it.) There are still 3 categories of PT/TF inconsistency to address, but they are less urgent in my opinion. See below.
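For context on the `TFConvNextModel` example: TF vision models typically keep feature maps in NHWC layout while PyTorch uses NCHW, so matching `hidden_states` requires a transpose. A minimal sketch (the shapes below are illustrative, not the model's actual dimensions):

```python
import numpy as np

# Hypothetical feature map: TF layout is (batch, height, width, channels);
# PyTorch layout is (batch, channels, height, width).
tf_hidden = np.zeros((2, 7, 7, 96))
pt_like = np.transpose(tf_hidden, (0, 3, 1, 2))  # NHWC -> NCHW
assert pt_like.shape == (2, 96, 7, 7)
```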
Currently, the test makes a few exceptions to not test these 3 cases (in order to get a green test); I added `TODO` comments in the code.
TF: @Rocketknight1 @gante
Test: @LysandreJik @sgugger
TODO in separate PRs