Re-implement and refactor the verifier. by ysiraichi · Pull Request #7724 · pytorch/xla

ysiraichi · 2024-07-23T00:27:13Z

This PR re-implements the existing verifier, making the following improvements:

- Enabling the verification of more models on inference
  - Previously, the verifier expected every model to return a single output tensor
- Enabling the verification of training
  - There was a plumbing bug in TorchBenchModel.train
  - Training now runs for a few iterations
- Model cleanup
  - Delete each model after it is used so that we don't run out of memory on large models
- Use PyTorch functions for checking whether the accuracy is acceptable
  - Instead of using mean relative error (MRE), we use the same accuracy check PyTorch does
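
As a rough illustration of the last two points, the sketch below flattens possibly nested outputs and compares each leaf either with an allclose-style tolerance or with cosine similarity. It is plain Python on lists of floats so it runs without PyTorch; `flatten`, `allclose`, `cosine_similarity`, and `outputs_match` are illustrative names rather than the functions this PR adds, and the default thresholds are assumptions.

```python
import math
from typing import Any, Iterator, List


def flatten(output: Any) -> Iterator[List[float]]:
    """Yield every leaf of a nested output.

    Assumption for this sketch: tuples and dicts are containers, and a
    list of floats stands in for a tensor.
    """
    if isinstance(output, dict):
        for key in sorted(output):
            yield from flatten(output[key])
    elif isinstance(output, tuple):
        for item in output:
            yield from flatten(item)
    else:
        yield output


def allclose(actual: List[float], expected: List[float], rtol: float, atol: float) -> bool:
    # Element-wise check: |actual - expected| <= atol + rtol * |expected|.
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


def cosine_similarity(actual: List[float], expected: List[float], eps: float = 1e-8) -> float:
    dot = sum(a * e for a, e in zip(actual, expected))
    norm = math.sqrt(sum(a * a for a in actual)) * math.sqrt(sum(e * e for e in expected))
    return dot / max(norm, eps)


def outputs_match(actual: Any, expected: Any, tol: float = 1e-4,
                  use_cosine: bool = False, cos_threshold: float = 0.99) -> bool:
    actual_leaves = list(flatten(actual))
    expected_leaves = list(flatten(expected))
    if len(actual_leaves) != len(expected_leaves):
        return False
    for a, e in zip(actual_leaves, expected_leaves):
        if use_cosine:
            ok = len(a) == len(e) and cosine_similarity(a, e) >= cos_threshold
        else:
            ok = allclose(a, e, rtol=tol, atol=tol)
        if not ok:
            return False
    return True


reference = ([1.0, 2.0], {"logits": [3.0, 4.0]})
close = ([1.00001, 2.0], {"logits": [3.0, 4.00001]})
scaled = ([2.0, 4.0], {"logits": [6.0, 8.0]})
print(outputs_match(close, reference))
print(outputs_match(scaled, reference))
print(outputs_match(scaled, reference, use_cosine=True))
```

The `scaled` case shows why the two modes are not interchangeable: cosine similarity ignores magnitude, so it accepts an output that the tolerance check rejects.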

In order to do so, here's a summary of the changes:

- Introduce BenchmarkModel methods: tolerance(), use_cosine_similarity, and skip_verifier()
  - The logic is taken from PyTorch, which also uses torchbench.yaml
- Introduce force_dtype parameter when loading a model
  - So that we can run models on eager fp64
- More meaningful verification codes
- Move reset_rng_state and cleanup to util.py
- Change how we access the YAML configuration file, replacing the raw strings
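
For readers unfamiliar with this kind of per-model configuration, here is a minimal sketch of how the three accessors and a verification code could fit together. Only the names `tolerance`, `use_cosine_similarity`, `skip_verifier`, and `VerificationCode` come from this PR; the `CONFIG` keys, the model names, the tolerance values, the enum members, and the `ModelConfig` and `verify` helpers are all hypothetical and do not reflect the actual torchbench.yaml schema or the merged code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

# Hypothetical stand-in for data parsed from torchbench.yaml; the real
# file's keys and model lists may differ.
CONFIG: Dict[str, List[str]] = {
    "higher_tolerance": ["model_a"],
    "cosine_similarity": ["model_b"],
    "skip_verifier": ["model_c"],
}


class VerificationCode(Enum):
    # Illustrative members, not necessarily those defined in the PR.
    PASS = "pass"
    FAIL = "fail"
    SKIPPED = "skipped"


@dataclass
class ModelConfig:
    """Hypothetical holder for the per-model accuracy settings."""
    name: str

    def tolerance(self) -> float:
        # Assumed values: a looser tolerance for models listed as noisy.
        return 1e-2 if self.name in CONFIG["higher_tolerance"] else 1e-4

    @property
    def use_cosine_similarity(self) -> bool:
        # Written as a property only to mirror the spelling in the list above.
        return self.name in CONFIG["cosine_similarity"]

    def skip_verifier(self) -> bool:
        return self.name in CONFIG["skip_verifier"]


def verify(model: ModelConfig, outputs_match: bool) -> VerificationCode:
    """Map a comparison result to a code, honoring the skip list first."""
    if model.skip_verifier():
        return VerificationCode.SKIPPED
    return VerificationCode.PASS if outputs_match else VerificationCode.FAIL


print(verify(ModelConfig("model_c"), outputs_match=True))
print(verify(ModelConfig("model_a"), outputs_match=False))
```

Looking settings up through accessors like these, rather than indexing the parsed YAML with raw strings at each call site, keeps the key names in one place.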

cc @miladm @zpcore

zpcore · 2024-07-23T17:27:32Z

Thanks for adding the verifier!

ysiraichi added the xla:gpu label Jul 23, 2024

ysiraichi requested review from miladm and zpcore July 23, 2024 00:27

ysiraichi added 13 commits July 22, 2024 21:28

Move reset_rng_state to utils.py

6b52711

Add methods for retrieving custom accuracy check configuration.

6f0ba2c

Refactor how _torchbench.yaml_ data is used.

9bf72c0

Move cleanup method into _utils.py_.

4d646e9

Plumb force_dtype argument for instantiating models.

d834c09

Fix lint and add comments.

b578a77

Refactor and polish.

d55df7c

Modify cleanup signature.

29bd9d5

Minor fixes.

95b5bc9

Refactor and fix the verifier.

487e31c

Add an CLI argument for verifier iterations.

6e79fda

Fix lint issues.

f78509a

Fix device retriever.

97402f6

ysiraichi force-pushed the ysiraichi/refactor-verifier branch from 25942f7 to 97402f6 Compare July 23, 2024 00:30

ysiraichi added 2 commits July 22, 2024 21:35

Fix lint issues.

044ccd9

Fix VerificationCode value.

d38f443

zpcore reviewed Jul 23, 2024

Comment thread benchmarks/util.py

zpcore reviewed Jul 23, 2024

Comment thread benchmarks/util.py

zpcore approved these changes Jul 24, 2024

Fix no tensor in the output.

b4d7046

ysiraichi merged commit c69742c into master Jul 24, 2024

This was referenced Jul 25, 2024

[benchmarks] Fix batch size logic. #7747

Merged

Failing Torchbench Models: tracking issue #5932

Open

ysiraichi merged 16 commits into master from ysiraichi/refactor-verifier
