[Dynamo] Fix TIMM benchmark compute_loss#97423

Closed
yanboliang wants to merge 3 commits into pytorch:master from yanboliang:timm

@yanboliang yanboliang commented Mar 23, 2023

Fixes #97382

#95416 fixed a critical bug in the dynamo benchmark: before that PR, AMP tests fell back to eager mode. After that PR, however, we found that a number of TIMM models fail the amp + eager + training accuracy test.
We have now identified the root cause: high loss values make gradient checking harder, because small changes in accumulation order upset the accuracy checks. We should switch to the helper function reduce_to_scalar_loss, which the Torchbench tests already use.
After switching to reduce_to_scalar_loss, the TIMM accuracy pass rate rises from 67.74% to 91.94% in my local test. The remaining 5 failing models (ese_vovnet19b_dw, fbnetc_100, mnasnet_100, mobilevit_s, sebotnet33ts_256) need further investigation and handling, but I think the cause is likely similar.
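
To make the root cause concrete, here is a toy sketch in plain Python rather than the benchmark's actual code. It assumes a check with a fixed absolute tolerance; `ATOL`, `ordered_sum`, and `order_gap` are hypothetical names made up for this illustration, and the real accuracy check may be configured differently. The same terms are summed forward and in reverse, once at small magnitude and once scaled up by 1024. The rounding pattern is identical in both cases, but the absolute gap between the two orders grows with the magnitude of the values.

```python
def ordered_sum(values):
    """Accumulate left to right with plain float adds (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def order_gap(values):
    """Absolute difference between summing forward and summing in reverse."""
    return abs(ordered_sum(values) - ordered_sum(list(reversed(values))))


# Hypothetical absolute tolerance, chosen only for this toy example.
ATOL = 1e-14

small_terms = [0.1, 0.2, 0.3]
# Scaling by a power of two is exact in binary floating point,
# so the rounding pattern is unchanged and only the magnitude differs.
large_terms = [v * 1024.0 for v in small_terms]

small_gap = order_gap(small_terms)
large_gap = order_gap(large_terms)

print(f"small-magnitude gap: {small_gap:.3e}, within ATOL: {small_gap <= ATOL}")
print(f"large-magnitude gap: {large_gap:.3e}, within ATOL: {large_gap <= ATOL}")
```

In this sketch the large-magnitude gap is exactly 1024 times the small one, so a tolerance that accepts the small case rejects the large one. Reducing the model output to a smaller-magnitude scalar, which is the role reduce_to_scalar_loss plays in this PR, keeps such order-dependent differences small in absolute terms.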

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

pytorch-bot bot commented Mar 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97423

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit da505cf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Chillee Chillee left a comment

:O

Perhaps I just got unlucky 🤔 I was looking at mnasnet, and I noticed that both the loss and the output are quite a bit less accurate.

@yanboliang added the ciflow/trunk and topic: not user facing labels Mar 23, 2023
@yanboliang

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Development

Successfully merging this pull request may close these issues.

amp + eager backend + training failing for some timm models