
[benchmarks] Add scalar loss as model output when training #158074

Closed

benjaminglass1 wants to merge 11 commits into gh/benjaminglass1/93/base from gh/benjaminglass1/93/head

Conversation

benjaminglass1 (Collaborator) commented Jul 10, 2025

Stack from ghstack (oldest at bottom):

Adds a hook to benchmark model forward passes that, when training, computes a scalar loss and returns it as the first output, detaching all other outputs. This is a requirement for using (experimental) joint-graph export, but can be done without loss of generality.
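For illustration, a minimal sketch of what such a hook might look like; the wrapper name and the `compute_loss` callable are assumptions for this example, not the PR's actual helpers:

```python
import torch

def wrap_forward_with_loss(model, compute_loss):
    # Hypothetical sketch: when training, run the model, reduce its outputs to
    # a scalar loss, and return that loss as the first output while detaching
    # everything else, so only the loss carries autograd history.
    def forward_with_loss(*args, **kwargs):
        outputs = model(*args, **kwargs)
        loss = compute_loss(outputs)  # scalar tensor; keeps the autograd graph
        if not isinstance(outputs, (tuple, list)):
            outputs = (outputs,)
        detached = tuple(
            o.detach() if isinstance(o, torch.Tensor) else o for o in outputs
        )
        return (loss, *detached)
    return forward_with_loss
```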

Additionally, this ensures that Dynamo traces through the loss-calculation function (which it previously did not), reducing the number of graph breaks in models when training.
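In rough terms, the difference is whether the loss computation sits inside or outside the compiled region. A sketch under assumed names (`loss_fn` here is an illustrative stand-in, not the benchmark suite's actual loss helper):

```python
import torch

model = torch.nn.Linear(8, 4)
inp = torch.randn(2, 8)

def loss_fn(out):
    return out.sum()  # stand-in scalar loss

# Before: only the forward was compiled; the loss ran in eager mode,
# outside the traced graph.
loss = loss_fn(torch.compile(model)(inp))

# After: compiling forward and loss together lets Dynamo trace through the
# loss calculation, removing the break at the forward/loss boundary.
@torch.compile
def forward_and_loss(x):
    return loss_fn(model(x))

forward_and_loss(inp).backward()
```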

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

[ghstack-poisoned]
pytorch-bot (Bot) commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158074

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Unrelated Failure

As of commit 1ec9923 with merge base 1abff80:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire (Contributor) commented:

@anijain2305, can you help review this dynamo change?

benjaminglass1 (Collaborator, Author) commented:

@anijain2305 I ended up needing to add one new model XFAIL to this PR, for nvidia_deeprecommender. I've spent several days trying to understand the failure: adding the loss-return hook to that one model somehow changes the output of its forward pass. The failure is extremely confusing, and I've been unable to identify any plausible mechanism for it. Since the hook works for all other models in the training set, adding an XFAIL for this one seems reasonable.

benjaminglass1 added the ciflow/rocm (Trigger "default" config CI on ROCm) and ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm) labels on Jul 15, 2025
benjaminglass1 (Collaborator, Author) commented:

ROCm test failure appears unrelated to this PR.

benjaminglass1 (Collaborator, Author) commented:

Closing this, as we've decided not to pursue the AOTInductor training work at this time.

github-actions (Bot) deleted the gh/benjaminglass1/93/head branch on September 25, 2025 at 02:10

Labels

ciflow/inductor
ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm)
ciflow/rocm (Trigger "default" config CI on ROCm)
module: dynamo
module: rocm (AMD GPU support for PyTorch)
open source
release notes: dynamo

Projects

None yet


4 participants