
[benchmarks] Add scalar loss as model output when training #158074

Closed

benjaminglass1 wants to merge 11 commits into gh/benjaminglass1/93/base from gh/benjaminglass1/93/head

Conversation

benjaminglass1 (Collaborator) commented Jul 10, 2025

Stack from ghstack (oldest at bottom):

Adds a hook to benchmark model forward passes that, when training, computes a scalar loss and returns it as the first output, detaching all other outputs. This is a requirement for using (experimental) joint-graph export, but can be done without loss of generality.
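For illustration, a minimal sketch of what such a hook might look like; the wrapper name and the `compute_loss` callable are assumptions for this example, not the PR's actual helpers:

```python
import torch

def wrap_forward_with_loss(model, compute_loss):
    # Hypothetical sketch: when training, run the model, reduce its outputs to
    # a scalar loss, and return that loss as the first output while detaching
    # everything else, so only the loss carries autograd history.
    def forward_with_loss(*args, **kwargs):
        outputs = model(*args, **kwargs)
        loss = compute_loss(outputs)  # scalar tensor; keeps the autograd graph
        if not isinstance(outputs, (tuple, list)):
            outputs = (outputs,)
        detached = tuple(
            o.detach() if isinstance(o, torch.Tensor) else o for o in outputs
        )
        return (loss, *detached)
    return forward_with_loss
```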

Additionally, this ensures that Dynamo traces through the loss-calculation function (which it previously did not), reducing the number of graph breaks in models when training.
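In rough terms, the difference is whether the loss computation sits inside or outside the compiled region. A sketch under assumed names (`loss_fn` here is an illustrative stand-in, not the benchmark suite's actual loss helper):

```python
import torch

model = torch.nn.Linear(8, 4)
inp = torch.randn(2, 8)

def loss_fn(out):
    return out.sum()  # stand-in scalar loss

# Before: only the forward was compiled; the loss ran in eager mode,
# outside the traced graph.
loss = loss_fn(torch.compile(model)(inp))

# After: compiling forward and loss together lets Dynamo trace through the
# loss calculation, removing the break at the forward/loss boundary.
@torch.compile
def forward_and_loss(x):
    return loss_fn(model(x))

forward_and_loss(inp).backward()
```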

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

[ghstack-poisoned]
pytorch-bot (Bot) commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158074

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Unrelated Failure

As of commit 1ec9923 with merge base 1abff80:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire (Contributor) commented:

@anijain2305, can you help review this dynamo change?

benjaminglass1 (Collaborator, Author) commented:

@anijain2305 I ended up needing to add one new model XFAIL to this PR, for nvidia_deeprecommender. I've spent several days trying to understand the failure: adding the loss-return hook to that one model somehow changes the output of its forward pass. The failure is extremely confusing, and I've been unable to identify any plausible mechanism for it. Since the hook works for all other models in the training set, adding an XFAIL for this one seems reasonable.

benjaminglass1 added the ciflow/rocm (Trigger "default" config CI on ROCm) and ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm) labels on Jul 15, 2025
benjaminglass1 (Collaborator, Author) commented:

ROCm test failure appears unrelated to this PR.

benjaminglass1 (Collaborator, Author) commented:

Closing this, as we've decided not to pursue the AOTInductor training work at this time.

github-actions (Bot) deleted the gh/benjaminglass1/93/head branch on September 25, 2025 at 02:10

Labels

ciflow/inductor
ciflow/inductor-rocm (Trigger "inductor" config CI on ROCm)
ciflow/rocm (Trigger "default" config CI on ROCm)
module: dynamo
module: rocm (AMD GPU support for PyTorch)
open source
release notes: dynamo

Projects

None yet


4 participants