[Inductor] [CI] [CUDA] Skip the failed models and tests the better way#127150

Closed
nWEIdia wants to merge 9 commits into pytorch:main from nWEIdia:cuda_124_ci_inductor_skip_fine

@nWEIdia
Collaborator

@nWEIdia nWEIdia commented May 25, 2024

Address subtasks in #126692

After enabling the disabled shards, the following two models regressed (for cu124 configuration):
dynamic_inductor_timm_training.csv
cspdarknet53,pass,7 (expected) | cspdarknet53,fail_accuracy,7 (actual)
eca_botnext26ts_256,pass,7 (expected) | eca_botnext26ts_256,fail_accuracy,7 (actual)
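
A minimal sketch of the kind of expected-versus-actual comparison the two lines above report. This is not the CI's actual checking code: the helper names `load_results` and `find_mismatches` are hypothetical, and the third column is assumed to be a graph-break count. Only the four CSV rows come from this PR.

```python
import csv
import io


def load_results(text):
    """Parse 'model,status,count' rows into {model: (status, count)}.

    The third column is assumed (not confirmed here) to be a graph-break count.
    """
    rows = csv.reader(io.StringIO(text.strip()))
    return {name: (status, int(count)) for name, status, count in rows}


def find_mismatches(expected, actual):
    """Return {model: (expected_row, actual_row)} for models whose row changed."""
    return {
        name: (expected[name], actual[name])
        for name in expected
        if name in actual and expected[name] != actual[name]
    }


# Rows copied from the PR description (cu124, dynamic_inductor_timm_training.csv).
EXPECTED = """
cspdarknet53,pass,7
eca_botnext26ts_256,pass,7
"""
ACTUAL = """
cspdarknet53,fail_accuracy,7
eca_botnext26ts_256,fail_accuracy,7
"""

mismatches = find_mismatches(load_results(EXPECTED), load_results(ACTUAL))
for name, (exp, act) in mismatches.items():
    print(f"{name}: expected {exp[0]}, got {act[0]}")
```

Under this sketch, marking a model as an expected failure amounts to changing its status in the reference CSV so that the two rows match again.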

cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @atalman @malfet @ptrblck @eqy @tinglvv @Aidyn-A

@nWEIdia nWEIdia requested a review from a team as a code owner May 25, 2024 01:38
@pytorch-bot

pytorch-bot bot commented May 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127150

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6fd0056 with merge base f4cbcff (image):

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn
Contributor

huydhn commented May 25, 2024

@nWEIdia I should have pinged you earlier but missed your PR. I have a small fix for a 12.4/12.1 typo on inductor workflow here #127121 that has just been landed. You might want to merge in that change.

@nWEIdia
Collaborator Author

nWEIdia commented May 25, 2024

> @nWEIdia I should have pinged you earlier but missed your PR. I have a small fix for a 12.4/12.1 typo on inductor workflow here #127121 that has just been landed. You might want to merge in that change.

Thank you for the fix!

Add back shards and disable models by setting the accuracy to "fail
Relax inductor perf smoke speedup from 4.9 to 4.7 for cuda 12.4 only
Duplicate cu121 csv reference to cu124 to work around pytorch#126692

Change back to pass after fixing pytorch#126692
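
As an illustration of the "for cuda 12.4 only" relaxation named in the commit titles above, a per-version threshold could be sketched as follows. The names `DEFAULT_MIN_SPEEDUP`, `MIN_SPEEDUP_BY_CUDA`, `min_speedup`, and `passes_smoke_test` are hypothetical and do not reflect how the workflow actually implements the check. Only the values 4.9 and 4.7 come from the commit title.

```python
# Hypothetical names; only the 4.9 -> 4.7 relaxation for CUDA 12.4
# is taken from the commit title.
DEFAULT_MIN_SPEEDUP = 4.9
MIN_SPEEDUP_BY_CUDA = {"12.4": 4.7}


def min_speedup(cuda_version):
    """Return the smoke-test speedup threshold for a CUDA version string."""
    return MIN_SPEEDUP_BY_CUDA.get(cuda_version, DEFAULT_MIN_SPEEDUP)


def passes_smoke_test(speedup, cuda_version):
    """True if the measured speedup meets the threshold for this CUDA version."""
    return speedup >= min_speedup(cuda_version)
```

Keeping the override in a per-version table leaves the threshold for every other CUDA version unchanged, which matches the "12.4 only" scope of the commit.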
@nWEIdia nWEIdia force-pushed the cuda_124_ci_inductor_skip_fine branch from 4c5efe9 to 4b723f8 Compare May 29, 2024 00:45
@nWEIdia nWEIdia requested a review from huydhn May 29, 2024 01:23
nWEIdia added 2 commits May 28, 2024 22:51
Just do not import torch under torch/ directory
@nWEIdia nWEIdia requested a review from atalman May 30, 2024 18:32
@nWEIdia nWEIdia added the ciflow/trunk Trigger trunk jobs on your pull request label May 31, 2024
Contributor

@atalman atalman left a comment

lgtm

@nWEIdia
Collaborator Author

nWEIdia commented May 31, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024
pytorch#127150)

Pull Request resolved: pytorch#127150
Approved by: https://github.com/huydhn, https://github.com/eqy, https://github.com/atalman
pytorchmergebot pushed a commit that referenced this pull request Jun 25, 2024
…ilar enough to cu121 (#128423)

Pre-requisite: close #126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: #128423
Approved by: https://github.com/atalman, https://github.com/eqy
pytorchmergebot pushed a commit that referenced this pull request Jun 27, 2024
…ilar enough to cu121 (#128423)

6 participants