[Inductor] [CI] [CUDA] Skip the failed models and tests the better way by nWEIdia · Pull Request #127150 · pytorch/pytorch

nWEIdia · 2024-05-25T01:38:22Z

Address subtasks in #126692

After enabling the disabled shards, the following two models regressed (for the cu124 configuration):
dynamic_inductor_timm_training.csv
cspdarknet53,pass,7 (expected) | cspdarknet53,fail_accuracy,7 (actual)
eca_botnext26ts_256,pass,7 (expected) | eca_botnext26ts_256,fail_accuracy,7 (actual)
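
For readers unfamiliar with the expected-accuracy files, here is a minimal sketch of how such a mismatch could be detected. It assumes each row has the form name,status,graph_breaks, as in the two lines above. The helper names load_status and find_regressions are hypothetical and are not part of the PyTorch CI scripts.

```python
import csv
import io


def load_status(csv_text):
    """Map model name -> (status, graph_breaks) from rows like 'cspdarknet53,pass,7'."""
    rows = csv.reader(io.StringIO(csv_text))
    # Skip blank lines; the three-column layout is an assumption taken from the rows quoted above.
    return {row[0]: (row[1], int(row[2])) for row in rows if row}


def find_regressions(expected_text, actual_text):
    """Return {model: (expected, actual)} for models whose recorded result changed."""
    expected = load_status(expected_text)
    actual = load_status(actual_text)
    return {
        name: (expected[name], actual[name])
        for name in expected
        if name in actual and expected[name] != actual[name]
    }


# The two rows reported in this PR description.
EXPECTED = "cspdarknet53,pass,7\neca_botnext26ts_256,pass,7\n"
ACTUAL = "cspdarknet53,fail_accuracy,7\neca_botnext26ts_256,fail_accuracy,7\n"

for name, (exp, act) in find_regressions(EXPECTED, ACTUAL).items():
    print(f"{name}: expected {exp[0]}, got {act[0]}")
```

Updating the expected status in the CSV to match the actual one is what lets the shard stay enabled while the underlying accuracy issue is tracked separately.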

cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @atalman @malfet @ptrblck @eqy @tinglvv @Aidyn-A

pytorch-bot · 2024-05-25T01:38:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127150

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

Upgrade MacOS runner to 14

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6fd0056 with merge base f4cbcff:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.24xl.spr-metal, unstable) (gh) (#126993)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

huydhn · 2024-05-25T02:38:09Z

@nWEIdia I should have pinged you earlier but missed your PR. I have a small fix for a 12.4/12.1 typo in the inductor workflow in #127121, which has just landed. You might want to merge in that change.

benchmarks/dynamo/ci_expected_accuracy/dynamic_inductor_timm_training.csv

nWEIdia · 2024-05-25T03:35:20Z

Thank you for the fix!

.ci/pytorch/test.sh

Add back shards and disable models by setting the accuracy to "fail
Relax inductor perf smoke speedup from 4.9 to 4.7 for cuda 12.4 only
Duplicate cu121 csv reference to cu124 to work around pytorch#126692
Change back to pass after fixing pytorch#126692
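
The relaxed perf smoke threshold can be pictured with a short sketch. Only the values 4.9 and 4.7 and the "cuda 12.4 only" condition come from the commit message above. The function names, the version string format, and the greater-or-equal comparison are assumptions made for illustration, not the logic in .ci/pytorch/test.sh.

```python
from typing import Optional

DEFAULT_SPEEDUP_THRESHOLD = 4.9  # value before the relaxation, per the commit message
CU124_SPEEDUP_THRESHOLD = 4.7    # relaxed value, applied to CUDA 12.4 only


def smoke_speedup_threshold(cuda_version: Optional[str]) -> float:
    """Pick the threshold for a CUDA version string such as '12.4' (hypothetical helper)."""
    if cuda_version == "12.4":
        return CU124_SPEEDUP_THRESHOLD
    return DEFAULT_SPEEDUP_THRESHOLD


def passes_smoke_test(measured_speedup: float, cuda_version: Optional[str]) -> bool:
    """Assumed pass criterion: the measured speedup must reach the threshold."""
    return measured_speedup >= smoke_speedup_threshold(cuda_version)
```

Under these assumptions, a measured speedup of 4.8 would pass on CUDA 12.4 and fail on CUDA 12.1, which is the narrow effect the commit message describes.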

This reverts commit 14a1867.

Just do not import torch under torch/ directory

(https://github.com/pytorch/pytorch/actions/runs/9297551188/job/25587899137, https://github.com/pytorch/pytorch/actions/runs/9297551188/job/25587898918, https://github.com/pytorch/pytorch/actions/runs/9297551188/job/25587898648) Add two more test failures skip for cu124...

atalman

lgtm

nWEIdia · 2024-05-31T16:33:52Z

@pytorchbot merge

pytorchmergebot · 2024-05-31T16:35:35Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorch#127150)
Address subtasks in pytorch#126692
After enabling the disabled shards, the following two models regressed (for cu124 configuration):
dynamic_inductor_timm_training.csv
cspdarknet53,pass,7 (expected) | cspdarknet53,fail_accuracy,7 (actual)
eca_botnext26ts_256,pass,7 (expected) | eca_botnext26ts_256,fail_accuracy,7 (actual)
Pull Request resolved: pytorch#127150
Approved by: https://github.com/huydhn, https://github.com/eqy, https://github.com/atalman

…ilar enough to cu121 (#128423)
Pre-requisite: close #126692 first.
This PR also gives a current read on cu121 and cu124 parity.
Essentially reverting #127150
Pull Request resolved: #128423
Approved by: https://github.com/atalman, https://github.com/eqy

nWEIdia requested a review from a team as a code owner May 25, 2024 01:38

pytorch-bot bot added the ciflow/inductor, module: dynamo, and release notes: releng labels May 25, 2024

pytorchbot added the open source label May 25, 2024

huydhn added a commit to huydhn/pytorch that referenced this pull request May 25, 2024

Avoid conflicts with pytorch#127150

f94f513

huydhn mentioned this pull request May 25, 2024

Fix typo in inductor workflow for CUDA 12.4 jobs #127121

Closed

huydhn approved these changes May 25, 2024

huydhn reviewed May 25, 2024

benchmarks/dynamo/ci_expected_accuracy/dynamic_inductor_timm_training.csv (outdated, resolved)

nWEIdia commented May 25, 2024

.ci/pytorch/test.sh (outdated, resolved)

nWEIdia force-pushed the cuda_124_ci_inductor_skip_fine branch from 4c5efe9 to 4b723f8 Compare May 29, 2024 00:45

eqy approved these changes May 29, 2024

Trying to fix torch.version error.

14a1867

nWEIdia requested a review from huydhn May 29, 2024 01:23

nWEIdia added 2 commits May 28, 2024 22:51

Revert "Trying to fix torch.version error."

9961105

This reverts commit 14a1867.

Reverting previous complicated fix. Retry pushd/popd

646a48b

Just do not import torch under torch/ directory
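
The commits above deal with reading torch.version.cuda from CI. One common reason such an import misbehaves is that a Python process started inside a source checkout can pick up the local torch/ directory instead of the installed package. The sketch below illustrates the "import from somewhere else" idea in general terms. The helper name read_attr_outside_checkout is hypothetical, and this is not the change made to .ci/pytorch/test.sh.

```python
import subprocess
import sys
import tempfile


def read_attr_outside_checkout(module: str, attr: str) -> str:
    """Import `module` in a child interpreter started from a neutral directory.

    With `python -c`, the current directory is placed first on sys.path, so
    starting the child in an empty temporary directory keeps a local source
    folder of the same name from shadowing the installed package.
    """
    code = f"import {module}; print({module}.{attr})"
    with tempfile.TemporaryDirectory() as neutral_dir:
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=neutral_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout.strip()
```

On a machine with PyTorch installed, read_attr_outside_checkout("torch", "version.cuda") would return the CUDA version string that the build reports, regardless of the directory the calling script runs from.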

nWEIdia added the skip-pr-sanity-checks label May 29, 2024

nWEIdia added 4 commits May 29, 2024 18:12

Try torch.version.cuda in a different location

deedf29

// in the path does not work, how about /./?

ed13b84

Too many typos

abcf8ba

nWEIdia mentioned this pull request May 30, 2024

CUDA 12.4 CI Inductor Issues #126692

Closed

nWEIdia requested a review from atalman May 30, 2024 18:32

Tweak model pass/fail_accuracy status

6fd0056

nWEIdia added the ciflow/trunk label (Trigger trunk jobs on your pull request) May 31, 2024

atalman approved these changes May 31, 2024

pytorchmergebot added the merging label May 31, 2024

pytorchmergebot closed this in 67f0807 May 31, 2024

pytorchmergebot added Merged and removed merging labels May 31, 2024

nWEIdia mentioned this pull request Jun 24, 2024

[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 #128423

Closed

nWEIdia mentioned this pull request Nov 18, 2024

[BE]: Update CUDNN for Linux to 9.5.1.17 for 12.6 only #137978

Closed

nWEIdia wants to merge 9 commits into pytorch:main from nWEIdia:cuda_124_ci_inductor_skip_fine

6 participants
