[CI] Add inductor workflow for rocm #110544

Closed
pragupta wants to merge 10 commits into pytorch:main from pragupta:add-rocm-inductor-ci

Conversation

@pragupta
Collaborator

@pragupta pragupta commented Oct 4, 2023

This PR creates a separate CI job for inductor unit tests (UTs) on ROCm. To trigger this job on a PR, add the ciflow/inductor label. The job also runs automatically on every commit merged into main. It takes around 1.5 hours and runs in parallel with the other ROCm jobs. It runs only on the MI210 CI runners to ensure maximum inductor functionality is tested.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@pytorch-bot

pytorch-bot bot commented Oct 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110544

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit 1db9e19 with merge base 0249c4a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm (AMD GPU support for Pytorch) and topic: not user facing (topic category) labels Oct 4, 2023
@pragupta
Collaborator Author

pragupta commented Oct 4, 2023

@pytorchbot label "ciflow/inductor"

@pragupta
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased add-rocm-inductor-ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-rocm-inductor-ci && git pull --rebase)

pytorchmergebot pushed a commit that referenced this pull request Oct 12, 2023
…107760)

This PR adds a skip decorator which will disable tests in CI for ROCm inductor workflow. This new workflow will be coming in via #110544

Pull Request resolved: #107760
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/atalman
@pragupta pragupta force-pushed the add-rocm-inductor-ci branch 3 times, most recently from 721b16e to 909146e October 17, 2023 02:56
@jithunnair-amd
Collaborator

Waiting on #110511 to be merged before enabling this workflow so as to not overburden ROCm CI.

@pragupta
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/110544/head returned non-zero exit code 1

Rebasing (1/5)
Rebasing (2/5)
Rebasing (3/5)
Rebasing (4/5)
Auto-merging .ci/pytorch/test.sh
Auto-merging test/inductor/test_aot_inductor.py
CONFLICT (content): Merge conflict in test/inductor/test_aot_inductor.py
error: could not apply 7be8feb4f7d... Disable cpp version of test_aot_inductor test for rocm
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 7be8feb4f7d... Disable cpp version of test_aot_inductor test for rocm

Raised by https://github.com/pytorch/pytorch/actions/runs/6894323604

@pragupta
Collaborator Author

Saving the passing CI link before rebasing: https://hud.pytorch.org/pr/110544

@jithunnair-amd
Collaborator

jithunnair-amd commented Nov 16, 2023

> Saving the passing CI link before rebasing: https://hud.pytorch.org/pr/110544

Be aware that the link is dynamic; it will also update when you rebase :) Your best bet is to save the links to the specific jobs that ran, e.g. https://github.com/pytorch/pytorch/actions/runs/6541944809/job/17764595548.

@pragupta pragupta force-pushed the add-rocm-inductor-ci branch from 909146e to b69a69c November 16, 2023 17:53
@pragupta pragupta marked this pull request as ready for review November 16, 2023 17:59
@pragupta pragupta requested a review from a team as a code owner November 16, 2023 17:59
@huydhn
Contributor

huydhn commented Nov 16, 2023

Please help double check the duration of the new test job too. I'm waiting for the current run (https://github.com/pytorch/pytorch/actions/runs/6894570049/job/18757398102) to finish to see how long it takes. More shards might be needed if it takes more than 2 hours.

Comment on lines +21 to +23
#Set Default values for these variables in case they are not set
SHARD_NUMBER="${SHARD_NUMBER:=1}"
NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
Contributor

Why is this needed? I would prefer it to error out if SHARD_NUMBER is not defined.

Collaborator Author

To run the inductor config, it seems that SHARD_NUMBER is now required (https://github.com/pytorch/pytorch/pull/110544/files#diff-9709e5db13aeac90c0312b2b8da34b37cf51242ec789ca959fcf1ba295a8da7aR1101).

Although this variable is always defined in all the CI yaml files that trigger these tests, it's nice to be able to run this file locally outside of CI context.
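The two positions in this review thread map onto two standard shell parameter expansions: `${VAR:=default}` assigns a fallback when the variable is unset or empty, while `${VAR:?message}` aborts with an error instead. The following is a minimal sketch of the difference, not the code in the PR; the function names `with_defaults` and `strict_mode` are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Illustrative sketch only. with_defaults and strict_mode are hypothetical
# names and are not part of .ci/pytorch/test.sh.

# Fallback behavior (what the reviewed lines do): use a single shard
# when the variables are unset or empty. Runs in a subshell so the
# caller's environment is left untouched.
with_defaults() (
  unset SHARD_NUMBER NUM_TEST_SHARDS
  : "${SHARD_NUMBER:=1}"
  : "${NUM_TEST_SHARDS:=1}"
  echo "shard ${SHARD_NUMBER} of ${NUM_TEST_SHARDS}"
)

# Strict behavior (what the reviewer asked about): abort with a message
# when the variable is unset or empty. The subshell exits with a
# non-zero status before reaching the echo.
strict_mode() (
  unset SHARD_NUMBER
  : "${SHARD_NUMBER:?SHARD_NUMBER must be set}"
  echo "shard ${SHARD_NUMBER}"
)

with_defaults
strict_mode || echo "strict_mode failed because SHARD_NUMBER is unset"
```

The fallback form keeps a script runnable locally without CI-provided variables, at the cost of silently masking a misconfigured job; the strict form makes the opposite trade.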

@janeyx99 janeyx99 added the triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module) label Nov 17, 2023
@pragupta
Collaborator Author

> Please help double check the duration of the new test job too. I'm waiting for the current run (https://github.com/pytorch/pytorch/actions/runs/6894570049/job/18757398102) to finish to see how long it takes. More shards might be needed if it takes more than 2 hours.

@huydhn -- I have noticed that it takes around 1.5 hours on average for this job to run. Is that sufficient for one shard?
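The sharding question above comes down to simple arithmetic. As a rough sketch, assuming the tests could be split perfectly evenly (real shards will not be this tidy) and treating the 2-hour figure from the thread as a per-shard threshold rather than a documented limit, the number of shards is a ceiling division. The helper name `shards_needed` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: smallest number of shards that keeps each shard
# at or under a time threshold, assuming a perfectly even split.
shards_needed() {
  total_minutes=$1
  limit_minutes=$2
  # Integer ceiling division: ceil(total / limit).
  echo $(( (total_minutes + limit_minutes - 1) / limit_minutes ))
}

shards_needed 90 120    # a job of about 1.5 hours against a 2-hour threshold
shards_needed 250 120   # a longer, made-up duration for comparison
```

With these inputs the first call prints 1, which is consistent with keeping a single shard for a job of about 1.5 hours, and the second prints 3.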

@jithunnair-amd
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased add-rocm-inductor-ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-rocm-inductor-ci && git pull --rebase)

@jithunnair-amd
Collaborator

@huydhn @malfet Can you please approve and merge this PR?

@jithunnair-amd jithunnair-amd requested a review from jansel January 8, 2024 21:25
@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jan 8, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels

ciflow/inductor; ciflow/rocm (Trigger "default" config CI on ROCm); ciflow/trunk (Trigger trunk jobs on your pull request); Merged; module: inductor; module: rocm (AMD GPU support for Pytorch); open source; rocm priority (high priority ROCm PRs from performance or other aspects); topic: not user facing (topic category); triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

10 participants