[CUDA] [CI]: Enable CUDA 12.4 CI #121956

Closed
nWEIdia wants to merge 13 commits into pytorch:main from nWEIdia:cuda_124_ci
@nWEIdia
Collaborator

@nWEIdia nWEIdia commented Mar 15, 2024

Reference PR: #93406

cc @atalman @malfet @ptrblck @eqy

@nWEIdia nWEIdia requested a review from a team as a code owner March 15, 2024 06:48
@pytorch-bot

pytorch-bot bot commented Mar 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121956

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a4a5d05 with merge base 5ea956a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 15, 2024
@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 15, 2024
Contributor

@malfet malfet left a comment

Hmm, I think this needs some discussion, because if we stop building and testing CUDA 11.8, it means we are losing Kepler support, doesn't it?
@atalman is there a doc on whether this is intended?

@johnnynunez
Contributor

johnnynunez commented Mar 17, 2024

Quoting @malfet: "Hmm, I think this needs some discussion, because if we stop building and testing CUDA 11.8, it means we are losing Kepler support, doesn't it? @atalman is there a doc on whether this is intended?"

Why not maintain 11.8 as the CUDA 11 build and 12.4 as the CUDA 12 build, and skip 12.1? I mean always maintain two versions of CUDA:
for example, once CUDA 13 is out, the supported versions would be 12 and 13.
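
The policy proposed above (keep only the newest release of each of the two newest CUDA major versions) can be sketched as follows. This is only an illustration of the suggestion, not anything from the PyTorch CI configuration; the function name `pick_supported` and its `majors_to_keep` parameter are hypothetical.

```python
def pick_supported(versions, majors_to_keep=2):
    """Keep the newest minor release of each of the newest major versions.

    `versions` is a list of "major.minor" strings. The helper is hypothetical
    and exists only to illustrate the proposal in the comment above.
    """
    newest_by_major = {}
    for version in versions:
        major, minor = (int(part) for part in version.split("."))
        # Remember the highest minor release seen for each major version.
        if major not in newest_by_major or minor > newest_by_major[major]:
            newest_by_major[major] = minor
    # Keep only the most recent major versions, oldest first.
    kept_majors = sorted(newest_by_major)[-majors_to_keep:]
    return [f"{major}.{newest_by_major[major]}" for major in kept_majors]


print(pick_supported(["11.8", "12.1", "12.4"]))  # ['11.8', '12.4']
```

Under this rule, 12.1 is dropped as soon as 12.4 is added, and adding a hypothetical 13.0 would retire 11.8.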

@nWEIdia
Collaborator Author

nWEIdia commented Mar 18, 2024

We discussed this: for the short term, we would have 11.8, 12.1, and 12.4. I will need to refactor this PR to add back 11.8.

@nWEIdia nWEIdia changed the title from "CUDA CI changes: 11.8->12.1, 12.1->12.4" to "Draft: CUDA CI changes: 11.8->12.1, 12.1->12.4" Mar 18, 2024
@nWEIdia nWEIdia changed the title from "Draft: CUDA CI changes: 11.8->12.1, 12.1->12.4" to "CUDA CI changes: Add CUDA 12.4 CI" Mar 22, 2024
@johnnynunez
Contributor

When will this be merged? 😋

@nWEIdia
Collaborator Author

nWEIdia commented Mar 24, 2024

12.4 workflows are failing. Still working on coming up with a fix.

@johnnynunez
Contributor

johnnynunez commented Apr 4, 2024

12.4 update 1 is out:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/

@johnnynunez
Contributor

@ptrblck @nWEIdia NVIDIA cuDNN 9 is now available: nvidia-cudnn-cu12 9.0.0.312
https://pypi.org/project/nvidia-cudnn-cu12/

@ptrblck
Collaborator

ptrblck commented Apr 9, 2024

@johnnynunez Yes, it is! We will focus on 12.4 in this PR and follow up with the cuDNN update separately to avoid creating confusing issues pointing to the CUDA and cuDNN update.

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cuda_124_ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_124_ci && git pull --rebase)

@nWEIdia nWEIdia requested a review from jeffdaily as a code owner May 10, 2024 00:16
@nWEIdia nWEIdia added the ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR), ciflow/inductor, ciflow/inductor-perf-test-nightly (Trigger nightly inductor perf tests), and ciflow/slow labels May 10, 2024
@pytorch-bot

pytorch-bot bot commented May 10, 2024

Warning: Unknown label ciflow/inductor-perf-test-nightly.
Currently recognized labels are

  • ciflow/binaries
  • ciflow/binaries_conda
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/inductor
  • ciflow/inductor-perf-compare
  • ciflow/inductor-micro-benchmark
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/periodic
  • ciflow/rocm
  • ciflow/slow
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/xpu
  • ciflow/torchbench

Please add the new label to .github/pytorch-probot.yml

@nWEIdia nWEIdia changed the title from "CUDA CI changes: Add CUDA 12.4 CI" to "[CUDA] [CI]: Enable CUDA 12.4 CI" May 10, 2024
@pytorchmergebot
Collaborator

Successfully rebased cuda_124_ci onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_124_ci && git pull --rebase)

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
@nWEIdia
Collaborator Author

nWEIdia commented May 20, 2024

@malfet Could you please help take another look?
I am composing torchinductor 12.4 issues in here.
Thanks!

@atalman
Contributor

atalman commented May 21, 2024

Hi @nWEIdia, please disable the failing tests. We will follow up on this in the issue you opened.

@atalman
Contributor

atalman commented May 23, 2024

@pytorchmergebot merge -f "All required tests are pasing"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@huydhn
Contributor

huydhn commented May 24, 2024

For context, after this change landed in trunk, the new CUDA 12.4 build started to fail on newly created PyTorch PRs. Here is what happens:

This is not an ideal rollout, but the way forward for now is to ask folks to rebase onto main.

@nWEIdia
Collaborator Author

nWEIdia commented May 24, 2024

Sorry for the mishaps.

The PR went in on 05/23 at 1:37pm. @malfet issued a "pytorch rebase" at 2:22pm on PR #125963, but the result was based on #126976 (10:31am).

I guess the lesson is we should request an immediate push to viable/strict for future occurrences like this.

pytorchmergebot pushed a commit that referenced this pull request May 25, 2024
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024
Reference PR: pytorch#93406

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: pytorch#121956
Approved by: https://github.com/atalman
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024
Labels

ciflow/inductor ciflow/inductor-micro-benchmark ciflow/trunk Trigger trunk jobs on your pull request Merged open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
