
Add CUDA 12.6 Linux Builds to Binaries Matrix#138899

Closed
tinglvv wants to merge 20 commits into pytorch:main from
tinglvv:cuda-12.6-ci

Conversation

@tinglvv
Collaborator

@tinglvv tinglvv commented Oct 25, 2024

@pytorch-bot

pytorch-bot bot commented Oct 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138899

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit d55064c with merge base ea0f60e:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@malfet malfet left a comment

Why are we adding a new flavor? Let's delete something (for example 12.1)

@tinglvv tinglvv added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Oct 25, 2024
@tinglvv
Collaborator Author

tinglvv commented Oct 25, 2024

Removing 12.1 for the nightly binary build per suggestion.
CI/docker images will be deprecated at a later stage.

@tinglvv tinglvv marked this pull request as ready for review October 29, 2024 22:06
@tinglvv tinglvv requested a review from a team as a code owner October 29, 2024 22:06
@tinglvv
Collaborator Author

tinglvv commented Oct 29, 2024

Not sure whether we should remove 12.1 from LINUX_BINARY_SMOKE_WORKFLOWS; removing it temporarily because of the error below:

tingl@tingl-mlt pytorch % sh .github/regenerate.sh 
Traceback (most recent call last):
  File "/Users/tingl/Documents/github/pytorch/.github/scripts/generate_ci_workflows.py", line 177, in <module>
    build_configs=generate_binary_build_matrix.generate_wheels_matrix(
  File "/Users/tingl/Documents/github/pytorch/.github/scripts/generate_binary_build_matrix.py", line 471, in generate_wheels_matrix
    "container_image": WHEEL_CONTAINER_IMAGES[arch_version],
KeyError: '12.1'
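
The traceback above shows a plain dictionary lookup failing because 12.1 is still requested somewhere after its container image entry was removed. A minimal sketch of that failure mode follows, assuming a simplified image table; `CONTAINER_IMAGES`, `build_wheel_configs`, and the image names are hypothetical stand-ins, not the actual contents of `generate_binary_build_matrix.py`.

```python
# Hypothetical stand-in for the image table; the image names are made up.
CONTAINER_IMAGES = {
    "12.4": "example/manylinux-cuda12.4",
    "12.6": "example/manylinux-cuda12.6",
}


def build_wheel_configs(arches):
    """Build one config per requested arch, failing fast on unknown versions."""
    unknown = [a for a in arches if a not in CONTAINER_IMAGES]
    if unknown:
        # A plain CONTAINER_IMAGES[a] lookup would raise KeyError here instead.
        raise ValueError(f"no container image registered for: {unknown}")
    return [
        {"gpu_arch_version": a, "container_image": CONTAINER_IMAGES[a]}
        for a in arches
    ]


# A caller that still asks for 12.1 after its image was removed hits the
# same problem, surfaced here as a more descriptive error.
try:
    build_wheel_configs(["12.1"])
except ValueError as err:
    print(err)
```

Validating the requested versions up front is only one possible design; the change discussed in this thread instead removes 12.1 from the workflows that request it.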

@tinglvv tinglvv marked this pull request as draft October 29, 2024 22:19
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
Collaborator

This might be a good time to update CUDNN as well anyway?

Collaborator

No, let's not mix different updates (CUDA and cuDNN) into the same PR, but follow up separately.

Contributor

@atalman atalman left a comment

Add an exception in generate_conda_matrix to not include any 12.6 builds. We don't want to add new conda builds for 12.6
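
One way such an exception could look is a small exclusion set applied when the conda arch list is assembled. This is a sketch under stated assumptions: `CUDA_ARCHES`, `CONDA_EXCLUDED_CUDA_ARCHES`, and `conda_arches` are hypothetical names, the version list is illustrative, and the real `generate_conda_matrix` may be structured differently.

```python
# Illustrative version list; the real script derives its arches differently.
CUDA_ARCHES = ["11.8", "12.4", "12.6"]

# Hypothetical exclusion set: no new conda builds for CUDA 12.6.
CONDA_EXCLUDED_CUDA_ARCHES = {"12.6"}


def conda_arches(cuda_arches=CUDA_ARCHES, excluded=CONDA_EXCLUDED_CUDA_ARCHES):
    """Return CPU plus every CUDA version that is still built for conda."""
    return ["cpu"] + [a for a in cuda_arches if a not in excluded]


print(conda_arches())  # ['cpu', '11.8', '12.4']
```

Keeping the exclusion in a named set means wheel and libtorch matrices can keep using the full CUDA list while conda opts out explicitly.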

@tinglvv
Collaborator Author

tinglvv commented Nov 8, 2024

The windows-binary-wheel error might be due to #138458, which set 12.4 as the default.

@tinglvv
Collaborator Author

tinglvv commented Nov 8, 2024

The Linux aarch64 failures should be resolved after correcting the build script for aarch64.
windows-conda-build fails with:
Run actions/upload-artifact@v4.4.0
Error: No files were found with the provided path: C:\actions-runner\_work\_temp/artifacts. No artifacts will be uploaded

@tinglvv
Collaborator Author

tinglvv commented Nov 8, 2024

@pytorchbot rebase

@tinglvv tinglvv marked this pull request as ready for review November 8, 2024 21:55
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/138899/head returned non-zero exit code 1

Rebasing (1/16)
Rebasing (2/16)
Rebasing (3/16)
Rebasing (4/16)
Rebasing (5/16)
Rebasing (6/16)
Auto-merging .github/workflows/generated-linux-binary-conda-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-main.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-main.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-windows-binary-conda-nightly.yml
Auto-merging .github/workflows/generated-windows-binary-libtorch-debug-main.yml
Auto-merging .github/workflows/generated-windows-binary-libtorch-debug-nightly.yml
Auto-merging .github/workflows/generated-windows-binary-libtorch-release-main.yml
Auto-merging .github/workflows/generated-windows-binary-libtorch-release-nightly.yml
Auto-merging .github/workflows/generated-windows-binary-wheel-nightly.yml
error: could not apply 991a7019318... remove 12.1 from LINUX_BINARY_SMOKE_WORKFLOWS
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 991a7019318... remove 12.1 from LINUX_BINARY_SMOKE_WORKFLOWS

Raised by https://github.com/pytorch/pytorch/actions/runs/11750112212

tinglvv and others added 4 commits November 8, 2024 15:51
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
@tinglvv tinglvv changed the title from "Add CUDA 12.6 to Binaries Matrix" to "Add CUDA 12.6 Linux Builds to Binaries Matrix" Nov 12, 2024
"nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
Collaborator

We seem to be bumping cusparselt here as well. Watch for unit test failures that https://hud.pytorch.org/pytorch/pytorch/pull/138175 is currently facing.
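
To make bumps like the cusparselt one easier to spot, the pipe-separated requirement strings quoted in the diff excerpt can be split into name, pinned version, and environment marker. The following is a minimal sketch, assuming entries are joined by ` | ` and pinned with `==` as in the excerpt; `REQUIREMENTS_BLOB` and `split_requirements` are hypothetical names and not part of the PyTorch scripts.

```python
# Two entries copied from the diff excerpt, joined with the same " | " separator.
REQUIREMENTS_BLOB = (
    "nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
)


def split_requirements(blob):
    """Split a ' | '-joined blob into (name, version, marker) tuples."""
    parsed = []
    for entry in blob.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing separator
        spec, _, marker = entry.partition(";")
        name, _, version = spec.strip().partition("==")
        parsed.append((name, version, marker.strip()))
    return parsed


for name, version, _marker in split_requirements(REQUIREMENTS_BLOB):
    print(f"{name} -> {version}")
```

Comparing the parsed tuples from two revisions of the string would show exactly which pins changed, which is the kind of check the review comment above asks for.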

)
# Special build building to use on Colab. Python 3.11 for 12.1 CUDA
if python_version == "3.11" and arch_version == "12.1":
# Special build building to use on Colab. Python 3.11 for 12.4 CUDA
Collaborator

This seems to be dependent on what Colab's support matrix is, e.g. does it support CUDA 12.4?
It may well support it, but it would be good to double-check.
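
If the Colab target does need to change again, keeping the Python/CUDA pair in named constants limits the edit to one place. This is a hedged sketch only: `COLAB_PYTHON_VERSION`, `COLAB_CUDA_VERSION`, and `is_colab_special_build` are hypothetical names, and the `"12.4"` value is the one proposed in the diff above, not a confirmed statement about what Colab supports.

```python
# Assumption: values taken from the diff under review; confirm the CUDA
# version against Colab's support matrix before relying on it.
COLAB_PYTHON_VERSION = "3.11"
COLAB_CUDA_VERSION = "12.4"


def is_colab_special_build(python_version, arch_version):
    """True only for the single Python/CUDA pair that gets the Colab-specific build."""
    return (
        python_version == COLAB_PYTHON_VERSION
        and arch_version == COLAB_CUDA_VERSION
    )
```

With this shape, the matrix generator would call `is_colab_special_build(python_version, arch_version)` instead of comparing against string literals inline.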

@atalman
Contributor

atalman commented Nov 12, 2024

@pytorchmergebot merge -f "lint failure is expected"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

atalman added a commit to atalman/pytorch that referenced this pull request Nov 12, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 12, 2024
Fixes Lint after: #138899
Due to a land race.
Run ``./regenerate.sh``
Pull Request resolved: #140446
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request Nov 21, 2024
Summary:
X-link: facebookresearch/FBGEMM#486

- Upgrade gcc version to support newer libstdc++, which is required now that
pytorch/pytorch#141035 has landed

- Deprecate support for CUDA 12.1 and add support for 12.6, per changes in
pytorch/pytorch#138899

Pull Request resolved: #3398

Reviewed By: sryap

Differential Revision: D66277492

Pulled By: q10

fbshipit-source-id: 24817efb5c07c1985ab3beeb1610879edbd81acc
@johnnynunez
Contributor

Which version is it in the end: 12.6, 12.6.2, or 12.6.3?
At CES 2025, RTX 50, RTX mobile, and maybe NVIDIA Arm hardware will be released, so CUDA 12.7 is expected this month (December), with 12.8 arriving alongside the new hardware.

@tinglvv
Collaborator Author

tinglvv commented Dec 3, 2024

Hi @johnnynunez

Which version is it in the end: 12.6, 12.6.2, or 12.6.3? At CES 2025, RTX 50, RTX mobile, and maybe NVIDIA Arm hardware will be released, so CUDA 12.7 is expected this month (December), with 12.8 arriving alongside the new hardware.

For the x86 nightly build, it is 12.6.3 now (#141433). For Windows builds, it is 12.6.2, as the Windows AMI takes time to build and may not make it before the 2.6.0 code freeze. cc @atalman

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Related to pytorch#138440

Issue tracker: pytorch#138609

Version based on https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Pull Request resolved: pytorch#138899
Approved by: https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Labels

ciflow/binaries (Trigger all binary build and upload jobs on the PR), Merged, open source, skip-pr-sanity-checks, topic: not user facing (topic category)

10 participants