
Add CUDA 11.8 CI workflows#92137

Closed
ptrblck wants to merge 5 commits intopytorch:masterfrom
ptrblck:cu118_ci

Conversation

@ptrblck
Collaborator

@ptrblck ptrblck commented Jan 13, 2023

Fixes #92090
CC @atalman

@ptrblck ptrblck requested review from a team and jeffdaily as code owners January 13, 2023 06:36
@pytorch-bot pytorch-bot bot added the topic: not user facing label (topic category) Jan 13, 2023

@matifali

matifali commented Jan 13, 2023

CUDA 12.0 is also released. Will we see a build for CUDA 12?

@atalman
Contributor

atalman commented Jan 13, 2023

@ptrblck a change is required here: https://github.com/pytorch/pytorch/blob/master/.github/workflows/docker-builds.yml#L36 in order to add these configs to the workflow
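As a rough illustration (not the exact diff from this PR), adding the CUDA 11.8 image to the docker-builds.yml matrix would mean a new entry alongside the existing ones; the image name below is an assumption modeled on the CUDA 11.7 image naming convention:

```yaml
# .github/workflows/docker-builds.yml (sketch only; the 11.8 image name
# is assumed by analogy with the existing 11.7 entry)
docker-image-name:
  - pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7
  - pytorch-linux-bionic-cuda11.8-cudnn8-py3-gcc7
```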

@atalman atalman changed the title [WIP] add CUDA 11.8 CI workflows Add CUDA 11.8 CI workflows Jan 17, 2023
Contributor

@atalman atalman left a comment

LGTM

Contributor

@atalman atalman left a comment

More changes are required for 11.8: we need to add a CUDA 11.8 entry similar to the 11.7 one here: https://github.com/pytorch/pytorch/blob/master/.github/workflows/trunk.yml#L71
I think we need a follow-up PR for this once this one with the docker changes is merged.
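A hypothetical sketch of what that follow-up trunk.yml change could look like, modeled on the existing CUDA 11.7 build job (the job name, build environment, and image name here are assumptions, not the actual follow-up diff):

```yaml
# .github/workflows/trunk.yml (sketch; names assumed by analogy
# with the existing cuda11.7 job referenced above)
linux-bionic-cuda11_8-py3_10-gcc7-build:
  name: linux-bionic-cuda11.8-py3.10-gcc7
  uses: ./.github/workflows/_linux-build.yml
  with:
    build-environment: linux-bionic-cuda11.8-py3.10-gcc7
    docker-image-name: pytorch-linux-bionic-cuda11.8-cudnn8-py3-gcc7
```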

@atalman atalman added the ciflow/periodic label (trigger jobs run periodically on master via periodic.yml on the PR) Jan 17, 2023
@soulitzer soulitzer added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 20, 2023
@atalman
Contributor

atalman commented Jan 23, 2023

Please note the CUDA 11.8 failure was addressed here: #92264

Contributor

@atalman atalman left a comment

LGTM

@atalman
Contributor

atalman commented Jan 23, 2023

@pytorchbot merge -f "Failures for 11.8 were resolved"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Comment on lines +151 to +157
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 3, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 3, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" },
]}
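For context on the shard/num_shards fields in the test-matrix above: each runner executes one of num_shards slices of the test list. A minimal, hypothetical sketch of that kind of partitioning (this is not PyTorch's actual sharding code, which also balances shards by recorded test runtimes):

```python
def assign_shard(tests, shard, num_shards):
    """Return the slice of `tests` owned by the 1-indexed `shard`
    out of `num_shards`.

    Simple round-robin assignment: test i goes to shard (i % num_shards) + 1,
    so the union of all shards covers every test exactly once.
    """
    return [t for i, t in enumerate(tests) if i % num_shards == shard - 1]

# Hypothetical test files, split across 3 shards as in the matrix above.
tests = ["test_nn", "test_ops", "test_autograd", "test_jit", "test_cuda"]
shards = [assign_shard(tests, s, 3) for s in (1, 2, 3)]
```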
Contributor

@ZainRizvi ZainRizvi Jan 24, 2023

All three Windows test shards are failing, with errors like the following. Can you please take a look? CUDA doesn't seem to be installed correctly on the Windows boxes.

I think we'll need to disable these tests for now since they're failing consistently on periodic

2023-01-24T03:25:03.2580503Z ERROR (0.006s)
2023-01-24T03:25:03.2580979Z   test_multihead_attention_dtype_batch_first_cuda_float16 (__main__.TestMultiheadAttentionNNDeviceTypeCUDA) ...     test_multihead_attention_dtype_batch_first_cuda_float16 errored - num_retries_left: 2
2023-01-24T03:25:03.2581378Z Traceback (most recent call last):
2023-01-24T03:25:03.2581890Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 414, in instantiated_test
2023-01-24T03:25:03.2582234Z     raise rte
2023-01-24T03:25:03.2582586Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 401, in instantiated_test
2023-01-24T03:25:03.2582948Z     result = test(self, **param_kwargs)
2023-01-24T03:25:03.2583378Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 1010, in only_fn
2023-01-24T03:25:03.2583719Z     return fn(slf, *args, **kwargs)
2023-01-24T03:25:03.2584077Z   File "C:\actions-runner\_work\pytorch\pytorch\test\nn\test_multihead_attention.py", line 640, in test_multihead_attention_dtype_batch_first
2023-01-24T03:25:03.2584496Z     model = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True).cuda().to(dtype)
2023-01-24T03:25:03.2584961Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 1132, in to
2023-01-24T03:25:03.2585289Z     return self._apply(convert)
2023-01-24T03:25:03.2585661Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 784, in _apply
2023-01-24T03:25:03.2586057Z     module._apply(fn)
2023-01-24T03:25:03.2586468Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 807, in _apply
2023-01-24T03:25:03.2586789Z     param_applied = fn(param)
2023-01-24T03:25:03.2587196Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 1130, in convert
2023-01-24T03:25:03.2587602Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-01-24T03:25:03.2587932Z RuntimeError: CUDA error: no kernel image is available for execution on the device
2023-01-24T03:25:03.2588367Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2023-01-24T03:25:03.2588716Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2023-01-24T03:25:03.2589013Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898
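The "no kernel image is available" error typically means the binary was not compiled for the GPU's compute capability (the g5 runners use A10G GPUs, i.e. sm_86). A minimal, hypothetical pure-Python sketch of that compatibility check; in a real debugging session you would feed it torch.cuda.get_arch_list() and torch.cuda.get_device_capability():

```python
def has_kernel_for(arch_list, capability):
    """Check whether a build targeting `arch_list` (entries like "sm_86"
    or "compute_80", as reported by torch.cuda.get_arch_list()) can run
    on a GPU with the given (major, minor) compute capability.

    A device can run SASS compiled exactly for its architecture, or PTX
    ("compute_XX") for an equal-or-older architecture via JIT compilation.
    Simplified sketch; real binary compatibility rules have more cases.
    """
    major, minor = capability
    dev = major * 10 + minor
    for arch in arch_list:
        kind, _, num = arch.partition("_")   # e.g. "sm_86" -> ("sm", "_", "86")
        num = int(num)
        if kind == "sm" and num == dev:
            return True                      # exact binary kernel present
        if kind == "compute" and num <= dev:
            return True                      # PTX can be JIT-compiled forward
    return False

# Example: a build that omitted sm_86 cannot serve an A10G, capability (8, 6).
print(has_kernel_for(["sm_70", "sm_75", "sm_90"], (8, 6)))  # -> False
```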


@ZainRizvi looking into this issue, was there an AMI update recently? I don't see any failures on https://github.com/pytorch/pytorch/actions/runs/3961152467/jobs/6788616724

pytorchmergebot pushed a commit that referenced this pull request Jan 24, 2023
These periodic tests were introduced in #92137

They've been consistently failing on trunk, so disabling them until they're fixed. Sample failures: https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898
Pull Request resolved: #92902
Approved by: https://github.com/malfet

Labels

ciflow/periodic: Trigger jobs run periodically on master (periodic.yml) on the PR
Merged
open source
topic: not user facing (topic category)
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cuda 11.8 add to CI. Cuda 11.6 deprecation

7 participants