
Add CUDA 11.8 CI workflows#92137

Closed
ptrblck wants to merge 5 commits intopytorch:masterfrom
ptrblck:cu118_ci

Conversation

@ptrblck
Collaborator

@ptrblck ptrblck commented Jan 13, 2023

Fixes #92090
CC @atalman

@ptrblck ptrblck requested review from a team and jeffdaily as code owners January 13, 2023 06:36
@pytorch-bot pytorch-bot bot added the topic: not user facing label (topic category) Jan 13, 2023

@matifali

matifali commented Jan 13, 2023

CUDA 12.0 is also released. Will we see a build for CUDA 12?

@atalman
Contributor

atalman commented Jan 13, 2023

@ptrblck a change is required here: https://github.com/pytorch/pytorch/blob/master/.github/workflows/docker-builds.yml#L36 in order to add these configs to the workflow
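As a rough illustration (not the exact diff from this PR), adding the CUDA 11.8 image to the docker-builds.yml matrix would mean a new entry alongside the existing ones; the image name below is an assumption modeled on the CUDA 11.7 image naming convention:

```yaml
# .github/workflows/docker-builds.yml (sketch only; the 11.8 image name
# is assumed by analogy with the existing 11.7 entry)
docker-image-name:
  - pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7
  - pytorch-linux-bionic-cuda11.8-cudnn8-py3-gcc7
```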

@atalman atalman changed the title [WIP] add CUDA 11.8 CI workflows Add CUDA 11.8 CI workflows Jan 17, 2023
Contributor

@atalman atalman left a comment

LGTM

Contributor

@atalman atalman left a comment

More changes are required for 11.8: we need to add a CUDA 11.8 entry similar to the 11.7 one here: https://github.com/pytorch/pytorch/blob/master/.github/workflows/trunk.yml#L71
I think we need a follow-up PR for this once this one with the docker changes is merged.
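A hypothetical sketch of what that follow-up trunk.yml change could look like, modeled on the existing CUDA 11.7 build job (the job name, build environment, and image name here are assumptions, not the actual follow-up diff):

```yaml
# .github/workflows/trunk.yml (sketch; names assumed by analogy
# with the existing cuda11.7 job referenced above)
linux-bionic-cuda11_8-py3_10-gcc7-build:
  name: linux-bionic-cuda11.8-py3.10-gcc7
  uses: ./.github/workflows/_linux-build.yml
  with:
    build-environment: linux-bionic-cuda11.8-py3.10-gcc7
    docker-image-name: pytorch-linux-bionic-cuda11.8-cudnn8-py3-gcc7
```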

@atalman atalman added the ciflow/periodic label (trigger jobs run periodically on master via periodic.yml on the PR) Jan 17, 2023
@soulitzer soulitzer added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jan 20, 2023
@atalman
Contributor

atalman commented Jan 23, 2023

Please note the CUDA 11.8 failure was addressed here: #92264

Contributor

@atalman atalman left a comment

LGTM

@atalman
Contributor

atalman commented Jan 23, 2023

@pytorchbot merge -f "Failures for 11.8 were resolved"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Comment on lines +151 to +157
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 3, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 3, runner: "windows.g5.4xlarge.nvidia.gpu" },
{ config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge" },
]}
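For context on the shard/num_shards fields in the test-matrix above: each runner executes one of num_shards slices of the test list. A minimal, hypothetical sketch of that kind of partitioning (this is not PyTorch's actual sharding code, which also balances shards by recorded test runtimes):

```python
def assign_shard(tests, shard, num_shards):
    """Return the slice of `tests` owned by the 1-indexed `shard`
    out of `num_shards`.

    Simple round-robin assignment: test i goes to shard (i % num_shards) + 1,
    so the union of all shards covers every test exactly once.
    """
    return [t for i, t in enumerate(tests) if i % num_shards == shard - 1]

# Hypothetical test files, split across 3 shards as in the matrix above.
tests = ["test_nn", "test_ops", "test_autograd", "test_jit", "test_cuda"]
shards = [assign_shard(tests, s, 3) for s in (1, 2, 3)]
```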
Contributor

@ZainRizvi ZainRizvi Jan 24, 2023

All three Windows test shards are failing, with errors like the following. Can you please take a look? CUDA doesn't seem to be installed correctly on the Windows boxes.

I think we'll need to disable these tests for now since they're failing consistently on periodic

2023-01-24T03:25:03.2580503Z ERROR (0.006s)
2023-01-24T03:25:03.2580979Z   test_multihead_attention_dtype_batch_first_cuda_float16 (__main__.TestMultiheadAttentionNNDeviceTypeCUDA) ...     test_multihead_attention_dtype_batch_first_cuda_float16 errored - num_retries_left: 2
2023-01-24T03:25:03.2581378Z Traceback (most recent call last):
2023-01-24T03:25:03.2581890Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 414, in instantiated_test
2023-01-24T03:25:03.2582234Z     raise rte
2023-01-24T03:25:03.2582586Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 401, in instantiated_test
2023-01-24T03:25:03.2582948Z     result = test(self, **param_kwargs)
2023-01-24T03:25:03.2583378Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 1010, in only_fn
2023-01-24T03:25:03.2583719Z     return fn(slf, *args, **kwargs)
2023-01-24T03:25:03.2584077Z   File "C:\actions-runner\_work\pytorch\pytorch\test\nn\test_multihead_attention.py", line 640, in test_multihead_attention_dtype_batch_first
2023-01-24T03:25:03.2584496Z     model = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True).cuda().to(dtype)
2023-01-24T03:25:03.2584961Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 1132, in to
2023-01-24T03:25:03.2585289Z     return self._apply(convert)
2023-01-24T03:25:03.2585661Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 784, in _apply
2023-01-24T03:25:03.2586057Z     module._apply(fn)
2023-01-24T03:25:03.2586468Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 807, in _apply
2023-01-24T03:25:03.2586789Z     param_applied = fn(param)
2023-01-24T03:25:03.2587196Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\nn\modules\module.py", line 1130, in convert
2023-01-24T03:25:03.2587602Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-01-24T03:25:03.2587932Z RuntimeError: CUDA error: no kernel image is available for execution on the device
2023-01-24T03:25:03.2588367Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2023-01-24T03:25:03.2588716Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2023-01-24T03:25:03.2589013Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898
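The "no kernel image is available" error typically means the binary was not compiled for the GPU's compute capability (the g5 runners use A10G GPUs, i.e. sm_86). A minimal, hypothetical pure-Python sketch of that compatibility check; in a real debugging session you would feed it torch.cuda.get_arch_list() and torch.cuda.get_device_capability():

```python
def has_kernel_for(arch_list, capability):
    """Check whether a build targeting `arch_list` (entries like "sm_86"
    or "compute_80", as reported by torch.cuda.get_arch_list()) can run
    on a GPU with the given (major, minor) compute capability.

    A device can run SASS compiled exactly for its architecture, or PTX
    ("compute_XX") for an equal-or-older architecture via JIT compilation.
    Simplified sketch; real binary compatibility rules have more cases.
    """
    major, minor = capability
    dev = major * 10 + minor
    for arch in arch_list:
        kind, _, num = arch.partition("_")   # e.g. "sm_86" -> ("sm", "_", "86")
        num = int(num)
        if kind == "sm" and num == dev:
            return True                      # exact binary kernel present
        if kind == "compute" and num <= dev:
            return True                      # PTX can be JIT-compiled forward
    return False

# Example: a build that omitted sm_86 cannot serve an A10G, capability (8, 6).
print(has_kernel_for(["sm_70", "sm_75", "sm_90"], (8, 6)))  # -> False
```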


@ZainRizvi looking into this issue, was there an AMI update recently? I don't see any failures on https://github.com/pytorch/pytorch/actions/runs/3961152467/jobs/6788616724

pytorchmergebot pushed a commit that referenced this pull request Jan 24, 2023
These periodic tests were introduced in #92137

They've been consistently failing on trunk, so disabling them until they're fixed. Sample failures: https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898
Pull Request resolved: #92902
Approved by: https://github.com/malfet

Labels

ciflow/periodic: Trigger jobs run periodically on master (periodic.yml) on the PR
Merged
open source
topic: not user facing (topic category)
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cuda 11.8 add to CI. Cuda 11.6 deprecation

7 participants