CUDA not found in NVIDIA runners #153760


## Current Status
Mitigated. Some jobs will have failed and need to be restarted. Any job started after 5/16 2:20pm PT should have the correct runtime.

## Error looks like
Jobs fail with: `No CUDA runtime is found`
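
To decide which jobs need a restart, one option is to scan job logs for this signature. The following is a minimal sketch, assuming the logs are available as plain text; the helper name `has_cuda_runtime_error` is hypothetical and not part of any existing tooling.

```python
# Hypothetical helper: flags a job log that contains the failure signature
# quoted in this report.
CUDA_ERROR_SIGNATURE = "No CUDA runtime is found"


def has_cuda_runtime_error(log_text: str) -> bool:
    """Return True if the log text contains the incident's error signature."""
    return CUDA_ERROR_SIGNATURE in log_text
```

A job whose log matches and that ran during the incident window is a likely candidate for a restart; a match outside the window may have a different cause.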

## Incident timeline (all times Pacific)

started: 5/16 7:15am PT
detected: 5/16 11:48am PT
resolved: 5/16 2:20pm PT
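
The timestamps above can be turned into a simple check for whether a job started inside the affected window. This is a sketch under stated assumptions: the report does not give the year, so the caller must supply it, and Pacific time is modeled as a fixed UTC-7 offset (Pacific Daylight Time, which applies in mid-May). The function name `in_incident_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Pacific Daylight Time (UTC-7), which is in effect on May 16.
PDT = timezone(timedelta(hours=-7))


def in_incident_window(job_start: datetime, year: int) -> bool:
    """Return True if a timezone-aware start time falls inside the window.

    The window runs from the reported start (5/16 7:15am PT) up to, but not
    including, the reported resolution (5/16 2:20pm PT). The year is an
    assumption supplied by the caller; the report does not state it.
    Passing a naive datetime raises TypeError.
    """
    window_start = datetime(year, 5, 16, 7, 15, tzinfo=PDT)
    window_end = datetime(year, 5, 16, 14, 20, tzinfo=PDT)
    return window_start <= job_start < window_end
```

Because the comparison uses timezone-aware values, start times recorded in UTC can be passed in directly without converting them first.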



## Root cause
An upgrade of nvidia-container-toolkit broke access to the CUDA runtime from containers on the NVIDIA runners.

## Mitigation
We pinned the version of nvidia-container-toolkit; see https://github.com/pytorch/test-infra/pull/6637


## Follow-ups
- Run `docker run --rm -t --gpus=all python:3.11 nvidia-smi` as a check when installing NVIDIA drivers.
- Figure out whether we should be alerted if any GPU runner runs with 0 utilization @wdvr @yangw-dev
- Ask NVIDIA whether their dependencies can be pinned: https://github.com/NVIDIA/nvidia-container-toolkit/issues/1091
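
The first two follow-up items could be sketched roughly as below. This is an illustration, not the implemented check: the function names are hypothetical, the smoke test simply treats a non-zero exit status (or a missing `docker` binary) as failure, and the idle check assumes utilization is sampled with `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, which prints one percentage per GPU.

```python
import subprocess
from typing import Callable, List, Sequence

# Command quoted in the first follow-up item.
SMOKE_TEST_CMD = [
    "docker", "run", "--rm", "-t", "--gpus=all", "python:3.11", "nvidia-smi",
]

# Assumed sampling command; prints one utilization percentage per GPU.
UTILIZATION_QUERY = [
    "nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits",
]


def gpu_smoke_test(run: Callable[[Sequence[str]], int] = subprocess.call) -> bool:
    """Return True if the containerized nvidia-smi command exits with status 0.

    `run` is injectable so the check can be exercised without Docker.
    """
    try:
        return run(SMOKE_TEST_CMD) == 0
    except OSError:
        # Raised, for example, when the docker binary is not installed.
        return False


def parse_utilization(output: str) -> List[int]:
    """Parse one integer percentage per line, skipping non-numeric lines."""
    values = []
    for line in output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def all_idle(samples: Sequence[Sequence[int]]) -> bool:
    """Return True if there is at least one reading and every reading is 0."""
    readings = [value for sample in samples for value in sample]
    return bool(readings) and all(value == 0 for value in readings)
```

In a driver-install script, a `False` result from `gpu_smoke_test()` would abort the install. For alerting, the output of `UTILIZATION_QUERY` could be collected several times while a job runs and passed through `parse_utilization` before calling `all_idle`; how many samples to take and whether 0 is the right threshold remain open questions.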

cc @seemethere @malfet @pytorch/pytorch-dev-infra
