CUDA not found in NVIDIA runners #153760
Closed
Labels
ci: sev (critical failure affecting PyTorch CI)
module: ci (related to continuous integration)
module: regression (it used to work, and now it doesn't)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Current Status
Mitigated. Some jobs will have failed and need to be restarted. Any job started after 5/16 2:20pm PT should have the correct runtime.
Error looks like
Job failures with:
No CUDA runtime is found
Incident timeline (all times Pacific)
Include when the incident began, when it was detected, mitigated, root caused, and finally closed.
started: 5/16 7:15am PT
detected: 5/16 11:48am PT
resolved: 5/16 2:20pm PT
Root cause
An upgrade of nvidia-container-toolkit.
Mitigation
We pinned the version of nvidia-container-toolkit - see pytorch/test-infra#6637
Follow ups:
When installing nvidia drivers, run:
docker run --rm -t --gpus=all python:3.11 nvidia-smi
cc @seemethere @malfet @pytorch/pytorch-dev-infra
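The follow-up check above could be wrapped in a small script so that a runner image fails fast when containers cannot see the GPU. The following is a minimal sketch, not the actual test-infra implementation: the names `SMOKE_TEST_CMD`, `run_command`, and `gpu_visible_in_container` are hypothetical, and it assumes that a non-zero exit code from the `docker run` command means the GPU is not usable from inside a container.

```python
import subprocess
from typing import Callable, Sequence

# The exact command quoted in the follow-up; nothing else is assumed about the runner.
SMOKE_TEST_CMD = [
    "docker", "run", "--rm", "-t", "--gpus=all", "python:3.11", "nvidia-smi",
]


def run_command(cmd: Sequence[str]) -> int:
    """Run cmd and return its exit code (127 if the executable is missing)."""
    try:
        return subprocess.run(list(cmd), check=False).returncode
    except FileNotFoundError:
        # docker itself is not installed or not on PATH
        return 127


def gpu_visible_in_container(
    run: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Return True if the smoke-test command exits with status 0.

    The runner is injectable so the check can be exercised without docker.
    """
    return run(SMOKE_TEST_CMD) == 0
```

A driver-install step might call `gpu_visible_in_container()` and abort the image build when it returns `False`, which would have surfaced this regression before any CI job ran.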