CUDA not found in NVIDIA runners #153760

@wdvr

Current Status

Mitigated. Some jobs will have failed and need to be restarted. Any job started after 5/16 2:20pm PT should have the correct runtime.

Error looks like

Jobs fail with: No CUDA runtime is found
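
One way to find the jobs that need a restart is to scan their logs for this error string. This is a minimal sketch, assuming the job log is available as plain text; `needs_restart` is a hypothetical helper for illustration, not part of the PyTorch CI tooling:

```python
# Error string reported in this incident.
CUDA_ERROR = "No CUDA runtime is found"


def needs_restart(log_text: str) -> bool:
    """Return True if any line of the job log contains the CUDA runtime error."""
    return any(CUDA_ERROR in line for line in log_text.splitlines())
```

A job whose log matches may still have failed for other reasons, so this only flags candidates for a restart.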

Incident timeline (all times pacific)

Include when the incident began, when it was detected, mitigated, root caused, and finally closed.

started: 5/16 7:15am PT
detected: 5/16 11:48am PT
resolved: 5/16 2:20pm PT
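
The detection and resolution delays follow directly from these timestamps. Below is a small sketch of the arithmetic; it also includes a hypothetical `in_incident_window` helper that assumes a job is affected if it started between the "started" and "resolved" times, which is an approximation rather than something the report states:

```python
from datetime import datetime

# The report omits the year; it does not affect the differences computed here.
FMT = "%m/%d %I:%M%p"  # e.g. "5/16 7:15am", all times Pacific

started = datetime.strptime("5/16 7:15am", FMT)
detected = datetime.strptime("5/16 11:48am", FMT)
resolved = datetime.strptime("5/16 2:20pm", FMT)

time_to_detect = detected - started
time_to_resolve = resolved - started


def in_incident_window(job_start: str) -> bool:
    """Assumption: a job is potentially affected if it started in [started, resolved)."""
    t = datetime.strptime(job_start, FMT)
    return started <= t < resolved


print(time_to_detect)   # 4:33:00
print(time_to_resolve)  # 7:05:00
```

So the incident ran for about 4.5 hours before detection and about 7 hours in total.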

Root cause

An upgrade of nvidia-container-toolkit.

Mitigation

We pinned the version of nvidia-container-toolkit - see pytorch/test-infra#6637

follow ups:

cc @seemethere @malfet @pytorch/pytorch-dev-infra

    Labels

    ci: sev (critical failure affecting PyTorch CI)
    module: ci (Related to continuous integration)
    module: regression (It used to work, and now it doesn't)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

    Projects

    Status
    Done
