Automatically rerun tests with CUDA_LAUNCH_BLOCKING=1 when they fail with CUDA errors in CI #49023

@ezyang

CUDA errors are reported asynchronously, so they may surface several calls after the real error site. This makes failures hard to debug in CI when you can't reproduce them locally. One way to make debugging easier is to (1) synchronize at the end of each test, so that an error is attributed to the test that caused it, and (2) rerun the failing test with CUDA_LAUNCH_BLOCKING=1, so that you can find out exactly which CUDA call triggered the assert.
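
A rough sketch of what step (2) could look like, assuming each test can be launched as its own subprocess. The helper names (`looks_like_cuda_error`, `run_with_blocking_rerun`) and the marker strings are hypothetical and only for illustration; this is not how the PyTorch CI is actually wired up.

```python
import os
import subprocess
import sys

# Hypothetical substrings used to guess that a failure came from CUDA.
# A real runner would probably need a more careful check.
CUDA_ERROR_MARKERS = ("CUDA error", "device-side assert", "cudaError")


def looks_like_cuda_error(output: str) -> bool:
    """Return True if the captured test output mentions a CUDA failure."""
    return any(marker in output for marker in CUDA_ERROR_MARKERS)


def run_with_blocking_rerun(cmd, env=None):
    """Run cmd once; on a CUDA-looking failure, rerun with CUDA_LAUNCH_BLOCKING=1.

    Returns (first_result, rerun_result). rerun_result is None when the first
    run passed or failed for a reason that does not look CUDA-related.
    """
    base_env = dict(os.environ if env is None else env)
    first = subprocess.run(cmd, env=base_env, capture_output=True, text=True)
    if first.returncode == 0:
        return first, None
    if not looks_like_cuda_error(first.stdout + first.stderr):
        return first, None

    # Kernel launches become synchronous, so the traceback of the rerun
    # should point at the call that actually failed.
    rerun_env = dict(base_env, CUDA_LAUNCH_BLOCKING="1")
    rerun = subprocess.run(cmd, env=rerun_env, capture_output=True, text=True)
    return first, rerun
```

A CI wrapper could call `run_with_blocking_rerun([sys.executable, "-m", "pytest", ...])` for a single failing test and print the rerun's output next to the original failure. For step (1), calling `torch.cuda.synchronize()` in the test teardown is one way to force pending errors to surface before the next test starts.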

cc @ngimel @mruberry @VitalyFedyunin @walterddr

Labels: module: cuda, module: tests, triaged
