Automatically rerun tests with CUDA_LAUNCH_BLOCKING=1 when they fail with CUDA errors in CI #49023

CUDA kernels are launched asynchronously, so CUDA errors are reported late and may surface several calls after the real error site. This makes failures hard to debug in CI when you can't reproduce them locally. One way to make debugging easier is to (1) make sure we synchronize at the end of each test, so an error is attributed to the test that caused it, and (2) rerun the failing test with CUDA_LAUNCH_BLOCKING=1, so that you can find out exactly which CUDA call triggered the device-side assert.
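A rough sketch of what step (2) could look like, assuming each test can be run as a standalone command. The helper names (`looks_like_cuda_error`, `run_test`, `run_with_blocking_rerun`) and the error-matching substrings are hypothetical and are not existing CI code; the actual test runner integration would likely differ.

```python
import os
import subprocess
import sys

# Assumption: these substrings are a good-enough heuristic for spotting
# CUDA failures in test output. The real set may need tuning.
CUDA_ERROR_MARKERS = ("CUDA error", "device-side assert")


def looks_like_cuda_error(output):
    """Return True if the captured test output mentions a CUDA failure."""
    return any(marker in output for marker in CUDA_ERROR_MARKERS)


def run_test(cmd, extra_env=None):
    """Run one test command and return (returncode, combined stdout/stderr)."""
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def run_with_blocking_rerun(cmd):
    """Run a test; if it fails with a CUDA error, rerun it with
    CUDA_LAUNCH_BLOCKING=1 so the failure is reported at the launch
    that caused it. Returns (returncode, output, reran)."""
    returncode, output = run_test(cmd)
    if returncode != 0 and looks_like_cuda_error(output):
        returncode, output = run_test(cmd, {"CUDA_LAUNCH_BLOCKING": "1"})
        return returncode, output, True
    return returncode, output, False


if __name__ == "__main__":
    # Usage: python rerun_cuda.py python test_foo.py TestFoo.test_bar
    rc, out, reran = run_with_blocking_rerun(sys.argv[1:])
    if reran:
        print("Reran with CUDA_LAUNCH_BLOCKING=1:")
    print(out)
    sys.exit(rc)
```

For step (1), one option would be to call `torch.cuda.synchronize()` in each test's teardown, so that a pending asynchronous error fails the test that launched the kernel rather than a later one.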

cc @ngimel @mruberry @VitalyFedyunin @walterddr
