
GPU unit tests hang #6320

@vanbasten23

Description

🐛 Bug

After running a unit test (either a test suite or a single test), the terminal hangs, and pressing Ctrl-C does not interrupt it.

The only way out is to press Ctrl-Z and then manually kill the process running the test.

To Reproduce

On HEAD (01/18), the following command reproduces the hang:

root@xiowei-gpu:/ansible# GPU_NUM_DEVICES=4 PJRT_DEVICE=CUDA python pytorch/xla/test/pjrt/test_runtime_gpu.py TestExperimentalPjrtGpu.test_multi_gpu_devices
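To re-run the reproduction repeatedly without having to kill the process by hand, a minimal sketch (not part of the original report) is to launch the same command from Python with a timeout. It assumes a POSIX host. The helper name `run_with_timeout` and the 120-second limit are hypothetical choices. The sketch kills the whole process group rather than just the top-level process, on the assumption that the hang may involve worker processes spawned by the multi-GPU test.

```python
import os
import signal
import subprocess
import sys

# Reproduction command from this issue; the relative path assumes the
# working directory is /ansible, as in the original report.
REPRO_CMD = [
    sys.executable,
    "pytorch/xla/test/pjrt/test_runtime_gpu.py",
    "TestExperimentalPjrtGpu.test_multi_gpu_devices",
]
REPRO_ENV = {"GPU_NUM_DEVICES": "4", "PJRT_DEVICE": "CUDA"}


def run_with_timeout(cmd, extra_env=None, timeout_s=120.0):
    """Run cmd and return (returncode, timed_out).

    Hypothetical helper. The child is started in a new session, so it leads
    its own process group. On timeout, SIGKILL is sent to that whole group,
    which automates the manual Ctrl-Z + kill workaround.
    """
    env = dict(os.environ, **(extra_env or {}))
    proc = subprocess.Popen(cmd, env=env, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        try:
            # With start_new_session=True, the process group ID equals proc.pid.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # The group already exited between the timeout and the kill.
        proc.wait()
        return proc.returncode, True


if __name__ == "__main__":
    code, hung = run_with_timeout(REPRO_CMD, REPRO_ENV, timeout_s=120.0)
    print("HUNG (process group killed)" if hung else f"exited with code {code}")
```

A run that finishes reports its exit code, as in the expected output below. A run that hangs is killed after the timeout and reported as hung, with a return code of `-9` (SIGKILL).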

Expected behavior

The test should not hang. Previously, it completed and returned to the prompt:

root@xiowei-gpu:/ansible# GPU_NUM_DEVICES=4  PJRT_DEVICE=CUDA python pytorch/xla/test/pjrt/test_runtime_gpu.py TestExperimentalPjrtGpu.test_multi_gpu_devices
Running tests under Python 3.8.18: /usr/local/bin/python
[ RUN      ] TestExperimentalPjrtGpu.test_multi_gpu_devices
[       OK ] TestExperimentalPjrtGpu.test_multi_gpu_devices
----------------------------------------------------------------------
Ran 1 test in 4.433s

OK
root@xiowei-gpu:/ansible#

Environment

  • Reproducible on XLA backend [CPU/TPU]: GPU
  • torch_xla version: nightly

Additional context
