🐛 Describe the bug
ProcessGroupNCCL watchdog polling can fail on ROCm when CUDA/HIP graph capture is active on another thread, because hipEventQuery / hipEventSynchronize are subject to stream-capture mode restrictions.
In this path, the watchdog polls for completion by calling event query/sync from a side thread. On affected HIP runtimes, those calls can return capture-related errors (e.g., hipErrorStreamCaptureUnsupported) while another thread is capturing, even when the watchdog thread's capture mode is set to ThreadLocal or Relaxed. The result is either false watchdog failures/timeouts or a conservative framework workaround.
This is the runtime issue that forced the PyTorch workaround in ProcessGroupNCCL: skip the watchdog event query while a capture is active, and defer timeout checks for that window.
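For readers unfamiliar with the workaround, here is a minimal Python model of its shape. It is a sketch under stated assumptions, not the actual ProcessGroupNCCL code: `Work`, `poll_once`, and the `capture_active` callable are hypothetical names, and `query_event` merely stands in for a hipEventQuery call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Work:
    """Hypothetical stand-in for one collective tracked by the watchdog."""
    started_at: float                # time the collective was enqueued
    timeout_s: float                 # watchdog timeout for this collective
    query_event: Callable[[], bool]  # stand-in for hipEventQuery: True once complete


def poll_once(work: Work, capture_active: Callable[[], bool], now: float) -> str:
    """One watchdog iteration: returns 'deferred', 'done', 'timeout', or 'pending'."""
    if capture_active():
        # Assumed workaround shape: do not touch the event at all while a
        # capture is active, and do not evaluate the timeout either.
        return "deferred"
    if work.query_event():
        return "done"
    if now - work.started_at > work.timeout_s:
        return "timeout"
    return "pending"
```

The cost of this approach is visible in the first branch: a collective that is genuinely hung is not reported for as long as any capture stays active.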
Minimal repro shape
- Thread A starts stream capture in GLOBAL mode and keeps the capture active.
- Thread B (watchdog-like thread) calls hipEventQuery / hipEventSynchronize on an event associated with collective work.
- Observe capture-related errors returned to Thread B in scenarios where cross-thread behavior should not fail this way.
Observed behavior
- Watchdog-side event polling is not reliably safe during cross-thread capture windows.
- Runtime can return capture restriction errors and trigger false failure handling in the framework.
Expected behavior
- Cross-thread capture interaction should follow documented stream-capture mode semantics.
- Framework watchdog polling should not require extra conservative skipping logic to avoid runtime-induced failures.
Notes
HIP/CUDA APIs expose stream-scoped capture status (StreamIsCapturing, StreamGetCaptureInfo[_v2]) but not a process-wide "is any capture active" query, so framework-side mitigation is necessarily best-effort.
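To illustrate why such mitigation is only best-effort, the following sketch shows the kind of bookkeeping a framework has to maintain on its own when no process-wide query exists. `CaptureTracker` is a hypothetical helper, not a PyTorch or HIP API; it can only see captures that were started through it.

```python
import threading
from contextlib import contextmanager


class CaptureTracker:
    """Hypothetical process-wide counter of captures the framework started itself."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0

    @contextmanager
    def capturing(self):
        # Wrap the framework's own begin/end capture calls with this context.
        with self._lock:
            self._active += 1
        try:
            yield
        finally:
            with self._lock:
                self._active -= 1

    def any_active(self) -> bool:
        # True only for captures registered above; a capture begun by calling
        # the runtime directly (e.g. from a third-party library) is invisible.
        with self._lock:
            return self._active > 0
```

A watchdog thread consulting `any_active()` would therefore still issue event queries during any capture that bypassed the tracker, which is the gap a runtime-level fix would close.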
Versions
Reproduced on ROCm/HIP runtime 7.2.26015 with PyTorch 2.12.0a0+gitcb798d7
cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang