[ROCm][NCCL watchdog] Cross-thread stream-capture mode restrictions in hipEventQuery/hipEventSynchronize cause false watchdog failures #177309

@chinmaydk99

🐛 Describe the bug

ProcessGroupNCCL watchdog polling can fail on ROCm when CUDA/HIP graph capture is active on another thread, because of stream-capture mode restrictions in hipEventQuery / hipEventSynchronize.

In this path, watchdog polling relies on event query/sync calls issued from a side thread. On affected HIP runtimes, those calls can return capture-related errors (e.g., hipErrorStreamCaptureUnsupported) during cross-thread capture windows, even when the watchdog thread's capture mode is set to ThreadLocal/Relaxed. This leads to false watchdog failures/timeouts unless the framework adds a conservative workaround.

This is the runtime issue that forced the PyTorch workaround in ProcessGroupNCCL (skip watchdog event query during active capture and defer timeout checks in that window).
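
The workaround described above can be sketched roughly as follows. This is a minimal model, not the actual ProcessGroupNCCL code: `poll_work`, `WorkState`, `PollResult`, and the `query_event` callback (standing in for a call such as hipEventQuery) are hypothetical names, and how the caller determines `capture_active` is left open.

```cpp
#include <chrono>
#include <functional>

// Hypothetical names throughout; a model of the idea, not PyTorch code.
using Clock = std::chrono::steady_clock;

enum class PollResult { Completed, Pending, TimedOut, SkippedDuringCapture };

struct WorkState {
    Clock::time_point deadline;  // point at which the watchdog would report a timeout
};

// query_event stands in for an event query (e.g. hipEventQuery) and
// returns true once the event has completed.
PollResult poll_work(const WorkState& work,
                     bool capture_active,
                     Clock::time_point now,
                     const std::function<bool()>& query_event) {
    if (capture_active) {
        // Assumed workaround: issue no event query and make no timeout
        // decision while a capture may be in progress on another thread.
        return PollResult::SkippedDuringCapture;
    }
    if (query_event()) {
        return PollResult::Completed;
    }
    // The timeout is only evaluated outside the capture window.
    return now >= work.deadline ? PollResult::TimedOut : PollResult::Pending;
}
```

In this model a genuinely stuck collective is still reported, but only on the first poll after the capture window closes, which is the cost of the conservative skip.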

Minimal repro shape

  1. Thread A starts stream capture in GLOBAL mode and keeps the capture active.
  2. Thread B (a watchdog-like thread) calls hipEventQuery / hipEventSynchronize on an event associated with collective work.
  3. Observe that these calls return capture-related errors in scenarios where the cross-thread call should not fail this way.

Observed behavior

  • Watchdog-side event polling is not reliably safe during cross-thread capture windows.
  • The runtime can return capture restriction errors and trigger false failure handling in the framework.

Expected behavior

  • Cross-thread capture interaction should follow the documented stream-capture mode semantics.
  • Framework watchdog polling should not require extra conservative skipping logic to avoid runtime-induced failures.

Notes

HIP/CUDA APIs expose stream-scoped capture status (StreamIsCapturing, StreamGetCaptureInfo[_v2]) but not a process-wide "is any capture active" query, so framework-side mitigation is necessarily best-effort.
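
One way a framework might approximate the missing process-wide query is to count the captures it starts itself. The sketch below is an assumption for illustration only: `CaptureTracker` and its `Guard` are hypothetical names, not HIP or PyTorch APIs. It also shows why such mitigation stays best-effort: captures begun outside the framework are invisible to the counter, and a capture can start between the check and the subsequent event query.

```cpp
#include <atomic>

// Hypothetical framework-side bookkeeping; not a HIP or PyTorch API.
class CaptureTracker {
public:
    // RAII guard: construct where the framework begins a stream capture,
    // destroy where that capture ends.
    class Guard {
    public:
        explicit Guard(CaptureTracker& tracker) : tracker_(tracker) {
            tracker_.active_.fetch_add(1, std::memory_order_acq_rel);
        }
        ~Guard() {
            tracker_.active_.fetch_sub(1, std::memory_order_acq_rel);
        }
        Guard(const Guard&) = delete;
        Guard& operator=(const Guard&) = delete;

    private:
        CaptureTracker& tracker_;
    };

    // True while at least one framework-initiated capture is open.
    // Captures started by other libraries are not counted.
    bool any_capture_active() const {
        return active_.load(std::memory_order_acquire) > 0;
    }

private:
    std::atomic<int> active_{0};
};
```

A watchdog thread could consult `any_capture_active()` before each poll and skip the event query when it returns true.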

Versions

Reproduced on ROCm/HIP runtime 7.2.26015 with PyTorch 2.12.0a0+gitcb798d7

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang

Labels: bot-triaged, module: rocm, oncall: distributed
