[ROCm][NCCL watchdog] Cross-thread stream-capture mode restrictions in hipEventQuery/hipEventSynchronize cause false watchdog failures #177309

### 🐛 Describe the bug

ProcessGroupNCCL watchdog polling can fail on ROCm when a CUDA/HIP graph capture is active on another thread, because of stream-capture mode restrictions in `hipEventQuery` / `hipEventSynchronize`.

In this path, the watchdog polls for work completion by calling event query/sync from a side thread. On affected HIP runtimes, those calls can return capture-related errors (e.g., `hipErrorStreamCaptureUnsupported`) while another thread is capturing, even when the watchdog thread's capture mode is set to `ThreadLocal` or `Relaxed`. The result is either false watchdog failures/timeouts or a conservative workaround on the framework side.

This runtime behavior is what forced the PyTorch workaround in ProcessGroupNCCL: skip the watchdog event query while a capture is active, and defer timeout checks for that window.
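The shape of that workaround can be sketched in a few lines. This is a pure-Python model for illustration only, not the actual ProcessGroupNCCL code: `WatchdogModel`, `capture_active`, `query_event`, and the explicit `now` clock are hypothetical names. It shows the two behaviors described above: no event query while a capture is reported active, and the interval ending in a capture window not being charged against the timeout.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WatchdogModel:
    """Toy model of a watchdog polling one piece of work (hypothetical)."""
    timeout_s: float
    capture_active: Callable[[], bool]  # best-effort "is any capture active?"
    query_event: Callable[[], bool]     # stand-in for an event query; True = done
    elapsed_s: float = 0.0              # time charged against the timeout
    last_poll: Optional[float] = None

    def poll(self, now: float) -> str:
        """Return 'deferred', 'done', 'pending', or 'timeout'."""
        delta = 0.0 if self.last_poll is None else now - self.last_poll
        self.last_poll = now
        if self.capture_active():
            # Skip the event query entirely and do not charge the interval
            # that ended inside a capture window.
            return "deferred"
        self.elapsed_s += delta
        if self.query_event():
            return "done"
        return "timeout" if self.elapsed_s >= self.timeout_s else "pending"
```

The cost of this approach is visible in the model: while `capture_active()` keeps returning true, a genuinely hung collective is not detected.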

### Minimal repro shape

1. Thread A starts stream capture with `GLOBAL` mode and keeps capture active.
2. Thread B (watchdog-like thread) calls `hipEventQuery` / `hipEventSynchronize` on an event associated with collective work.
3. Thread B gets capture-related errors (e.g., `hipErrorStreamCaptureUnsupported`), although the capture is running on a different thread and should not cause these calls to fail this way.

### Observed behavior

- Watchdog-side event polling is not reliably safe while another thread is capturing.
- The runtime can return capture-restriction errors, which trigger false failure handling in the framework.

### Expected behavior

- Cross-thread capture interaction should follow documented stream-capture mode semantics.
- Framework watchdog polling should not require extra conservative skipping logic to avoid runtime-induced failures.
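For reference, the mode semantics this expectation relies on can be written down as a small decision function. This is a sketch based on the CUDA runtime documentation for `cudaThreadExchangeStreamCaptureMode`; that HIP's `hipThreadExchangeStreamCaptureMode` is meant to follow the same rules is an assumption here, and `Mode` and `unsafe_call_prohibited` are hypothetical names. The model covers only "potentially unsafe" calls. Calls that necessarily conflict with capture (for example, querying an event last recorded inside a capture sequence) are prohibited in every mode and are not modeled.

```python
from enum import Enum


class Mode(Enum):
    GLOBAL = "global"
    THREAD_LOCAL = "thread_local"
    RELAXED = "relaxed"


def unsafe_call_prohibited(caller_mode, own_capture_modes, other_capture_modes):
    """Return True if a potentially unsafe call should be rejected.

    caller_mode:         the calling thread's current capture mode
    own_capture_modes:   begin-capture modes of captures ongoing on the caller's thread
    other_capture_modes: begin-capture modes of captures ongoing on other threads
    """
    if caller_mode is Mode.RELAXED:
        # A relaxed thread is never blocked by this rule.
        return False
    own_strict = any(m is not Mode.RELAXED for m in own_capture_modes)
    if caller_mode is Mode.THREAD_LOCAL:
        # Only the caller's own non-relaxed captures matter.
        return own_strict
    # GLOBAL: also blocked by global-mode captures on any other thread.
    other_global = any(m is Mode.GLOBAL for m in other_capture_modes)
    return own_strict or other_global
```

Applied to the repro shape above, the watchdog thread has no capture of its own and another thread is capturing in `GLOBAL` mode. Under this model the call is rejected only if the watchdog thread itself is in `GLOBAL` mode, and is allowed in `ThreadLocal` or `Relaxed` mode.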

### Notes

HIP/CUDA APIs expose stream-scoped capture status (`StreamIsCapturing`, `StreamGetCaptureInfo[_v2]`) but not a process-wide "is any capture active" query, so framework-side mitigation is necessarily best-effort.
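To illustrate why the mitigation is best-effort, here is a minimal sketch of the kind of bookkeeping a framework has to do itself when the runtime offers no process-wide query. `CaptureRegistry` and its methods are hypothetical and not taken from PyTorch; the sketch assumes every capture is started through the `capturing()` wrapper.

```python
import threading
from contextlib import contextmanager


class CaptureRegistry:
    """Process-wide count of captures started through this wrapper (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0

    @contextmanager
    def capturing(self):
        """Wrap a capture region; the real begin/end capture calls would go inside."""
        with self._lock:
            self._active += 1
        try:
            yield
        finally:
            with self._lock:
                self._active -= 1

    def any_active(self):
        """True if any thread is currently inside a capturing() region."""
        with self._lock:
            return self._active > 0
```

Two gaps remain with this design. A capture begun outside the wrapper (for example, by a third-party library calling the runtime directly) is invisible to `any_active()`, and a capture can start between the `any_active()` check and the event query that follows it.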

References:
- PyTorch workaround context: https://github.com/pytorch/pytorch/pull/176251

### Versions

Reproduced on ROCm/HIP runtime 7.2.26015 with PyTorch 2.12.0a0+gitcb798d7

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang
