Reland #2 "[C10] PG observability hooks. (#108815)" by wconstab · Pull Request #111069 · pytorch/pytorch

wconstab · 2023-10-11T18:57:23Z

Stack from ghstack (oldest at bottom):

This reverts commit ff0358b.

(original PR #108815 desc copied below)

Expose a set of observability hooks into C10D so that our users can
detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

register_collective_start_hook
register_collective_end_hook
register_process_group_hook

The process group hook exposes PG creation on the member ranks; registered hooks are called inline from
the PG creation code. This is fine since it happens during initialization and only a limited number of times.
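
To illustrate what "called inline" means here, a minimal pure-Python sketch follows. It does not use PyTorch, and every name in it (`register_pg_hook_sketch`, `create_group_sketch`, the `info` dict) is a hypothetical stand-in, not the API added by this PR. It only shows hooks running synchronously on the calling thread, and only on ranks that are members of the new group.

```python
from typing import Callable, Dict, List

# Hypothetical registry; not part of torch.distributed.
_pg_hooks: List[Callable[[Dict], None]] = []


def register_pg_hook_sketch(hook: Callable[[Dict], None]) -> None:
    """Remember a callback to run whenever a group is created."""
    _pg_hooks.append(hook)


def create_group_sketch(name: str, ranks: List[int], my_rank: int) -> Dict:
    """Stand-in for PG creation: builds a record and fires hooks inline."""
    info = {"name": name, "ranks": list(ranks)}
    if my_rank in ranks:
        # Inline call: runs on the caller's thread before creation returns.
        for hook in _pg_hooks:
            hook(info)
    return info
```

Because the callbacks block group creation, this approach only suits events that are rare, which matches the description's point that PG creation happens a limited number of times during initialization.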

The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them to the registered hooks.
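
The single-dispatcher-thread pattern described above can be sketched in plain Python. This is an illustration under assumptions, not the PR's implementation: the real queue lives in C++, while this sketch uses `queue.Queue`, and the class name, method names and the `"start"`/`"end"` event kinds are all hypothetical.

```python
import queue
import threading
from typing import Any, Callable, List


class HookDispatcherSketch:
    """One background thread drains a queue and calls the registered hooks."""

    def __init__(self) -> None:
        self._events: queue.Queue = queue.Queue()
        self._start_hooks: List[Callable[[Any], None]] = []
        self._end_hooks: List[Callable[[Any], None]] = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def register_start(self, hook: Callable[[Any], None]) -> None:
        self._start_hooks.append(hook)

    def register_end(self, hook: Callable[[Any], None]) -> None:
        self._end_hooks.append(hook)

    def publish(self, kind: str, payload: Any) -> None:
        # Cheap for the producer: enqueue and return immediately.
        self._events.put((kind, payload))

    def _run(self) -> None:
        while True:
            item = self._events.get()
            if item is None:  # sentinel pushed by close()
                break
            kind, payload = item
            hooks = self._start_hooks if kind == "start" else self._end_hooks
            for hook in list(hooks):
                hook(payload)

    def close(self) -> None:
        self._events.put(None)
        self._thread.join()
```

The producer side only pays for an enqueue, so slow user hooks delay other hooks but not the thread issuing collectives.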

Queue notification is, oddly, done using a pipe. This is needed so that Python can abort the thread on shutdown
and run it as a background thread, which is not possible with more reasonable choices like a condvar.
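
As a rough illustration of pipe-based notification, the sketch below writes one byte per queued item and lets the consumer block on the read end; closing the write end yields end-of-file, which doubles as the shutdown signal. This is an assumption about how such a scheme can work in general, not a description of the PR's C++ code, and `PipeNotifiedQueueSketch` and its methods are hypothetical names.

```python
import collections
import os
import threading
from typing import Any, Callable


class PipeNotifiedQueueSketch:
    """A deque whose consumer is woken through an OS pipe."""

    def __init__(self) -> None:
        self._items: collections.deque = collections.deque()
        self._lock = threading.Lock()
        self._read_fd, self._write_fd = os.pipe()

    def put(self, item: Any) -> None:
        with self._lock:
            self._items.append(item)
        # One byte per item, written after the item is visible.
        os.write(self._write_fd, b"x")

    def shutdown(self) -> None:
        # Closing the only write end makes the reader see EOF
        # once it has drained the bytes already in the pipe.
        os.close(self._write_fd)

    def consume(self, handler: Callable[[Any], None]) -> None:
        while True:
            data = os.read(self._read_fd, 1)  # blocks until a byte or EOF
            if not data:
                break
            with self._lock:
                item = self._items.popleft()
            handler(item)
        os.close(self._read_fd)
```

A usage pattern would be to run `consume` in a `threading.Thread`, call `put` from producers, and call `shutdown` once at exit; items queued before `shutdown` are still delivered because their bytes remain in the pipe buffer.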


pytorch-bot · 2023-10-11T18:57:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111069

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 11 Unrelated Failures

As of commit 016e261 with merge base fd4ba80:

NEW FAILURES - The following jobs have failed:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
pull / linux-focal-py3.8-clang10 / test (crossref, 2, 2, linux.2xlarge) (gh)
pull / linux-focal-py3.8-clang10 / test (default, 2, 3, linux.2xlarge) (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 9b17d0d
Pull Request resolved: #111069

wconstab requested review from H-Huang, awgu, fegin, kwen2501, mrshenli, rohan-varma, wanchaol and zhaojuanmao as code owners October 11, 2023 18:57

wconstab mentioned this pull request Oct 11, 2023

Move cuda driver exit handling from helpers to threads #111061

Closed

wconstab requested review from d4l3k, fduwjj, kiukchung and wz337 as code owners October 11, 2023 18:57

pytorch-bot bot added the release notes: distributed (c10d) release notes category label Oct 11, 2023

wconstab closed this Oct 11, 2023

wconstab deleted the gh/wconstab/204/head branch October 11, 2023 19:03

wconstab wants to merge 1 commit into gh/wconstab/204/base from gh/wconstab/204/head
