
Reland "[C10] PG observability hooks. (#108815)"#110907

Closed
wconstab wants to merge 2 commits into gh/wconstab/200/base from gh/wconstab/200/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Oct 9, 2023

Stack from ghstack (oldest at bottom):

This reverts commit ff0358b.

(original PR #108815 desc copied below)

Expose a set of observability hooks into C10D so that users can detect collective failures faster and more easily.

The design is similar to NCCL desync debug in that it minimizes overhead by doing most of the work off the main thread.

This PR introduces a new module, torch.distributed.hooks, that exposes the following set of methods:

register_collective_start_hook
register_collective_end_hook
register_process_group_hook

The process group hook exposes PG creation on the member ranks; it is called inline from the PG creation code. This is fine since PG creation happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them to the registered callbacks.

Queue notification is done, somewhat unusually, via a pipe; this is needed so Python can abort the thread on shutdown while still running it as a background thread, which is not possible with more conventional choices such as a condition variable.
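The dispatch mechanism described above (a background thread draining an event queue, woken through a pipe so that shutdown can interrupt it) can be sketched in pure Python. This is a self-contained illustration of the pattern only; the real implementation reads from a C++ queue, and the class and method names here are invented for the sketch.

```python
import os
import queue
import threading

class EventDispatcher:
    """Daemon thread drains an event queue; a pipe provides wake-up and shutdown."""

    def __init__(self, callback):
        self._events = queue.Queue()
        self._rfd, self._wfd = os.pipe()  # wake-up channel
        self._callback = callback
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def push(self, event):
        # Enqueue the event first, then write one byte to wake the reader.
        self._events.put(event)
        os.write(self._wfd, b"x")

    def shutdown(self):
        # Closing the write end lets the reader drain pending bytes, then
        # observe EOF and exit -- something a condvar wait cannot offer.
        os.close(self._wfd)
        self._thread.join(timeout=5)

    def _run(self):
        while True:
            if not os.read(self._rfd, 1):  # b"" means EOF: shutdown requested
                break
            self._callback(self._events.get())
        os.close(self._rfd)
```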

Add missing docs for new APIs

This reverts commit ff0358b.

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 9, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110907

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 9851f9f with merge base 733368a:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Oct 9, 2023
Add missing docs for new APIs

This reverts commit ff0358b.

ghstack-source-id: b3f8a9c
Pull Request resolved: #110907
@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 9, 2023
Contributor

@fduwjj fduwjj left a comment


LGTM

Add missing docs for new APIs

This reverts commit ff0358b.

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 10, 2023
Add missing docs for new APIs

This reverts commit ff0358b.

ghstack-source-id: 4ffd0ed
Pull Request resolved: #110907
@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@huydhn
Contributor

huydhn commented Oct 11, 2023

@pytorchbot revert -m 'Sorry for reverting this, but macos job in trunk starts failing after this https://hud.pytorch.org/pytorch/pytorch/commit/7678cd22af46c9df4fb47a409d3e8ad71a6127ea' -c ignoresignal

@pytorch-bot

pytorch-bot bot commented Oct 11, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'ignoresignal' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@huydhn
Contributor

huydhn commented Oct 11, 2023

@pytorchbot revert -m 'Sorry for reverting this, but macos job in trunk starts failing after this https://hud.pytorch.org/pytorch/pytorch/commit/7678cd22af46c9df4fb47a409d3e8ad71a6127ea' -c ignoredsignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@wconstab your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 11, 2023
This reverts commit 7678cd2.

Reverted #110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this https://hud.pytorch.org/pytorch/pytorch/commit/7678cd22af46c9df4fb47a409d3e8ad71a6127ea ([comment](#110907 (comment)))
wconstab added a commit that referenced this pull request Oct 11, 2023
This reverts commit 314a502.

(original PR #108815 description as above)

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 11, 2023
This reverts commit 314a502.

(original PR #108815 description as above)

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 11, 2023
…108815, #110907)""


This reverts commit 314a502.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR #108815 description as above)

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 11, 2023
This reverts commit 314a502.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR #108815 description as above)

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 11, 2023
This reverts commit 314a502.

(original PR #108815 description as above)

ghstack-source-id: 07012bb
Pull Request resolved: #111072
wconstab added a commit that referenced this pull request Oct 11, 2023
…108815, #110907)""


This reverts commit 314a502.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR #108815 description as above)

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 11, 2023
This reverts commit 314a502.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR #108815 description as above)

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Oct 11, 2023
This reverts commit 314a502.

(original PR #108815 description as above)

ghstack-source-id: 2c235b7
Pull Request resolved: #111072
pytorchmergebot pushed a commit that referenced this pull request Oct 12, 2023
This reverts commit 314a502.

Changes since original PR:
Reland 1
 *  rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
 * make _hooks importable even if !distributed.is_available()
 * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

(original PR #108815 description as above)
Pull Request resolved: #111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/200/head branch October 14, 2023 14:24
wconstab added a commit to wconstab/pytorch that referenced this pull request Oct 16, 2023
…ytorch#2 "[C10] PG observability hooks. (pytorch#108815, pytorch#110907)" (pytorch#111072)" for test or build failures (pytorch#111393)

Summary:
This diff is reverting D50250526
D50250526: Reland pytorch#2 "[C10] PG observability hooks. (pytorch#108815, pytorch#110907)" (pytorch#111072) by wconstab has been identified to be causing the following test or build failures:

Tests affected:
- [cogwheel:cogwheel_ig_clips_tab_derived_feature_importance#test_ig_clips_tab_derived_feature_importance](https://www.internalfb.com/intern/test/844425021976403/)

Here's the Multisect link:
https://www.internalfb.com/multisect/3290230
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.


Test Plan: NA

Differential Revision: D50299914

Pulled By: wconstab
pytorchmergebot added a commit that referenced this pull request Oct 16, 2023

Labels

ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · release notes: distributed (c10d) · Reverted

4 participants