[C10] PG observability hooks. by kumpera · Pull Request #108815 · pytorch/pytorch

kumpera · 2023-09-07T23:07:50Z

Stack from ghstack (oldest at bottom):

Expose a set of observability hooks into C10D so that users can
detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

register_collective_start_hook
register_collective_end_hook
register_process_group_hook

The process group hook exposes PG creation on the member ranks. Registered hooks are called inline from
the PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them to the registered hooks.

Queue notification is, oddly, done using a pipe. This is needed so that Python can abort the thread on shutdown
and run it as a background thread, which is not possible with more reasonable choices like a condvar.
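
The pipe-based wakeup can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: a `deque` stands in for the C++ queue, the names `notify`, `dispatch_loop`, `events` and `hooks` are hypothetical, and closing the write end of the pipe is just one plausible way to signal shutdown.

```python
import os
import threading
from collections import deque

events = deque()               # stand-in for the C++ event queue
hooks = []                     # registered collective hooks
read_fd, write_fd = os.pipe()  # the pipe is used only as a wakeup channel


def notify(event):
    """Producer side: enqueue an event, then wake the reader with one byte."""
    events.append(event)
    os.write(write_fd, b"x")


def dispatch_loop():
    """Consumer side: block on the pipe; an empty read (EOF) means shut down."""
    while True:
        data = os.read(read_fd, 1)
        if not data:
            break
        event = events.popleft()  # one byte was written per queued event
        for hook in hooks:
            hook(event)


received = []
hooks.append(received.append)

worker = threading.Thread(target=dispatch_loop, daemon=True)
worker.start()

notify({"operation": "all_reduce", "sequence_number": 1})
notify({"operation": "broadcast", "sequence_number": 2})

os.close(write_fd)      # closing the write end unblocks the reader with EOF
worker.join(timeout=5)
os.close(read_fd)
print([e["operation"] for e in received])  # ['all_reduce', 'broadcast']
```

A blocking read on a file descriptor can be interrupted simply by closing the other end, whereas a thread parked on a condition variable needs a cooperating notifier, which is harder to guarantee during interpreter shutdown.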


pytorch-bot · 2023-09-07T23:07:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108815

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 32973c4 with merge base 901aa85:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

trunk / linux-focal-rocm5.6-py3.8 / test (default, 3, 3, linux.rocm.gpu, unstable) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 7ccbd64 Pull Request resolved: #108815


ghstack-source-id: a7908ca Pull Request resolved: #108815

test/distributed/test_hooks.py

torch/csrc/distributed/c10d/Backend.cpp


kumpera · 2023-09-12T18:04:13Z

There are multiple test failures on this PR that I'm working on.


ghstack-source-id: 04f1745 Pull Request resolved: #108815

xw285cornell · 2023-09-13T22:39:02Z

This looks great! Quick question: it seems this now writes to an os.pipe instead of committing to TCPStore. Previously, rank 0 aggregated the info and then told everyone who the straggler was (which makes rank 0 the bottleneck). How do we gather the global info from each rank's pipe?

This reverts commit 0c7a877. Reverted #108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](#108815 (comment)))

Add missing docs for new APIs This reverts commit ff0358b. [ghstack-poisoned]

Add missing docs for new APIs This reverts commit ff0358b. ghstack-source-id: b3f8a9c Pull Request resolved: #110907


Add missing docs for new APIs This reverts commit ff0358b. ghstack-source-id: 4ffd0ed Pull Request resolved: #110907

This reverts commit ff0358b. Pull Request resolved: #110907 Approved by: https://github.com/fduwjj

This reverts commit 7678cd2. Reverted #110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this https://hud.pytorch.org/pytorch/pytorch/commit/7678cd22af46c9df4fb47a409d3e8ad71a6127ea ([comment](#110907 (comment)))


This reverts commit ff0358b. ghstack-source-id: 9b17d0d Pull Request resolved: #111069


This reverts commit 314a502. ghstack-source-id: 07012bb Pull Request resolved: #111072


This reverts commit 314a502. ghstack-source-id: 2c235b7 Pull Request resolved: #111072

This reverts commit 314a502.

Changes since original PR:

Reland 1
* rename torch.distributed.hooks to torch.distributed._hooks

Reland 2
* make _hooks importable even if !distributed.is_available()
* handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)

Pull Request resolved: #111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061

…ytorch#2 "[C10] PG observability hooks. (pytorch#108815, pytorch#110907)" (pytorch#111072)" for test or build failures (pytorch#111393)

Summary:
This diff is reverting D50250526
D50250526: Reland pytorch#2 "[C10] PG observability hooks. (pytorch#108815, pytorch#110907)" (pytorch#111072) by wconstab has been identified to be causing the following test or build failures:

Tests affected:
- [cogwheel:cogwheel_ig_clips_tab_derived_feature_importance#test_ig_clips_tab_derived_feature_importance](https://www.internalfb.com/intern/test/844425021976403/)

Here's the Multisect link: https://www.internalfb.com/multisect/3290230
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Differential Revision: D50299914

Pulled By: wconstab

…111072)" This reverts commit bb1424d. Reverted #111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](#111072 (comment)))

kumpera requested review from H-Huang, awgu, d4l3k, fduwjj, fegin, kiukchung, kwen2501, mrshenli, rohan-varma, wanchaol, wz337 and zhaojuanmao as code owners September 7, 2023 23:07

kumpera mentioned this pull request Sep 7, 2023

[Gloo] Properly pass op type to Work #108812

Closed

pytorch-bot bot added the release notes: distributed (c10d) release notes category label Sep 7, 2023

This was referenced Sep 7, 2023

[C10D] Track pg name in c++. #108813

Closed

[C10d] Add PG::enableCollectivesTiming to make it dynamically enabled. #108814

Closed

wconstab reviewed Sep 8, 2023

test/distributed/test_hooks.py

wconstab reviewed Sep 8, 2023

torch/csrc/distributed/c10d/Backend.cpp (outdated)

kumpera mentioned this pull request Sep 12, 2023

[C10d] Cleanup collective sequence number. #109136

Closed

pytorchmergebot added the Reverted label Oct 6, 2023

wconstab added a commit that referenced this pull request Oct 9, 2023

Reland "[C10] PG observability hooks. (#108815)"

744abcd

Add missing docs for new APIs This reverts commit ff0358b. [ghstack-poisoned]

wconstab added a commit that referenced this pull request Oct 9, 2023

Reland "[C10] PG observability hooks. (#108815)"

8ca065e

Add missing docs for new APIs This reverts commit ff0358b. [ghstack-poisoned]

wconstab added a commit that referenced this pull request Oct 9, 2023

Reland "[C10] PG observability hooks. (#108815)"

4d883d7

Add missing docs for new APIs This reverts commit ff0358b. ghstack-source-id: b3f8a9c Pull Request resolved: #110907

facebook-github-bot deleted the gh/kumpera/59/head branch October 10, 2023 14:25

wconstab added a commit that referenced this pull request Oct 10, 2023

Update on "Reland "[C10] PG observability hooks. (#108815)""

9851f9f

Add missing docs for new APIs This reverts commit ff0358b. [ghstack-poisoned]

wconstab added a commit that referenced this pull request Oct 10, 2023

Reland "[C10] PG observability hooks. (#108815)"

89f04e4

Add missing docs for new APIs This reverts commit ff0358b. ghstack-source-id: 4ffd0ed Pull Request resolved: #110907

wconstab mentioned this pull request Oct 10, 2023

Reland "[C10] PG observability hooks. (#108815)" #110907

Closed

wconstab mentioned this pull request Oct 11, 2023

Reland #2 "[C10] PG observability hooks. (#108815)" #111069

Closed

wconstab mentioned this pull request Oct 11, 2023

Reland #2 "[C10] PG observability hooks. (#108815, #110907)" #111072

Closed

wconstab mentioned this pull request Oct 16, 2023

Revert D50250526: Multisect successfully blamed "D50250526: Reland #2 "[C10] PG observability hooks. (#108815, #110907)" (#111072)" for test or build failures #111393

Closed

kumpera wants to merge 16 commits into gh/kumpera/59/base from gh/kumpera/59/head


7 participants
