
[C10D] Report detected failures when emitting collective end events. #109739

Closed
kumpera wants to merge 16 commits into gh/kumpera/64/base from gh/kumpera/64/head

kumpera (Contributor) commented Sep 20, 2023

Stack from ghstack (oldest at bottom):

We leverage the fact that errors get tracked in Work to report them.

The error message by itself is not very useful, so we might need to
include some form of error code as well (NCCL failure, timeout, etc.).
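
To make the idea above concrete, here is a minimal sketch of an end event that carries both an error message and a coarse error code. Everything in it is an assumption for illustration: `FakeWork` is a stand-in that only mirrors the shape of the `isCompleted()`, `isSuccess()` and `exception()` calls, and `FailureKind`, `TimeoutError`, `CollectiveEndEvent` and `makeEndEvent` are hypothetical names, not part of c10d or of this PR.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical error categories; the PR only suggests "some form of error code".
enum class FailureKind { None, Timeout, BackendError, Unknown };

// Hypothetical exception type used here to mark timeouts.
struct TimeoutError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Stand-in for the work object. NOT the real class: it only mirrors the
// shape of the three calls the event code relies on.
class FakeWork {
 public:
  void finish(std::exception_ptr err = nullptr) {
    completed_ = true;
    error_ = err;
  }
  bool isCompleted() { return completed_; }  // deliberately non-const
  bool isSuccess() const { return completed_ && !error_; }
  std::exception_ptr exception() const { return error_; }

 private:
  bool completed_ = false;
  std::exception_ptr error_;
};

// Hypothetical event payload: the message plus a machine-readable category.
struct CollectiveEndEvent {
  std::string operation;
  FailureKind failure = FailureKind::None;
  std::string error_message;  // empty when the collective succeeded
};

// Builds the end event and classifies the stored exception, if any.
CollectiveEndEvent makeEndEvent(const std::string& op, FakeWork& work) {
  CollectiveEndEvent evt;
  evt.operation = op;
  if (work.isCompleted() && !work.isSuccess()) {
    std::exception_ptr err = work.exception();
    if (err) {
      try {
        std::rethrow_exception(err);
      } catch (const TimeoutError& e) {
        evt.failure = FailureKind::Timeout;
        evt.error_message = e.what();
      } catch (const std::exception& e) {
        evt.failure = FailureKind::BackendError;
        evt.error_message = e.what();
      } catch (...) {
        evt.failure = FailureKind::Unknown;
      }
    }
  }
  return evt;
}
```

A consumer of the event could then branch on `failure` instead of parsing `error_message`, which is the gap the description points at.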

pytorch-bot bot commented Sep 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109739

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 152d8c0 with merge base c151163:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kumpera pushed a commit that referenced this pull request Sep 20, 2023
ghstack-source-id: 581f5f8
Pull Request resolved: #109739
kumpera pushed a commit that referenced this pull request Sep 25, 2023
ghstack-source-id: cbf8737
Pull Request resolved: #109739
kumpera pushed a commit that referenced this pull request Sep 29, 2023
ghstack-source-id: 5b58562
Pull Request resolved: #109739
Rodrigo Kumpera added 2 commits October 2, 2023 08:31
kumpera force-pushed the gh/kumpera/64/head branch from 4a43294 to c4601d7 on October 4, 2023 21:10
kumpera (Contributor, Author) commented Oct 5, 2023

There's one major issue left on this PR: it doesn't work with gloo.

Rodrigo Kumpera added 2 commits October 5, 2023 10:42
fduwjj (Contributor) left a comment

How do we ensure the watchdog does not abort the program, so that we can still return the error message?

wconstab added a commit that referenced this pull request Oct 6, 2023
ghstack-source-id: 9052d68
Pull Request resolved: #109739
wconstab added a commit that referenced this pull request Oct 9, 2023
ghstack-source-id: faa4847
Pull Request resolved: #109739
wconstab added a commit that referenced this pull request Oct 10, 2023
with @kumpera

ghstack-source-id: 0c6e219
Pull Request resolved: #109739
wconstab added a commit that referenced this pull request Oct 12, 2023
with @kumpera

ghstack-source-id: 2dc1339
Pull Request resolved: #109739
evt.operation = c10d::opTypeToString(work.retrieveOpType());
evt.drop_count = 0;
// isCompleted() is a non-const member function, so constness has to be cast away here.
if (const_cast<Work&>(work).isCompleted() && !work.isSuccess())

@fduwjj I think we should discuss this part.

We're calling work.exception(), which will query some CUDA APIs, and we're doing this from the main watchdog thread.

I think today's watchdog design already does this kind of thing in its main loop, so I don't have a big reservation about it.

But if you're redesigning the loops to put CUDA accesses on another thread, you'd also want to move this part. It would be good to think through how the callback can report the error message safely, or else maybe not add the error to the callback in the first place. (I think it should be possible, though.)
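
One possible shape for "reporting the error message safely" is sketched below, purely as an illustration of the concern raised in this comment and not as the design the reviewers settled on. The thread that is allowed to touch the backend turns the exception into plain strings and pushes an immutable snapshot into a mutex-protected queue; the thread that runs callbacks only ever sees the snapshot. `EndEventSnapshot`, `describe` and `SnapshotQueue` are hypothetical names introduced for this sketch.

```cpp
#include <cstddef>
#include <deque>
#include <exception>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Plain-data copy of what the callback needs. It holds no handle to the
// work object, so reading it never triggers a backend query.
struct EndEventSnapshot {
  std::string operation;
  bool failed = false;
  std::string error_message;
};

// Converts an exception_ptr to text. Intended to run on whichever thread
// is allowed to query the backend (an assumption of this sketch).
inline std::string describe(std::exception_ptr err) {
  if (!err) {
    return "";
  }
  try {
    std::rethrow_exception(err);
  } catch (const std::exception& e) {
    return e.what();
  } catch (...) {
    return "unknown error";
  }
}

class SnapshotQueue {
 public:
  // Producer side: called after the error has already been extracted.
  void push(EndEventSnapshot snapshot) {
    std::lock_guard<std::mutex> lock(mu_);
    pending_.push_back(std::move(snapshot));
  }

  // Consumer side: takes everything queued so far, then runs the callback
  // outside the lock so a slow callback cannot block producers.
  std::size_t drain(const std::function<void(const EndEventSnapshot&)>& cb) {
    std::deque<EndEventSnapshot> batch;
    {
      std::lock_guard<std::mutex> lock(mu_);
      batch.swap(pending_);
    }
    for (const EndEventSnapshot& s : batch) {
      cb(s);
    }
    return batch.size();
  }

 private:
  std::mutex mu_;
  std::deque<EndEventSnapshot> pending_;
};
```

The trade-off in this sketch is that the message is computed eagerly for every failed collective, even if no callback is registered; in exchange, the callback path stays free of backend calls regardless of which thread ends up running it.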

wconstab added a commit that referenced this pull request Oct 15, 2023
with @kumpera

ghstack-source-id: ec2e912
Pull Request resolved: #109739
github-actions bot (Contributor) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label Dec 15, 2023
github-actions bot closed this Jan 14, 2024
github-actions bot deleted the gh/kumpera/64/head branch February 19, 2024 01:58
4 participants