Skip to content

add more logging fields that can be set in construction time#51260

Closed
zhaojuanmao wants to merge 12 commits intogh/zhaojuanmao/58/basefrom
gh/zhaojuanmao/58/head
Closed

add more logging fields that can be set in construction time#51260
zhaojuanmao wants to merge 12 commits intogh/zhaojuanmao/58/basefrom
gh/zhaojuanmao/58/head

Conversation

@zhaojuanmao
Copy link
Copy Markdown
Contributor

@zhaojuanmao zhaojuanmao commented Jan 28, 2021

Stack from ghstack:

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: D26118245

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Jan 28, 2021

💊 CI failures summary and remediations

As of commit 0b04c41 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jan 28, 2021
zhaojuanmao added a commit that referenced this pull request Jan 28, 2021
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

ghstack-source-id: 120541380
Pull Request resolved: #51260
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Jan 28, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 120603119

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Jan 29, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 120693261

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 1, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 120805134

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)
Comment thread torch/lib/c10d/reducer.cpp Outdated
Comment thread c10/util/Logging.h
Comment thread c10/util/Logging.h
Comment thread c10/util/Logging.h
Comment thread torch/lib/c10d/reducer.cpp Outdated
Comment thread torch/lib/c10d/reducer.cpp Outdated
Comment thread torch/lib/c10d/reducer.cpp Outdated
Comment thread torch/lib/c10d/reducer.hpp Outdated
Comment thread torch/testing/_internal/distributed/distributed_test.py
Comment thread torch/testing/_internal/distributed/distributed_test.py
@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

zhaojuanmao commented Feb 2, 2021

@rohan-varma thanks for reviewing the diff. I can see your major concern is about logging the env variables that depends on the specific backends.

Here are the reasons why I log them as it is right now:

  1. env variables are logged mainly for better debug ability. but env variable list is huge, as you can see an uncompleted list in "_dump_DDP_relevant_env_vars()" in "torch/nn/parallel/distributed.py", I only cherry picked important ones from the list as a starting point. ==> if we need more env variables logged later on, like "high priority stream" flags and etc, it can be easily added later on.
  2. DDPLoggingData struct is a generic data struct that could be used in different logging APIs, where users may need fixed schema. So I thought giving a fixed set of fields will be easier and flexible to manage.
  3. no matter the backend is gloo or nccl, sometimes the env variables are set statically in users' development environment so that they can switch gloo and nccl easily in their codes. So if the users specify the nccl as backend, some gloo env variables could still get valid values and are logged. If env variables are not set, I checked they are logged as "N/A" as expected.

For now, I think we can either 1) log the small set of env variables as the starting pointing in this diff 2) remove NCCL/Gloo env variables right now, add them back later on once you fleshed out debug ability design and list the variables we really need. I prefer 1) though, as we can start with them and see how useful they are.

How do you think?

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 2, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 120885387

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)
@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

regarding separating logging codes from reducer core logic, I think we can have a logger that wraps ddp_logging_data pointer and also reducer pointer, then put all logging related methods into logger. Let me try that

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 4, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121021915

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26118245/)!
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 4, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121022069

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26118245/)!
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 4, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121045907

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26118245/)!
@rohan-varma
Copy link
Copy Markdown
Contributor

@rohan-varma thanks for reviewing the diff. I can see your major concern is about logging the env variables that depends on the specific backends.

Here are the reasons why I log them as it is right now:

  1. env variables are logged mainly for better debug ability. but env variable list is huge, as you can see an uncompleted list in "_dump_DDP_relevant_env_vars()" in "torch/nn/parallel/distributed.py", I only cherry picked important ones from the list as a starting point. ==> if we need more env variables logged later on, like "high priority stream" flags and etc, it can be easily added later on.
  2. DDPLoggingData struct is a generic data struct that could be used in different logging APIs, where users may need fixed schema. So I thought giving a fixed set of fields will be easier and flexible to manage.
  3. no matter the backend is gloo or nccl, sometimes the env variables are set statically in users' development environment so that they can switch gloo and nccl easily in their codes. So if the users specify the nccl as backend, some gloo env variables could still get valid values and are logged. If env variables are not set, I checked they are logged as "N/A" as expected.

For now, I think we can either 1) log the small set of env variables as the starting pointing in this diff 2) remove NCCL/Gloo env variables right now, add them back later on once you fleshed out debug ability design and list the variables we really need. I prefer 1) though, as we can start with them and see how useful they are.

How do you think?

Sure, (1) sounds good to me for now and we can refactor later as needed.

Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/nn/parallel/distributed.py
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I'm sort of confused about a line in the test, please double check that, and some other small comments.

Comment thread torch/testing/_internal/distributed/distributed_test.py Outdated
Comment thread torch/lib/c10d/logger.cpp
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 5, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121102207

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26118245/)!
add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Feb 8, 2021
Pull Request resolved: #51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Differential Revision: [D26118245](https://our.internmc.facebook.com/intern/diff/D26118245/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26118245/)!
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 18e0a61.

@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/58/head branch February 13, 2021 15:18
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
…#51260)

Summary:
Pull Request resolved: pytorch#51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
wubai pushed a commit to wubai/pytorch that referenced this pull request Aug 8, 2023
…#51260)

Summary:
Pull Request resolved: pytorch#51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…#51260)

Summary:
Pull Request resolved: pytorch#51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants