Add NCCL collective sequence number (seq_num) to Kineto profiler traces#177148
mdlogic wants to merge 1 commit into pytorch:main
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177148
Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures
As of commit cbe0662 with merge base b2a70fa. This comment was automatically generated by Dr. CI and updates every 15 minutes.

This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @mdlogic, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.
2c5cf6b to e0c292e
…es (pytorch#177148)

Summary:
Thread the per-process-group sequence number from ProcessGroupNCCL through ParamCommsDebugInfo into the Kineto trace JSON output.

This enables cross-rank correlation of collective operations: all ranks participating in the same collective instance share the same seq_num within a process group. Without this, there is no way to match collective events across ranks in production trace data.

Changes:
- ParamCommsUtils.hpp: Add sequenceNumber_/isP2P_ fields and setter to ParamCommsDebugInfo. Update RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA macros to populate them from the existing seq tuple.
- util.h: Add kSeqNum constant ("Seq")
- util.cpp: Emit seq_num in saveNcclMeta() when available
- test_c10d_nccl_seq_num_trace.py: Automated test verifying seq_num appears correctly in chrome trace output

Test Plan:
buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace (2/2 pass)

Differential Revision: D96145503
3513bd3 to 72a123c
72a123c to 78336db
7820fcf to dfa30a6
dfa30a6 to bb332b0
bb332b0 to 3711761
3711761 to e9e8c61
e9e8c61 to f9270e8
1858fbe to 5d1b211
5d1b211 to e689e3d
5726e18 to 0b6b4ab
…es (pytorch#177148)

Summary:
Pull Request resolved: pytorch#177148

Thread the per-process-group sequence number from ProcessGroupNCCL through ParamCommsDebugInfo into the Kineto trace JSON output.

This enables cross-rank correlation of collective operations: all ranks participating in the same collective instance share the same seq_num within a process group. Without this, there is no way to match collective events across ranks in production trace data.

Changes:
- ParamCommsUtils.hpp: Add sequenceNumber_/isP2P_ fields and setter to ParamCommsDebugInfo. Update RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA macros to populate them from the existing seq tuple.
- util.h: Add kSeqNum constant ("Seq")
- util.cpp: Emit seq_num in saveNcclMeta() when available
- test_c10d_nccl_seq_num_trace.py: Automated test verifying seq_num appears correctly in chrome trace output

Test Plan:
buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace

```
[marvindz@devvm34681.odn0 /data/repos/fbsource (294b04f486)]$ buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace
File changed: fbcode//caffe2/test/distributed/fb/test_c10d_nccl_seq_num_trace.py
File changed: fbsource//xplat/caffe2/test/distributed/fb/test_c10d_nccl_seq_num_trace.py
Buck UI: https://www.internalfb.com/buck2/1632f643-ed1b-49dc-83fe-dfb8344e6d45
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498759528389
Network: Up: 0B Down: 3.0KiB (reSessionID-df2787a7-3567-4f5a-b3d6-0f5143737a74)
Executing actions. Remaining 0/2
Command: test. Finished 1 local
Time elapsed: 1:09.4s
Tests finished: Pass 1. Fail 0. Timeout 0. Fatal 0. Skip 0. Omit 0. Infra Failure 0. Build failure 0
[marvindz@devvm34681.odn0 /data/repos/fbsource (a82bbe6994)]$
```

Reviewed By: hjli-creator

Differential Revision: D96145503
0b6b4ab to cbe0662
Bump Kineto submodule from 0035505 to 2b15a60 to include pytorch/kineto#1296 (seq_num propagation to GPU kernel events in trace output). This is needed for pytorch#177148 (NCCL sequence number tracing).
hjli-creator left a comment
This looks good to me. Please merge this after the previous Kineto change is merged.
Bump Kineto submodule from 0035505 to 2b15a60 to include pytorch/kineto#1296 (seq_num propagation to GPU kernel events in trace output). This is needed so that #177148 (D96145503) can use the new Kineto APIs for NCCL sequence number tracing.

## Included kineto commits
- 2b15a60 Add seq_num propagation to GPU kernel events in Kineto trace output (#1296)
- 350b58f Refactor CuptiActivityProfiler.cpp to use CuptiCbidRegistry (#1297)
- 1f9ceb1 Use HAS_CUPTI_RANGE_PROFILER to avoid range profiler init (#1298)
- ebaac17 Add USDT log type to logger framework (#1285)
- e2e7e97 Revert D94566477: Add NCCL collective sequence number (seq_num) to Kineto profiler traces
- a7c5f4d Add NCCL collective sequence number (seq_num) to Kineto profiler traces (#1294)

Pull Request resolved: #177298
Approved by: https://github.com/sanrise, https://github.com/malfet
sanrise left a comment
LGTM. The supporting Kineto submodule update has already landed with PR #177298; this is the final change in that stack, and it starts recording sequence numbers per collective. This is required to correlate record_param_comms events across traces in a distributed setting where a job produces multiple traces (for example, more than 128 ranks participating in an all-reduce).
The unit tests look good on review.
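
The cross-rank correlation described above could be done offline along these lines. This is only a minimal sketch, not code from this PR: the `"Seq"` key comes from the `kSeqNum` constant listed in the changes, while the `"Process Group Name"` key, the `correlate_collectives` helper, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict

SEQ_KEY = "Seq"                  # kSeqNum constant named in this PR
PG_KEY = "Process Group Name"    # assumed key for the process group in event args


def correlate_collectives(traces_by_rank):
    """Group record_param_comms events by (process group, seq_num).

    traces_by_rank maps rank -> parsed chrome trace dict (one json.load per rank).
    Returns {(pg_name, seq): {rank: [events]}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for rank, trace in traces_by_rank.items():
        for event in trace.get("traceEvents", []):
            if event.get("name") != "record_param_comms":
                continue
            args = event.get("args", {})
            if SEQ_KEY not in args:
                continue  # seq_num is only emitted when available
            groups[(args.get(PG_KEY), args[SEQ_KEY])][rank].append(event)
    return {key: dict(per_rank) for key, per_rank in groups.items()}


# Hypothetical, hand-written events standing in for two ranks' trace files.
traces = {
    0: {"traceEvents": [
        {"name": "aten::add", "ts": 50},
        {"name": "record_param_comms", "ts": 100,
         "args": {SEQ_KEY: 7, PG_KEY: "0"}},
        {"name": "record_param_comms", "ts": 200,
         "args": {SEQ_KEY: 8, PG_KEY: "0"}},
    ]},
    1: {"traceEvents": [
        {"name": "record_param_comms", "ts": 130,
         "args": {SEQ_KEY: 7, PG_KEY: "0"}},
    ]},
}

matched = correlate_collectives(traces)
for (pg, seq), per_rank in matched.items():
    starts = {rank: events[0]["ts"] for rank, events in per_rank.items()}
    print(f"pg={pg} seq={seq} start timestamps by rank: {starts}")
```

Keying on the pair rather than on seq_num alone follows the summary's wording that the number is shared "within a process group"; a key that is present on only some ranks (seq 8 in the sample data) may point at a rank whose trace is missing or truncated.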
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Summary:
Thread the per-process-group sequence number from ProcessGroupNCCL through
ParamCommsDebugInfo into the Kineto trace JSON output.
This enables cross-rank correlation of collective operations: all ranks
participating in the same collective instance share the same seq_num
within a process group. Without this, there is no way to match collective
events across ranks in production trace data.
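Changes:
- ParamCommsUtils.hpp: Add sequenceNumber_/isP2P_ fields and setter to
  ParamCommsDebugInfo. Update RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA
  macros to populate them from the existing seq tuple.
- util.h: Add kSeqNum constant ("Seq")
- util.cpp: Emit seq_num in saveNcclMeta() when available
- test_c10d_nccl_seq_num_trace.py: Automated test verifying seq_num appears
  correctly in chrome trace output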
Test Plan: buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace — 2/2 pass
Differential Revision: D96145503
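
The test file itself is not shown in this conversation. As a rough idea of how an exported chrome trace could be checked for the new field, the sketch below splits `record_param_comms` events into those that carry the `"Seq"` key and those that do not. The `split_by_seq_presence` helper and the sample JSON are hypothetical and are not taken from test_c10d_nccl_seq_num_trace.py.

```python
import json


def split_by_seq_presence(trace, seq_key="Seq", event_name="record_param_comms"):
    """Return (events_with_seq, events_without_seq) for one parsed trace."""
    with_seq, without_seq = [], []
    for event in trace.get("traceEvents", []):
        if event.get("name") != event_name:
            continue
        bucket = with_seq if seq_key in event.get("args", {}) else without_seq
        bucket.append(event)
    return with_seq, without_seq


# Hypothetical trace content; a real trace would be read from the exported file.
sample = json.loads("""
{"traceEvents": [
  {"name": "record_param_comms", "ts": 10, "args": {"Seq": 1}},
  {"name": "record_param_comms", "ts": 20, "args": {}},
  {"name": "aten::mul", "ts": 30, "args": {}}
]}
""")

with_seq, without_seq = split_by_seq_presence(sample)
print(len(with_seq), "events carry Seq;", len(without_seq), "do not")
```

Because the field is emitted only "when available", events without it are reported rather than treated as an error here.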