Add NCCL collective sequence number (seq_num) to Kineto profiler traces #177148

Closed
mdlogic wants to merge 1 commit into pytorch:main from mdlogic:export-D96145503

@mdlogic
Contributor

@mdlogic mdlogic commented Mar 11, 2026

Summary:
Thread the per-process-group sequence number from ProcessGroupNCCL through
ParamCommsDebugInfo into the Kineto trace JSON output.

This enables cross-rank correlation of collective operations: all ranks
participating in the same collective instance share the same seq_num
within a process group. Without this, there is no way to match collective
events across ranks in production trace data.
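
A minimal sketch of how that correlation might look on the consumer side. It assumes each rank's trace is a Chrome-trace JSON object with a `traceEvents` list and that the sequence number sits in an event's `args` under the key `"Seq"` (inferred from the `kSeqNum` constant in this PR). The `"Process Group Name"` key, the helper name `group_collectives_by_seq`, and the sample data are assumptions for illustration and are not confirmed by this PR.

```python
from collections import defaultdict

# Assumed key names; adjust to whatever the trace actually contains.
SEQ_KEY = "Seq"                # inferred from kSeqNum ("Seq") in this PR
PG_KEY = "Process Group Name"  # assumption, not confirmed by this PR


def group_collectives_by_seq(traces_by_rank):
    """Map (process group, seq_num) -> {rank: event} across per-rank traces.

    If one rank has several events with the same key, the last one wins.
    """
    matched = defaultdict(dict)
    for rank, trace in traces_by_rank.items():
        for event in trace.get("traceEvents", []):
            args = event.get("args", {})
            if SEQ_KEY not in args:
                continue  # event carries no sequence number
            key = (args.get(PG_KEY), args[SEQ_KEY])
            matched[key][rank] = event
    return dict(matched)


# Tiny synthetic traces standing in for two ranks' JSON files.
traces = {
    0: {"traceEvents": [
        {"name": "record_param_comms", "ts": 100,
         "args": {PG_KEY: "0", SEQ_KEY: 7}},
        {"name": "aten::add", "ts": 50, "args": {}},
    ]},
    1: {"traceEvents": [
        {"name": "record_param_comms", "ts": 130,
         "args": {PG_KEY: "0", SEQ_KEY: 7}},
    ]},
}

groups = group_collectives_by_seq(traces)
# Per-rank start timestamps of the collective with seq_num 7 in group "0".
start_times = {rank: ev["ts"] for rank, ev in groups[("0", 7)].items()}
print(start_times)
```

Keying on the pair (process group, sequence number) rather than the sequence number alone reflects the statement above that the number is only shared within a process group.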

Changes:

  • ParamCommsUtils.hpp: Add sequenceNumber_/isP2P_ fields and setter to
    ParamCommsDebugInfo. Update RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA
    macros to populate them from the existing seq tuple.
  • util.h: Add kSeqNum constant ("Seq")
  • util.cpp: Emit seq_num in saveNcclMeta() when available
  • test_c10d_nccl_seq_num_trace.py: Automated test verifying seq_num appears
    correctly in chrome trace output

Test Plan: buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace — 2/2 pass
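
The internal test file is not visible on this page, so the following is only a sketch of what a check of this kind might look like, not the actual test. It assumes the exported trace is Chrome-trace JSON, that collective events are named `record_param_comms`, and that the sequence number appears under the `args` key `"Seq"` (inferred from `kSeqNum`). The function name `seq_nums_in_trace`, the `"Collective name"` field in the sample, and the sample data itself are hypothetical.

```python
import json

SEQ_KEY = "Seq"  # assumed trace key, inferred from kSeqNum ("Seq")


def seq_nums_in_trace(trace_json, event_name="record_param_comms"):
    """Return the sequence numbers found on events named `event_name`."""
    trace = json.loads(trace_json)
    found = []
    for event in trace.get("traceEvents", []):
        if event.get("name") != event_name:
            continue
        args = event.get("args", {})
        if SEQ_KEY in args:
            found.append(int(args[SEQ_KEY]))
    return found


# Synthetic stand-in for the contents of an exported Chrome trace file.
sample = json.dumps({"traceEvents": [
    {"name": "record_param_comms",
     "args": {"Collective name": "allreduce", SEQ_KEY: 1}},
    {"name": "record_param_comms",
     "args": {"Collective name": "allreduce", SEQ_KEY: 2}},
    {"name": "aten::mul", "args": {}},
]})

seqs = seq_nums_in_trace(sample)
assert seqs, "expected at least one event carrying a sequence number"
assert all(s >= 0 for s in seqs)
```

In a real run the JSON string would come from a trace file written by the profiler rather than from an inline sample.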

Differential Revision: D96145503

@mdlogic mdlogic requested a review from sraikund16 as a code owner March 11, 2026 17:29
@meta-codesync

meta-codesync bot commented Mar 11, 2026

@mdlogic has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96145503.

@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177148

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cbe0662 with merge base b2a70fa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @mdlogic, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Mar 11, 2026
mdlogic added a commit to mdlogic/pytorch that referenced this pull request Mar 11, 2026
…es (pytorch#177148)

@mdlogic mdlogic force-pushed the export-D96145503 branch 2 times, most recently from 3513bd3 to 72a123c Compare March 11, 2026 17:44
@sanrise sanrise self-requested a review March 11, 2026 19:35
mdlogic added a commit to mdlogic/pytorch that referenced this pull request Mar 11, 2026
…es (pytorch#177148)

Reviewed By: hjli-creator

Differential Revision: D96145503
@scotts scotts requested review from scotts and removed request for sraikund16 March 11, 2026 19:53
@mdlogic mdlogic force-pushed the export-D96145503 branch 2 times, most recently from 7820fcf to dfa30a6 Compare March 11, 2026 21:07
@mdlogic mdlogic force-pushed the export-D96145503 branch 2 times, most recently from 5726e18 to 0b6b4ab Compare March 12, 2026 02:19
mdlogic added a commit to mdlogic/pytorch that referenced this pull request Mar 12, 2026
…es (pytorch#177148)

Summary:

Thread the per-process-group sequence number from ProcessGroupNCCL through
ParamCommsDebugInfo into the Kineto trace JSON output.

This enables cross-rank correlation of collective operations: all ranks
participating in the same collective instance share the same seq_num
within a process group. Without this, there is no way to match collective
events across ranks in production trace data.

Changes:
- ParamCommsUtils.hpp: Add sequenceNumber_/isP2P_ fields and setter to
  ParamCommsDebugInfo. Update RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA
  macros to populate them from the existing seq tuple.
- util.h: Add kSeqNum constant ("Seq")
- util.cpp: Emit seq_num in saveNcclMeta() when available
- test_c10d_nccl_seq_num_trace.py: Automated test verifying seq_num appears
  correctly in chrome trace output

Test Plan:
buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace
```
[marvindz@devvm34681.odn0 /data/repos/fbsource (294b04f486)]$ buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/distributed/fb:test_c10d_nccl_seq_num_trace
File changed: fbcode//caffe2/test/distributed/fb/test_c10d_nccl_seq_num_trace.py
File changed: fbsource//xplat/caffe2/test/distributed/fb/test_c10d_nccl_seq_num_trace.py
Buck UI: https://www.internalfb.com/buck2/1632f643-ed1b-49dc-83fe-dfb8344e6d45
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498759528389
Network: Up: 0B  Down: 3.0KiB  (reSessionID-df2787a7-3567-4f5a-b3d6-0f5143737a74)
Executing actions.   Remaining     0/2
Command: test.     Finished 1 local
Time elapsed: 1:09.4s
Tests finished: Pass 1. Fail 0. Timeout 0. Fatal 0. Skip 0. Omit 0. Infra Failure 0. Build failure 0
[marvindz@devvm34681.odn0 /data/repos/fbsource (a82bbe6994)]$
```

Reviewed By: hjli-creator

Differential Revision: D96145503
mdlogic added a commit to mdlogic/pytorch that referenced this pull request Mar 12, 2026
Bump Kineto submodule from 0035505 to 2b15a60 to include
pytorch/kineto#1296 (seq_num propagation to GPU kernel events
in trace output). This is needed for pytorch#177148
(NCCL sequence number tracing).
This was referenced Mar 12, 2026

@hjli-creator hjli-creator left a comment

This looks good to me. Please merge this after the previous Kineto change is merged.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 13, 2026
pytorchmergebot pushed a commit that referenced this pull request Mar 13, 2026
Bump Kineto submodule from 0035505 to 2b15a60 to include pytorch/kineto#1296 (seq_num propagation to GPU kernel events in trace output).

This is needed so that #177148 (D96145503) can use the new Kineto APIs for NCCL sequence number tracing.

## Included kineto commits
- 2b15a60 Add seq_num propagation to GPU kernel events in Kineto trace output (#1296)
- 350b58f Refactor CuptiActivityProfiler.cpp to use CuptiCbidRegistry (#1297)
- 1f9ceb1 Use HAS_CUPTI_RANGE_PROFILER to avoid range profiler init (#1298)
- ebaac17 Add USDT log type to logger framework (#1285)
- e2e7e97 Revert D94566477: Add NCCL collective sequence number (seq_num) to Kineto profiler traces
- a7c5f4d Add NCCL collective sequence number (seq_num) to Kineto profiler traces (#1294)
Pull Request resolved: #177298
Approved by: https://github.com/sanrise, https://github.com/malfet
@mdlogic
Contributor Author

mdlogic commented Mar 13, 2026

@scotts @sanrise can you help sign off on this one? The internal diff is approved and the kineto submodule PR has been merged now

Contributor

@sanrise sanrise left a comment

LGTM. The supporting Kineto submodule update has already landed in PR #177298; this is the final change in that stack, and it starts recording sequence numbers per collective. This is required to correlate record_param_comms events across traces in a distributed setting where a job produces multiple traces (for example, more than 128 ranks participating in an all-reduce).

The unit tests look good on review.

@facebook-github-tools

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- fb-exported
- Merged
- meta-exported
- release notes: distributed (c10d) (release notes category)

4 participants