
privateuse1 backend integration with kineto #172154

Closed
divyanshk wants to merge 11 commits into pytorch:main from divyanshk:divyanshk/profiler_privateuse1_reg

Conversation

@divyanshk
Contributor

@divyanshk divyanshk commented Jan 10, 2026

  1. Created privateuse1_profiler.h/.cpp — A registry pattern that allows PrivateUse1 backends to register IActivityProfiler factories via the REGISTER_PRIVATEUSE1_PROFILER(MyProfiler) macro, with a compile-time static_assert ensuring the class inherits from libkineto::IActivityProfiler.
    • This assumes that backends will take a dependency on Kineto in order to use the IActivityProfiler interface. Right now, backends have to check their implementation in to Kineto - so this is both a step up and a safe assumption.
    • As an alternative, PyTorch could define its own abstract interface that mirrors IActivityProfiler, then internally forward to Kineto.
  2. Kineto init paths — Added onKinetoInit() calls in kineto_shim.cpp (user-triggered profiling via prepareTrace()), but not in kineto_client_interface.cpp (daemon mode via global_kineto_init()), with guards to ensure Kineto is initialized before forwarding.

TODO

  1. [Done] Gate this behind a new ProfilerState::KINETO_PRIVATEUSE1 check
  2. [Done] Check how (if at all) kineto build args need to change. Mostly they shouldn't, as for privateuse1 we won't need CUDA/ROCm/XPU, etc.
  3. [Done] How does this break kineto's fbcode setup? Not applicable

@pytorch-bot

pytorch-bot bot commented Jan 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172154

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b0b06d3 with merge base 5f68a4a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@divyanshk divyanshk added release notes: profiler release notes category topic: not user facing topic category labels Jan 10, 2026
@pytorch pytorch deleted a comment from github-actions bot Jan 10, 2026
@scotts
Contributor

scotts commented Jan 12, 2026

@divyanshk, what does this enable that wasn't previously possible? Is it currently the case that privateuse1 backends just don't work at all with torch.profile()?

@divyanshk
Contributor Author

divyanshk commented Jan 12, 2026

@scotts For privateuse1 backends, we provide stubs so that users can use the legacy autograd profiler https://github.com/pytorch/pytorch/blob/main/docs/source/accelerator/profiler.md. If they want to use Kineto, they then have to have a kineto-compatible backend, which requires checking in the profiler implementation in kineto (the AIU backend follows the latter route). What I am hoping to enable is for backend users to use the latter route, without checking in a lot of code in Kineto. That is the entire philosophy behind Privateuse1 - we just provide extension points in pytorch core, users have their implementations in their own repo without us having to maintain it.

@sraikund16
Contributor

> What I am hoping to enable is for backend users to use the latter route, without checking in a lot of code in Kineto.

How will users export a chrome trace without kineto? Is this impl for using the FunctionEvent frontend for now?

@ppnaik1890

ppnaik1890 commented Jan 22, 2026

Hi @divyanshk
So to confirm our understanding, we register our backend using REGISTER_PRIVATEUSE1_PROFILER and enable integration with IActivityProfiler. The registration and integration code for the backend needs to reside in an outside repo?
For example, this plugin will be moved out of the repo?

@divyanshk
Contributor Author

@ppnaik1890 Yes, that is correct. How does that sound?

To be clear, here "outside repo" would be the AIU / IBM Pytorch backend implementation. I don't know if that exists right now (code pointer?)

@raghukiran1224
Collaborator

> To be clear, here "outside repo" would be the AIU / IBM Pytorch backend implementation. I don't know if that exists right now (code pointer?)

@divyanshk We can move that to our current dev org, torch-spyre, if the pattern is being followed for out-of-tree accelerators.

@divyanshk divyanshk marked this pull request as ready for review March 5, 2026 17:55
// libkineto::IActivityProfiler.
struct RegisterPrivateUse1Profiler {
template <typename ProfilerClass>
explicit RegisterPrivateUse1Profiler(ProfilerClass*) {
Contributor

Do we need to take a parameter? We don't actually use it. And when we instantiate this in the macro, we're passing nullptr. Can't we simplify this by just making this an empty constructor?

Contributor Author

I am using ProfilerClass for the compile-time type assertion below.

Contributor

@scotts scotts Mar 9, 2026

Yes, but that's available as a symbol because of template <typename ProfilerClass>. I don't think you need a dummy parameter to reference it. It also may be a bit more idiomatic to make it a template parameter to the class itself rather than the constructor, but I don't think it makes any actual functionality difference in our case.

Contributor Author

Ahh yes, that's nice! Thank you. Templating the struct is much cleaner than templating its constructor.

On the templated constructor (what I had earlier), out of curiosity, were you thinking of something like this

struct Foo {
  template <typename T>
  Foo() { /* use T somehow */ }
};

but this isn't valid?

Contributor

I think that's valid? Did you try it and get a compilation error? It's not a requirement that a template parameter is the type of an actual parameter.

// Forward registered PrivateUse1 profiler factory to Kineto.
// Only for KINETO_PRIVATEUSE1 state where backend provides its own
// IActivityProfiler.
#ifdef USE_KINETO
Contributor

We called torch::profiler::impl::kineto::prepareTrace() above without a USE_KINETO check. Do we need to do that here? I think we should figure out the absolute minimum we need to do such checks.

Contributor Author

yeah - because torch::profiler::impl::kineto::prepareTrace() is in kineto_shim.cpp, which can operate with USE_KINETO not set because it has an ifdef guard around almost every function - sad, I know :-( check out this comment https://github.com/pytorch/pytorch/blob/main/torch/csrc/profiler/kineto_shim.cpp#L16

We might be able to get rid of kineto_shim altogether but that is a separate conversation.

Contributor Author

I can get rid of this because the condition config.state == ProfilerState::KINETO_PRIVATEUSE1 is true only when Kineto is present, but that check happens at runtime. For compile-time correctness we would have to keep this USE_KINETO guard around PrivateUse1ProfilerRegistry.

Contributor

// Here lies pain and #ifdef USE_KINETO

Got an actual laugh-out-loud from me. :) Yeah, let's deal with cleaning this up later as its own thing.

@scotts
Contributor

scotts commented Mar 5, 2026

This is great! I think this will greatly improve our profiler integrations. I have a bunch of small comments about the code itself, but we also need some tests. At the least, I think we need some C++ tests which mock up a trivial "external" profiler. If we can also somehow get that working with the Python side in tests as well, that would be great.

@divyanshk divyanshk force-pushed the divyanshk/profiler_privateuse1_reg branch from 4955312 to 8e4b002 Compare March 7, 2026 03:02
@divyanshk divyanshk force-pushed the divyanshk/profiler_privateuse1_reg branch 2 times, most recently from f97a00a to 9aef685 Compare March 9, 2026 17:52
@meta-codesync

meta-codesync bot commented Mar 9, 2026

@divyanshk has imported this pull request. If you are a Meta employee, you can view this in D95825766.

test_custom_script_ops
test_custom_backend
test_torch_function_benchmark
test_libtorch_profiler
Contributor Author

For dev infra folks, I am including the profiler test as part of default shard 2 - the test is tiny. Here is the log:

+ /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/test_privateuse1_profiler --gtest_filter=PrivateUse1ProfilerTest.EndToEndProfiling
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = PrivateUse1ProfilerTest.EndToEndProfiling-*_CUDA:*_MultiCUDA
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from PrivateUse1ProfilerTest
[ RUN      ] PrivateUse1ProfilerTest.EndToEndProfiling
[       OK ] PrivateUse1ProfilerTest.EndToEndProfiling (0 ms)
[----------] 1 test from PrivateUse1ProfilerTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.
+ /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/test_privateuse1_profiler --gtest_filter=-PrivateUse1ProfilerTest.EndToEndProfiling
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = -PrivateUse1ProfilerTest.EndToEndProfiling:*_CUDA:*_MultiCUDA
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from PrivateUse1ProfilerTest
[ RUN      ] PrivateUse1ProfilerTest.RegistrySingleton
[       OK ] PrivateUse1ProfilerTest.RegistrySingleton (0 ms)
[ RUN      ] PrivateUse1ProfilerTest.RegisterFactory
[       OK ] PrivateUse1ProfilerTest.RegisterFactory (0 ms)
[ RUN      ] PrivateUse1ProfilerTest.OnKinetoInitForwarding
[       OK ] PrivateUse1ProfilerTest.OnKinetoInitForwarding (0 ms)
[----------] 3 tests from PrivateUse1ProfilerTest (0 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (0 ms total)
[  PASSED  ] 3 tests.

I'm running two separate processes because one of the tests (EndToEndProfiling) needs to run on its own - it does an e2e test, and running the other smaller unit tests with it can pollute it.

@divyanshk
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 10, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx950.1)

Details for Dev Infra team Raised by workflow job

@divyanshk divyanshk added the ciflow/rocm Trigger "default" config CI on ROCm label Mar 11, 2026
@pytorch pytorch deleted a comment from pytorch-bot bot Mar 11, 2026
@divyanshk divyanshk added ciflow/rocm-mi355 Trigger "default" config CI on ROCm MI355 runners and removed ciflow/rocm Trigger "default" config CI on ROCm labels Mar 11, 2026
@divyanshk
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

kaoutar55 added a commit to torch-spyre/torch-spyre that referenced this pull request Mar 16, 2026
Propose a profiling toolkit for Spyre covering the full stack:
PyTorch Profiler integration via REGISTER_PRIVATEUSE1_PROFILER
(pytorch/pytorch#172154), Spyre SMI, IR instrumentation-based
fine-grained profiler, memory profiling (DDR + scratchpad),
Inductor provenance tracking, HTA integration, FFDC, and
multi-card/energy profiling. Covers both the current
OpSpec/SuperDSC pipeline and the future KTIR transition (RFC 0682).

Tracking issue: #1048

Co-Authored-By: @ppnaik1890 and @flop1971
Signed-off-by: kaoutar55 <kaoutar.elmaghraoui@gmail.com>
kaoutar55 added a commit to kaoutar55/RFCs that referenced this pull request Mar 18, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#172154
Approved by: https://github.com/scotts
Labels

ciflow/rocm-mi355 Trigger "default" config CI on ROCm MI355 runners ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: profiler release notes category topic: not user facing topic category
