privateuse1 backend integration with kineto #172154
divyanshk wants to merge 11 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172154
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit b0b06d3 with merge base 5f68a4a. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@divyanshk, what does this enable that wasn't previously possible? Is it currently the case that privateuse1 backends just don't work at all with Kineto?
@scotts For privateuse1 backends, we provide stubs so that users can use the legacy autograd profiler: https://github.com/pytorch/pytorch/blob/main/docs/source/accelerator/profiler.md. If they want to use Kineto, they have to have a Kineto-compatible backend, which requires checking the profiler implementation in to Kineto (the AIU backend follows that latter route). What I am hoping to enable is for backend users to take the latter route without checking a lot of code in to Kineto. That is the entire philosophy behind PrivateUse1: we just provide extension points in PyTorch core, and users keep their implementations in their own repo without us having to maintain them.
How will users export a Chrome trace without Kineto? Is this implementation for using the FunctionEvent frontend for now?
Hi @divyanshk
@ppnaik1890 Yes, that is correct. How does that sound? To be clear, here "outside repo" would be the AIU / IBM PyTorch backend implementation. I don't know if that exists right now (code pointer?)
@divyanshk We can host that in our current dev org, torch-spyre, if this is the pattern being followed for out-of-tree accelerators.
// libkineto::IActivityProfiler.
struct RegisterPrivateUse1Profiler {
  template <typename ProfilerClass>
  explicit RegisterPrivateUse1Profiler(ProfilerClass*) {
Do we need to take a parameter? We don't actually use it. And when we instantiate this in the macro, we're passing nullptr. Can't we simplify this by just making this an empty constructor?
I am using ProfilerClass for the compile-time type assertion below.
Yes, but that's available as a symbol because of template <typename ProfilerClass>. I don't think you need a dummy parameter to reference it. It also may be a bit more idiomatic to make it a template parameter to the class itself rather than the constructor, but I don't think it makes any actual functionality difference in our case.
Ahh yes, that's nice! Thank you. Templating the struct is much cleaner than templating its constructor.
On the templated constructor (what I had earlier), out of curiosity, were you thinking of something like this
struct Foo {
template <typename T>
Foo() { /* use T somehow */ }
};
but this isn't valid?
I think that's valid? Did you try it and get a compilation error? It's not a requirement that a template parameter is the type of an actual parameter.
// Forward registered PrivateUse1 profiler factory to Kineto.
// Only for KINETO_PRIVATEUSE1 state where backend provides its own
// IActivityProfiler.
#ifdef USE_KINETO
We called torch::profiler::impl::kineto::prepareTrace() above without a USE_KINETO check. Do we need to do that here? I think we should figure out the absolute minimum we need to do such checks.
Yeah, because torch::profiler::impl::kineto::prepareTrace() lives in kineto_shim.cpp, which can operate with USE_KINETO not set: almost every function there has an #ifdef guard. Sad, I know :-( Check out this comment: https://github.com/pytorch/pytorch/blob/main/torch/csrc/profiler/kineto_shim.cpp#L16
We might be able to get rid of kineto_shim altogether but that is a separate conversation.
I can get rid of this because the condition config.state == ProfilerState::KINETO_PRIVATEUSE1 is true only when Kineto is present, but that check happens at runtime. For compile-time correctness we would have to keep this USE_KINETO guard around PrivateUse1ProfilerRegistry.
// Here lies pain and #ifdef USE_KINETO
Got an actual laugh-out-loud from me. :) Yeah, let's deal with cleaning this up later as its own thing.
This is great! I think this will greatly improve our profiler integrations. I have a bunch of small comments about the code itself, but we also need some tests. At the least, I think we need some C++ tests which mock up a trivial "external" profiler. If we can also somehow get that working with the Python side in tests as well, that would be great.
@divyanshk has imported this pull request. If you are a Meta employee, you can view this in D95825766.
test_custom_script_ops
test_custom_backend
test_torch_function_benchmark
test_libtorch_profiler
For dev infra folks: I am including the profiler test as part of default shard 2; the test is tiny. Here is the log:
+ /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/test_privateuse1_profiler --gtest_filter=PrivateUse1ProfilerTest.EndToEndProfiling
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = PrivateUse1ProfilerTest.EndToEndProfiling-*_CUDA:*_MultiCUDA
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from PrivateUse1ProfilerTest
[ RUN ] PrivateUse1ProfilerTest.EndToEndProfiling
[ OK ] PrivateUse1ProfilerTest.EndToEndProfiling (0 ms)
[----------] 1 test from PrivateUse1ProfilerTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[ PASSED ] 1 test.
+ /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/test_privateuse1_profiler --gtest_filter=-PrivateUse1ProfilerTest.EndToEndProfiling
CUDA not available. Disabling CUDA and MultiCUDA tests
Note: Google Test filter = -PrivateUse1ProfilerTest.EndToEndProfiling:*_CUDA:*_MultiCUDA
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from PrivateUse1ProfilerTest
[ RUN ] PrivateUse1ProfilerTest.RegistrySingleton
[ OK ] PrivateUse1ProfilerTest.RegistrySingleton (0 ms)
[ RUN ] PrivateUse1ProfilerTest.RegisterFactory
[ OK ] PrivateUse1ProfilerTest.RegisterFactory (0 ms)
[ RUN ] PrivateUse1ProfilerTest.OnKinetoInitForwarding
[ OK ] PrivateUse1ProfilerTest.OnKinetoInitForwarding (0 ms)
[----------] 3 tests from PrivateUse1ProfilerTest (0 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (0 ms total)
[ PASSED ] 3 tests.
Including two separate processes because one of the tests (EndToEndProfiling) needs to run on its own: it does an end-to-end test, and having the other smaller unit tests run alongside it can pollute it.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx950.1). Details for Dev Infra team: raised by workflow job.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Propose a profiling toolkit for Spyre covering the full stack: PyTorch Profiler integration via REGISTER_PRIVATEUSE1_PROFILER (pytorch/pytorch#172154), Spyre SMI, IR instrumentation-based fine-grained profiler, memory profiling (DDR + scratchpad), Inductor provenance tracking, HTA integration, FFDC, and multi-card/energy profiling. Covers both the current OpSpec/SuperDSC pipeline and the future KTIR transition (RFC 0682). Tracking issue: #1048 Co-Authored-By: @ppnaik1890 and @flop1971 Signed-off-by: kaoutar55 <kaoutar.elmaghraoui@gmail.com>
1. Created privateuse1_profiler.h/.cpp — A registry pattern that allows PrivateUse1 backends to register IActivityProfiler factories via REGISTER_PRIVATEUSE1_PROFILER(MyProfiler) macro, with compile-time static_assert ensuring the class inherits from libkineto::IActivityProfiler.
* This makes the assumption that backends will take a dependency on Kineto to use IActivityProfiler interface. Right now the backends have to check in their implementation to Kineto - so this might be a step up and a safe assumption.
* As an alternative, PyTorch could define its own abstract interface that mirrors IActivityProfiler, then internally forward to Kineto.
2. Kineto init paths — Added onKinetoInit() calls in kineto_shim.cpp (user-triggered profiling via prepareTrace()), but _not_ for kineto_client_interface.cpp (daemon mode via global_kineto_init()), with guards to ensure Kineto is initialized before forwarding.
TODO
1. [Done] Gate this behind a new ProfilerState::KINETO_PRIVATEUSE1 check
2. [Done] Check how (if at all) Kineto build args need to change. Mostly they shouldn't, as for privateuse1 we won't need CUDA/ROCm/XPU etc.
3. [Done] How does this break kineto's fbcode setup? Not applicable
Pull Request resolved: pytorch#172154
Approved by: https://github.com/scotts