Fix for AOTI + CUDAGraphs when calling from Python #148601

Closed
jbschlosser wants to merge 4 commits into gh/jbschlosser/230/base from gh/jbschlosser/230/head
Conversation

@jbschlosser jbschlosser (Contributor) commented Mar 5, 2025

Stack from ghstack (oldest at bottom):

**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:

```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:

* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.

**Future work:**

* Flip the default value to `run_single_threaded=True`, as Python-side inference doesn't take advantage of the AOTI runtime thread pool.
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees, as opposed to manual cuda graph wrapping.
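For illustration, the end-to-end flow described above could look roughly like the sketch below. The toy module is hypothetical and the graph-capture wrapper is a simplified manual version, not the PR's test code; `aoti_compile_and_package` / `aoti_load_package` are existing inductor entry points, and the `run_single_threaded` flag is the one this PR adds (it requires a PyTorch build that includes this change). The sketch skips cleanly when no CUDA device is present.

```python
# Sketch: export -> AOTI package -> load single-threaded -> manual CUDA graph.
import torch

class Toy(torch.nn.Module):  # hypothetical toy module for illustration
    def forward(self, x):
        return x * 2 + 1

def main():
    if not torch.cuda.is_available():
        print("skipped: no CUDA device")
        return
    from torch._inductor import aoti_compile_and_package, aoti_load_package

    model = Toy().cuda()
    x = torch.randn(8, device="cuda")

    ep = torch.export.export(model, (x,))
    pkg_path = aoti_compile_and_package(ep)
    loaded = aoti_load_package(pkg_path, run_single_threaded=True)

    # Warm up on a side stream before capture (standard CUDA graph practice).
    static_x = x.clone()
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            loaded(static_x)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one run; replays reuse the same memory addresses.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = loaded(static_x)

    # Run with new data: copy into the static input, replay, clone the output.
    static_x.copy_(torch.randn(8, device="cuda"))
    g.replay()
    out = static_out.clone() if torch.is_tensor(static_out) else [o.clone() for o in static_out]
    print(type(out))

if __name__ == "__main__":
    main()
```

Without `run_single_threaded=True`, the capture step is where the "operation not permitted when stream is capturing" error surfaces.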

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148601

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8379d55 with merge base c65ee72:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla linux-foundation-easycla bot commented Mar 5, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

TODO: tests. I'm currently verifying this on my real-world Flux use case but I'll distill this down to a nice test afterwards.


[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Mar 5, 2025
ghstack-source-id: 0634463
Pull Request resolved: #148601
@jbschlosser jbschlosser requested a review from desertfire March 6, 2025 19:06
```diff
  auto aoti_model_runer_fn = registered_aoti_runner[device_name];
- return aoti_model_runer_fn(so_path, 1, device_name, "");
+ return aoti_model_runer_fn(
+     so_path, 1, device_name, "", /*run_single_threaded=*/false);
```
Contributor

I think aoti_eager also only needs single thread here. Again, let's do that in a separate PR. cc @EikanWang

@desertfire (Contributor)

Let me know what the new perf looks like.

@jbschlosser (Contributor, Author)

> Let me know what the new perf looks like.

Looks great! I'm no longer seeing a substantial difference between compile vs. export + AOTI + cudagraphs, where previously I was seeing a ~1.4% decrease from the export variant.

Compile: (profiler trace screenshot: compile_perf)

Export + AOTI + manual CUDAGraphs: (profiler trace screenshot: export_perf)

@BoyuanFeng (Contributor)

> this is due to multi-threading logic in AOTIModelContainer / AOTIModel

Does this PR use multithreading with cudagraphs?

@jbschlosser (Contributor, Author)

> Does this PR use multithreading with cudagraphs?

The multi-threading was on the AOTI runtime side and was conflicting with usage of cudagraphs. This PR provides a single-threaded alternative to running a loaded AOTI model so that cudagraphs can be used.

@henrylhtsang (Contributor)

cc @henryhu6

@eellison eellison left a comment (Contributor)

A nice follow-up would be for the AOTI model to compose with cudagraph trees, as opposed to manual cuda graph wrapping. This would add the nice benefits of memory pool sharing, dynamic shape handling, etc., although for only wrapping a single model it's not really essential.

I think it should (pretty much) just work, but there might be some metadata to maintain that we would need to thread to cudagraphify.

@henrylhtsang (Contributor)

Questions:

  1. How do I use this? Just run_single_threaded=True? Do we need to enable CUDA graphs?
  2. Is there a diff version to try out in fbcode?

**TODO**: Test case in progress.

[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: 52f2920
Pull Request resolved: #148601
@jbschlosser
Copy link
Contributor Author

@henrylhtsang

> How do I use this? Just run_single_threaded=True? Do we need to enable CUDA graphs?

I just uploaded a test case in test/inductor/test_aot_inductor.py that should give you an idea of how to use it. The short answer: use torch._inductor.aoti_load_package(..., run_single_threaded=True) and manually capture a CUDAGraph for running the loaded model. The test case includes a simplistic cudagraph() function wrapper that copies inputs into fixed memory locations for the graph and clones outputs on exit.

> Is there a diff version to try out in fbcode?

Not yet, but this will be imported soon - you could try to ghimport it if you want it faster.
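As an aside, the copy-in / replay / clone-out contract that the cudagraph() wrapper enforces can be illustrated without a GPU at all. The following is a pure-Python analogy (the `Replayable` class is invented for illustration and is not part of the PR): a "captured" computation only ever reads fixed input buffers and writes fixed output buffers, so callers must copy inputs in and clone outputs out.

```python
# Pure-Python analogy of the cudagraph() wrapper pattern: a "captured"
# computation works only on fixed buffers, so callers copy inputs into the
# fixed input location and clone outputs before the next replay overwrites them.
class Replayable:
    def __init__(self, fn, example_input):
        self.static_input = list(example_input)      # fixed input buffer
        self.static_output = fn(self.static_input)   # "capture" one run
        self._fn = fn

    def __call__(self, new_input):
        # Copy new data into the fixed input location.
        self.static_input[:] = new_input
        # "Replay": recompute into the fixed output buffer.
        self.static_output = self._fn(self.static_input)
        # Clone the output so a later replay can't clobber it.
        return list(self.static_output)

double = Replayable(lambda xs: [2 * x for x in xs], [0, 0, 0])
a = double([1, 2, 3])
b = double([4, 5, 6])
print(a, b)  # a is unaffected by the second call
```

With a real torch.cuda.CUDAGraph, the "replay" step reruns the recorded kernels on the same memory addresses, which is why the fixed-buffer discipline above is mandatory rather than just a convention.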

@jbschlosser (Contributor, Author)

@desertfire one issue I just realized wrt flipping the default value is that this is not backwards compatible with previously serialized models. This functionality requires re-serializing so that the new interface function AOTInductorModelContainerRunSingleThreaded will be available in the .so.

@jbschlosser jbschlosser requested a review from desertfire March 6, 2025 22:22
@desertfire (Contributor)

> @desertfire one issue I just realized wrt flipping the default value is that this is not backwards compatible with previously serialized models. This functionality requires re-serializing so that the new interface function AOTInductorModelContainerRunSingleThreaded will be available in the .so.

Yes, you are right. I thought it was less of an issue since AOTI+Python is not a major use case, but since there are already use cases like this out there, we shouldn't break them.

As future work, I think we can still flip the default value and check whether the AOTInductorModelContainerRunSingleThreaded symbol exists. If not, we could print a warning instead of crashing.
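The proposed warn-instead-of-crash fallback could be sketched with a dlsym-style probe. The snippet below is illustrative only, not PR code: it uses ctypes on the current process's symbols as a stand-in for a dlopen'ed model .so, and the fallback policy shown is an assumption about how the check might work.

```python
import ctypes
import warnings

def has_symbol(lib: ctypes.CDLL, name: str) -> bool:
    # dlsym-style lookup: ctypes raises AttributeError for missing symbols.
    try:
        getattr(lib, name)
        return True
    except AttributeError:
        return False

# Stand-in for the dlopen'ed AOTI model .so: the current process's symbols.
lib = ctypes.CDLL(None)

if has_symbol(lib, "AOTInductorModelContainerRunSingleThreaded"):
    run_single_threaded = True
else:
    warnings.warn(
        "Model package predates AOTInductorModelContainerRunSingleThreaded; "
        "falling back to the multi-threaded run path."
    )
    run_single_threaded = False

print(run_single_threaded)
```

Older serialized packages would then keep working on the multi-threaded path instead of failing at load time.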

```python
# warmup -> run with CUDAGraphs
for _ in range(3):
    optimized(**model_kwargs)
```
Contributor

Let's add a result comparison with eager.


[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Mar 7, 2025
ghstack-source-id: 864b8d8
Pull Request resolved: #148601
@jbschlosser (Contributor, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 7, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@henrylhtsang (Contributor)

noob question: Any idea if we need to pay attention to stream at all?

@desertfire (Contributor)

> noob question: Any idea if we need to pay attention to stream at all?

Are you asking how to use stream from the Python interface?

@github-actions github-actions bot deleted the gh/jbschlosser/230/head branch April 20, 2025 02:19
@lecoan lecoan (Contributor) commented May 20, 2025

> @desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in AOTIModelContainer / AOTIModel) conflicting with the use of CUDAGraphs.

@desertfire I encountered the identical issue with PyTorch 2.5 and successfully resolved it by updating to the latest PyTorch version and using the run_single_threaded API.

However, I'm still trying to understand the specific "multi-threading logic" in the AOTI runtime that causes this conflict. After examining the codebase, I noticed that the AOTI runtime only implements locks on the available_models_ property but doesn't appear to create any threads itself. From issue #134946, it seems the responsibility for thread creation lies with the caller.

Am I missing something in my understanding of the AOTI runtime's threading model? Could you clarify which specific multi-threading components in the AOTI runtime conflict with CUDAGraphs?

Thank you for your insights.


7 participants