Fix for AOTI + CUDAGraphs when calling from Python #148601

Closed
jbschlosser wants to merge 4 commits into gh/jbschlosser/230/base from gh/jbschlosser/230/head
Conversation

@jbschlosser jbschlosser (Contributor) commented Mar 5, 2025

Stack from ghstack (oldest at bottom):

**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:

```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:

* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.

**Future work:**

* Flip the default value to `run_single_threaded=True`, as Python-side inference doesn't take advantage of the AOTI runtime thread pool.
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees, as opposed to manual cuda graph wrapping.
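For illustration, the end-to-end flow described above could look roughly like the sketch below. The toy module is hypothetical and the graph-capture wrapper is a simplified manual version, not the PR's test code; `aoti_compile_and_package` / `aoti_load_package` are existing inductor entry points, and the `run_single_threaded` flag is the one this PR adds (it requires a PyTorch build that includes this change). The sketch skips cleanly when no CUDA device is present.

```python
# Sketch: export -> AOTI package -> load single-threaded -> manual CUDA graph.
import torch

class Toy(torch.nn.Module):  # hypothetical toy module for illustration
    def forward(self, x):
        return x * 2 + 1

def main():
    if not torch.cuda.is_available():
        print("skipped: no CUDA device")
        return
    from torch._inductor import aoti_compile_and_package, aoti_load_package

    model = Toy().cuda()
    x = torch.randn(8, device="cuda")

    ep = torch.export.export(model, (x,))
    pkg_path = aoti_compile_and_package(ep)
    loaded = aoti_load_package(pkg_path, run_single_threaded=True)

    # Warm up on a side stream before capture (standard CUDA graph practice).
    static_x = x.clone()
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            loaded(static_x)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one run; replays reuse the same memory addresses.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = loaded(static_x)

    # Run with new data: copy into the static input, replay, clone the output.
    static_x.copy_(torch.randn(8, device="cuda"))
    g.replay()
    out = static_out.clone() if torch.is_tensor(static_out) else [o.clone() for o in static_out]
    print(type(out))

if __name__ == "__main__":
    main()
```

Without `run_single_threaded=True`, the capture step is where the "operation not permitted when stream is capturing" error surfaces.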

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148601

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8379d55 with merge base c65ee72:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla linux-foundation-easycla bot commented Mar 5, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

TODO: tests. I'm currently verifying this on my real-world Flux use case but I'll distill this down to a nice test afterwards.


[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Mar 5, 2025
ghstack-source-id: 0634463
Pull Request resolved: #148601
@jbschlosser jbschlosser requested a review from desertfire March 6, 2025 19:06
```diff
  auto aoti_model_runer_fn = registered_aoti_runner[device_name];
- return aoti_model_runer_fn(so_path, 1, device_name, "");
+ return aoti_model_runer_fn(
+     so_path, 1, device_name, "", /*run_single_threaded=*/false);
```
Contributor

I think aoti_eager also only needs single thread here. Again, let's do that in a separate PR. cc @EikanWang

@desertfire (Contributor)

Let me know what the new perf looks like.

@jbschlosser (Contributor, Author)

> Let me know what the new perf looks like.

Looks great! I'm no longer seeing a substantial difference between compile vs. export + AOTI + cudagraphs, where previously I was seeing a ~1.4% decrease from the export variant.

Compile: (profiler trace screenshot: compile_perf)

Export + AOTI + manual CUDAGraphs: (profiler trace screenshot: export_perf)

@BoyuanFeng (Contributor)

> this is due to multi-threading logic in AOTIModelContainer / AOTIModel

Does this PR use multithreading with cudagraphs?

@jbschlosser (Contributor, Author)

> Does this PR use multithreading with cudagraphs?

The multi-threading was on the AOTI runtime side and was conflicting with usage of cudagraphs. This PR provides a single-threaded alternative to running a loaded AOTI model so that cudagraphs can be used.

@henrylhtsang (Contributor)

cc @henryhu6

@eellison eellison left a comment (Contributor)

A nice follow-up would be for the AOTI model to compose with cudagraph trees, as opposed to manual cuda graph wrapping. This would add the nice benefits of memory pool sharing, dynamic shape handling, etc., although for only wrapping a single model it's not really essential.

I think it should (pretty much) just work, but there might be some metadata to maintain that we would need to thread to cudagraphify.

@henrylhtsang (Contributor)

Questions:

  1. How do I use this? Just run_single_threaded=True? Do we need to enable CUDA graphs?
  2. Is there a diff version to try out in fbcode?

**TODO**: Test case in progress.

[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: 52f2920
Pull Request resolved: #148601
@jbschlosser
Copy link
Contributor Author

@henrylhtsang

> How do I use this? Just run_single_threaded=True? Do we need to enable CUDA graphs?

I just uploaded a test case in test/inductor/test_aot_inductor.py that should give you an idea of how to use it. The short answer: use torch._inductor.aoti_load_package(..., run_single_threaded=True) and manually capture a CUDAGraph for running the loaded model. The test case includes a simplistic cudagraph() function wrapper that copies inputs into fixed memory locations for the graph and clones outputs on exit.

> Is there a diff version to try out in fbcode?

Not yet, but this will be imported soon - you could try to ghimport it if you want it faster.
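As an aside, the copy-in / replay / clone-out contract that the cudagraph() wrapper enforces can be illustrated without a GPU at all. The following is a pure-Python analogy (the `Replayable` class is invented for illustration and is not part of the PR): a "captured" computation only ever reads fixed input buffers and writes fixed output buffers, so callers must copy inputs in and clone outputs out.

```python
# Pure-Python analogy of the cudagraph() wrapper pattern: a "captured"
# computation works only on fixed buffers, so callers copy inputs into the
# fixed input location and clone outputs before the next replay overwrites them.
class Replayable:
    def __init__(self, fn, example_input):
        self.static_input = list(example_input)      # fixed input buffer
        self.static_output = fn(self.static_input)   # "capture" one run
        self._fn = fn

    def __call__(self, new_input):
        # Copy new data into the fixed input location.
        self.static_input[:] = new_input
        # "Replay": recompute into the fixed output buffer.
        self.static_output = self._fn(self.static_input)
        # Clone the output so a later replay can't clobber it.
        return list(self.static_output)

double = Replayable(lambda xs: [2 * x for x in xs], [0, 0, 0])
a = double([1, 2, 3])
b = double([4, 5, 6])
print(a, b)  # a is unaffected by the second call
```

With a real torch.cuda.CUDAGraph, the "replay" step reruns the recorded kernels on the same memory addresses, which is why the fixed-buffer discipline above is mandatory rather than just a convention.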

@jbschlosser (Contributor, Author)

@desertfire one issue I just realized wrt flipping the default value is that this is not backwards compatible with previously serialized models. This functionality requires re-serializing so that the new interface function AOTInductorModelContainerRunSingleThreaded will be available in the .so.

@jbschlosser jbschlosser requested a review from desertfire March 6, 2025 22:22
@desertfire (Contributor)

> @desertfire one issue I just realized wrt flipping the default value is that this is not backwards compatible with previously serialized models. This functionality requires re-serializing so that the new interface function AOTInductorModelContainerRunSingleThreaded will be available in the .so.

Yes, you are right. I thought it was less of an issue since AOTI+Python is not a major use case, but since there are already use cases like this out there, we shouldn't break them.

As future work, I think we can still flip the default value and check whether the AOTInductorModelContainerRunSingleThreaded symbol exists. If not, we could print a warning instead of crashing.
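The proposed warn-instead-of-crash fallback could be sketched with a dlsym-style probe. The snippet below is illustrative only, not PR code: it uses ctypes on the current process's symbols as a stand-in for a dlopen'ed model .so, and the fallback policy shown is an assumption about how the check might work.

```python
import ctypes
import warnings

def has_symbol(lib: ctypes.CDLL, name: str) -> bool:
    # dlsym-style lookup: ctypes raises AttributeError for missing symbols.
    try:
        getattr(lib, name)
        return True
    except AttributeError:
        return False

# Stand-in for the dlopen'ed AOTI model .so: the current process's symbols.
lib = ctypes.CDLL(None)

if has_symbol(lib, "AOTInductorModelContainerRunSingleThreaded"):
    run_single_threaded = True
else:
    warnings.warn(
        "Model package predates AOTInductorModelContainerRunSingleThreaded; "
        "falling back to the multi-threaded run path."
    )
    run_single_threaded = False

print(run_single_threaded)
```

Older serialized packages would then keep working on the multi-threaded path instead of failing at load time.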

```python
# warmup -> run with CUDAGraphs
for _ in range(3):
    optimized(**model_kwargs)
```
Contributor

Let's add a result comparison with eager.


[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Mar 7, 2025
ghstack-source-id: 864b8d8
Pull Request resolved: #148601
@jbschlosser (Contributor, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 7, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@henrylhtsang (Contributor)

noob question: Any idea if we need to pay attention to stream at all?

@desertfire (Contributor)

> noob question: Any idea if we need to pay attention to stream at all?

Are you asking how to use stream from the Python interface?

@github-actions github-actions bot deleted the gh/jbschlosser/230/head branch April 20, 2025 02:19
@lecoan lecoan (Contributor) commented May 20, 2025

> @desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in AOTIModelContainer / AOTIModel) conflicting with the use of CUDAGraphs.

@desertfire I encountered the identical issue with PyTorch 2.5 and successfully resolved it by updating to the latest PyTorch version and using the run_single_threaded API.

However, I'm still trying to understand the specific "multi-threading logic" in the AOTI runtime that causes this conflict. After examining the codebase, I noticed that the AOTI runtime only implements locks on the available_models_ property but doesn't appear to create any threads itself. From issue #134946, it seems the responsibility for thread creation lies with the caller.

Am I missing something in my understanding of the AOTI runtime's threading model? Could you clarify which specific multi-threading components in the AOTI runtime conflict with CUDAGraphs?

Thank you for your insights.


7 participants