Fix for AOTI + CUDAGraphs when calling from Python#148601
jbschlosser wants to merge 4 commits into `gh/jbschlosser/230/base`
Conversation
[ghstack-poisoned]
🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/148601
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 unrelated failure) As of commit 8379d55 with merge base c65ee72. One job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
TODO: tests. I'm currently verifying this on my real-world Flux use case but I'll distill this down to a nice test afterwards. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
```diff
 auto aoti_model_runer_fn = registered_aoti_runner[device_name];
-return aoti_model_runer_fn(so_path, 1, device_name, "");
+return aoti_model_runer_fn(
+    so_path, 1, device_name, "", /*run_single_threaded=*/false);
```
I think aoti_eager also only needs a single thread here. Again, let's do that in a separate PR. cc @EikanWang
Let me know what the new perf looks like.
Does this PR use multithreading with CUDAGraphs?
The multi-threading was on the AOTI runtime side and was conflicting with usage of CUDAGraphs. This PR provides a single-threaded alternative for running a loaded AOTI model so that CUDAGraphs can be used.
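The manual wrapping this enables can be sketched roughly as follows. This is a hedged sketch, not code from the PR: `cudagraph_wrap`, the warmup count, and the positional-args convention are illustrative, and a CUDA device plus static input shapes are assumed.

```python
import torch

def cudagraph_wrap(fn, *example_args):
    """Capture one call to `fn` into a CUDA graph; return a replay callable.

    Sketch only: assumes `fn` is a loaded AOTI model running on the
    single-threaded path (so stream capture is legal) with static shapes.
    """
    # Warm up on a side stream so lazily-initialized state and allocator
    # behavior settle before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            fn(*example_args)
    torch.cuda.current_stream().wait_stream(s)

    # Capture a single invocation over static input buffers.
    static_args = [a.clone() for a in example_args]
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = fn(*static_args)

    def replay(*args):
        # Copy fresh inputs into the captured buffers, then replay the graph.
        for dst, src in zip(static_args, args):
            dst.copy_(src)
        graph.replay()
        return static_out

    return replay
```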
cc @henryhu6
eellison
left a comment
A nice follow-up would be for the AOTI model to compose with CUDAGraph trees, as opposed to manual CUDA graph wrapping. This would add the benefits of memory pool sharing, dynamic shape handling, etc., although for wrapping only a single model it's not really essential.
I think it should (pretty much) just work, but there might be some metadata to maintain that we would need to thread through to cudagraphify.
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.
When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```
desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.
**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
* This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
* `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
* Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.
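As a rough mental model of why the two paths differ, here is a hypothetical Python stand-in (not the actual C++ runtime code): `run()` acquires a model slot guarded by a condition variable, and it is this cross-thread synchronization that conflicts with stream capture; `run_single_threaded()` simply skips it.

```python
import threading

class Container:
    """Toy stand-in for AOTIModelContainer's two run paths."""

    def __init__(self, num_models=1):
        self._cv = threading.Condition()
        self._available = num_models

    def _run_impl(self, x):
        return x + 1  # stand-in for the compiled model's forward

    def run(self, x):
        # Multi-threaded path: block until a model slot is free.
        with self._cv:
            self._cv.wait_for(lambda: self._available > 0)
            self._available -= 1
        try:
            return self._run_impl(x)
        finally:
            with self._cv:
                self._available += 1
                self._cv.notify()

    def run_single_threaded(self, x):
        # No synchronization: the caller guarantees exclusive access,
        # which is exactly what manual CUDA graph capture requires.
        return self._run_impl(x)
```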
I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
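A minimal usage sketch of the new flag (the `.pt2` path argument and the helper name are placeholders; only `load_package(..., run_single_threaded=...)` comes from this PR):

```python
import torch

def load_for_capture(pt2_path):
    # run_single_threaded=True selects the new synchronization-free run
    # path, so subsequent CUDA graph capture of calls is legal.
    return torch._inductor.package.load_package(
        pt2_path, run_single_threaded=True
    )
```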
**TODO**: Test case in progress.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov
[ghstack-poisoned]
I just uploaded a test case in
not yet but soon this will be imported - you could try to
@desertfire one issue I just realized wrt flipping the default value is that this is not backwards compatible with previously serialized models. This functionality requires re-serializing so that the new interface function
Yes, you are right. I thought it was less of an issue since AOTI+Python is not a major use case, but since there are already use cases like this out there, we shouldn't break them. As future work, I think we can still flip the default value, and check if
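The symbol check suggested here could look something like the following. This is a sketch using `ctypes` rather than the runtime's actual loader; only the symbol name comes from the PR.

```python
import ctypes
import warnings

_SYMBOL = "AOTInductorModelContainerRunSingleThreaded"

def supports_single_threaded(lib):
    """Return True iff the loaded library (a ctypes.CDLL) exports the new
    single-threaded entry point; warn rather than crash if it doesn't."""
    if not hasattr(lib, _SYMBOL):
        warnings.warn(
            f"{_SYMBOL} not found; the .so predates this interface -- "
            "falling back to the multi-threaded run path"
        )
        return False
    return True
```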
```python
# warmup -> run with CUDAGraphs
for _ in range(3):
    optimized(**model_kwargs)
```
Let's add a result comparison with eager.
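A hedged sketch of such a comparison: here `optimized` is a stand-in identity wrapper, since the real AOTI artifact isn't available in a snippet; in the actual test it would be the loaded, graph-captured model.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 8)
model_kwargs = {"input": torch.randn(4, 8)}

# Placeholder for the loaded + CUDA-graph-wrapped model; in the real test
# this would be the AOTI artifact run under graph replay.
optimized = lambda **kw: model(kw["input"])

with torch.no_grad():
    expected = model(model_kwargs["input"])  # eager reference
    actual = optimized(**model_kwargs)       # optimized path
torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)
```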
**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
* There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov
[ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Noob question: any idea if we need to pay attention to streams at all?
Are you asking how to use streams from the Python interface?
@desertfire I encountered the identical issue with PyTorch 2.5 and successfully resolved it by updating to the latest PyTorch version and using
However, I'm still trying to understand the specific "multi-threading logic" in the AOTI runtime that causes this conflict. After examining the codebase, I noticed that the AOTI runtime only implements locks on the
Am I missing something in my understanding of the AOTI runtime's threading model? Could you clarify which specific multi-threading components in the AOTI runtime conflict with CUDAGraphs? Thank you for your insights.