
FIX: Avoid CUDA Graph re-record when hotswapping LoRAs. #2611

Merged
BenjaminBossan merged 2 commits into huggingface:main from yemreck:fix-cudagraph-recompiles on Jun 27, 2025

Conversation

@yemreck (Contributor) commented Jun 25, 2025

Motivation:

In some cases, the recorded CUDA Graph gets silently invalidated due to data pointer changes; unless TORCH_LOGS="perf_hints" is set, there is no warning.

Torch log:

static input data pointer changed.
...
input name: arg17_1. data pointer changed from x to y. input stack trace:
...
  File "/opt/conda/lib/python3.11/site-packages/diffusers/models/resnet.py", line 346, in forward
    temb = self.time_emb_proj(temb)[:, :, None, None]
  File "/opt/conda/lib/python3.11/site-packages/peft/tuners/lora/layer.py", line 727, in forward
    result = result + lora_B(lora_A(dropout(x))) * scaling
...

This leads to an unexpected CUDA Graph re-record.

We can avoid this by copying the LoRA weights in place without replacing the underlying tensor itself. I didn't see any performance regression with the in-place copy.
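
To illustrate the idea, here is a minimal sketch of the in-place copy (a hypothetical helper, not the actual PEFT hotswap code):

```python
import torch

@torch.no_grad()
def swap_lora_weight(param: torch.nn.Parameter, new_weight: torch.Tensor) -> None:
    # Re-wrapping the tensor would allocate new storage and change the data
    # pointer captured by the recorded CUDA Graph, forcing a re-record:
    #   param = torch.nn.Parameter(new_weight)
    # Copying in place keeps the same storage, so the recorded graph stays valid.
    param.data.copy_(new_weight)
```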

@yemreck (Contributor, Author) commented Jun 25, 2025

TODO: the tests don't cover this case yet; I didn't have time to make an isolated repro.

@BenjaminBossan (Member) left a comment

Thanks for this PR. I'm not an expert in all the intricacies of torch.compile, so this is very welcome. If you have an example of where this helps, please share it and we can work on a unit test.

Comment thread on tests/test_gpu_examples.py
@yemreck (Contributor, Author) commented Jun 26, 2025

It should help with all re-record cases related to weight changes, but even the current tests show some odd re-records with this PR (CUDA Graphs are finicky to get working properly). If you profile the test case with torch.profiler, you should see "CUDAGraphs.record_function" only once, but we currently see it twice.
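
Roughly how I checked this with the profiler (a sketch; `model` and `inputs` stand in for the compiled test model and its example batch):

```python
import torch
from torch.profiler import profile, ProfilerActivity

compiled = torch.compile(model, mode="reduce-overhead")  # reduce-overhead enables CUDA Graphs

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        compiled(inputs)
    torch.cuda.synchronize()

# After warmup the graph should be recorded once and replayed afterwards, so
# the record event should appear a single time in the trace.
records = [e for e in prof.events() if "cudagraph" in e.name.lower() and "record" in e.name.lower()]
print(len(records))
```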

My case is in company code, so I'll update the PR once I manage to get a minimal working repro.

@BenjaminBossan (Member) commented

> My case is in company code, so I'll update the PR once I manage to get a minimal working repro.

Thanks for the effort.

@sayakpaul (Member) commented
Thanks very much for the PR.

> If you profile the test case with torch.profiler, you should see "CUDAGraphs.record_function" only once, but we currently see it twice.

You mean that with this PR, we see it only once? 👁️ The PR looks good to me. We should merge after a unit test.

@anijain2305 if you have comments.

@anijain2305 commented

@zou3519 on improving the UX for CUDA Graphs.

@yemreck (Contributor, Author) commented Jun 27, 2025

Turns out we just needed a couple of configs to raise an error on static input changes. I also updated the tests to include a third forward pass, since the first forward pass always runs eagerly as a warmup.
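
For reference, a minimal sketch of the call pattern the updated test exercises (stand-in model and weights, not the actual test code; the configs that turn the silent re-record into a hard error are part of the test diff and omitted here):

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in the real test this is a compiled PEFT/LoRA model.
model = nn.Linear(16, 16).cuda()
inputs = torch.randn(4, 16, device="cuda")

compiled = torch.compile(model, mode="reduce-overhead")  # CUDA Graphs via inductor

compiled(inputs)  # 1st call: eager warmup
compiled(inputs)  # 2nd call: the CUDA Graph is recorded

# Stand-in for the hotswap: copy new values into the existing weight in place
# so the data pointer captured by the graph does not change.
with torch.no_grad():
    model.weight.copy_(torch.randn_like(model.weight))

compiled(inputs)  # 3rd call: should replay the recorded graph, not re-record
```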

Current main:

================================================ short test summary info ================================================
FAILED tests/test_gpu_examples.py::TestHotSwapping::test_hotswapping_compiled_model_does_not_trigger_recompilation[ranks0] - RuntimeError: static input data pointer changed.
FAILED tests/test_gpu_examples.py::TestHotSwapping::test_hotswapping_compiled_model_does_not_trigger_recompilation[ranks2] - RuntimeError: static input data pointer changed.

This PR:

tests/test_gpu_examples.py::TestHotSwapping::test_hotswapping_compiled_model_does_not_trigger_recompilation[ranks0] PASSED [ 33%]
tests/test_gpu_examples.py::TestHotSwapping::test_hotswapping_compiled_model_does_not_trigger_recompilation[ranks1] PASSED [ 66%]
tests/test_gpu_examples.py::TestHotSwapping::test_hotswapping_compiled_model_does_not_trigger_recompilation[ranks2] PASSED [100%]

@yemreck (Contributor, Author) commented Jun 27, 2025

> Thanks very much for the PR.
>
> If you profile the test case with torch.profiler, you should see "CUDAGraphs.record_function" only once, but we currently see it twice.
>
> You mean that with this PR, we see it only once? 👁️ The PR looks good to me. We should merge after a unit test.
>
> @anijain2305 if you have comments.

Yes, with this PR, CUDAGraph.record_function will be called only once. The third and later calls will be replayed.

This greatly improves the initial forward call after a hotswap in scenarios where LoRAs are swapped frequently.

@yemreck changed the title from "FIX: Avoid possible CUDA Graph re-record when hotswapping LoRAs." to "FIX: Avoid CUDA Graph re-record when hotswapping LoRAs." on Jun 27, 2025
@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment

Thanks a lot @keepdying for identifying this torch.compile issue, fixing it, and extending the test case to cover it. I tested this locally (PR CI does not use GPU) and the test case indeed fails without the fix and works with it.

The failing CI is unrelated, so this PR is good to be merged.

@BenjaminBossan merged commit bbc9f5d into huggingface:main on Jun 27, 2025
2 of 14 checks passed
@yemreck (Contributor, Author) commented Jun 27, 2025

Glad I could help. Thanks for this great library!

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jun 27, 2025
When the diffusers hotswap tests were added to PEFT in huggingface#2120, the
diffusers test was marked as xfail because hotswapping was not yet
implemented in diffusers. This has long been achieved but the test was
not updated.

This PR now updates the diffusers test in PEFT and removes the xfail.
The new test is basically a copy of the corresponding test in diffusers.
Moreover, I enhanced the test according to huggingface#2611 to also ensure that
there are no CUDA graph re-records.
@sayakpaul (Member) commented

@BenjaminBossan do we wanna propagate this to diffusers too?

@BenjaminBossan (Member) commented

@sayakpaul The fix should not need propagating, so I assume you mean the tests should be updated. Yes, indeed; I started by updating the diffusers test in PEFT: #2619. If that looks good, we can apply the same logic to the diffusers tests.

BenjaminBossan added a commit that referenced this pull request Jul 2, 2025
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025
cyyever pushed a commit to cyyever/peft that referenced this pull request Sep 4, 2025