FIX: Avoid CUDA Graph re-record when hotswapping LoRAs. #2611
BenjaminBossan merged 2 commits into huggingface:main
Conversation
BenjaminBossan left a comment:
Thanks for this PR. I'm not an expert in all the intricacies of torch.compile, so this is very welcome. If you have an example of where this helps, please share it and we can work on a unit test.
It should help with all re-record cases related to weight changes, but even the current tests show some odd re-records with this PR (CUDA graphs are finicky to get working properly). If you profile the test case with torch.profiler, you should see "CUDAGraphs.record_function" only once, but we see it twice. My case is in company code, so I'll update the PR once I manage to get a minimal working repro.
Thanks for the effort.
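For reference, a minimal way to check this with torch.profiler (a sketch, assuming a model already compiled with CUDA graphs and a CUDA input `x`; the exact event name can vary across PyTorch versions):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Assumes: model = torch.compile(model, mode="reduce-overhead") and a CUDA tensor x.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        model(x)

# Count CUDA graph capture events; with a stable graph this should be at most 1.
record_events = [
    e for e in prof.events() if "CUDAGraph" in e.name and "record" in e.name.lower()
]
print(f"CUDA graph record events: {len(record_events)}")
```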
Thanks very much for the PR.
You mean with this PR, we see it only once? 👁️ The PR looks good to me. We should merge after a unit test. @anijain2305 in case you have comments.
@zou3519 regarding improving the UX for CUDA graphs.
Turns out, we just needed a couple of configs to raise an error on static input changes. I also updated the tests to include a third forward pass, since the first forward pass is always going to be an eager warmup run. [Profiler traces: current main vs. this PR]
Yes, with this PR, CUDAGraph.record_function will be called only once; the third and later calls will be replayed. This greatly improves the first forward call after a hotswap in scenarios where LoRAs are swapped frequently.
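Roughly, the calling pattern in the updated tests looks like this (a sketch; `torch._dynamo.config.error_on_recompile` is a real knob, but it may not be the exact config set used in the PR, and the hotswap step is just a placeholder):

```python
import torch

# Turn silent recompilations into hard errors so the test fails loudly
# when a static input (e.g. a weight) unexpectedly changes identity.
torch._dynamo.config.error_on_recompile = True

compiled = torch.compile(model, mode="reduce-overhead")  # enables CUDA graphs

# Call 1 is an eager warmup, call 2 records the CUDA graph, call 3 replays it.
for _ in range(3):
    compiled(x)

# ... hotswap the LoRA weights in place here (placeholder) ...
# A subsequent call should replay the existing graph, not re-record it.
compiled(x)
```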
BenjaminBossan left a comment:
Thanks a lot @keepdying for identifying this torch.compile issue, fixing it, and extending the test case to cover it. I tested this locally (the PR CI does not use a GPU), and the test case indeed fails without the fix and passes with it.
The failing CI is unrelated, so this PR is good to be merged.
Glad I could help. Thanks for this great library!
@BenjaminBossan do we wanna propagate this to diffusers?
@sayakpaul The fix should not need propagating, so I assume you mean the tests should be updated. Yes, indeed, I was starting with updating the diffusers test in PEFT: #2619. If that is good, we can apply the same logic to the diffusers tests.
When the diffusers hotswap tests were added to PEFT in #2120, the diffusers test was marked as xfail because hotswapping was not yet implemented in diffusers. This has long been achieved but the test was not updated. This PR now updates the diffusers test in PEFT and removes the xfail. The new test is basically a copy of the corresponding test in diffusers. Moreover, I enhanced the test according to #2611 to also ensure that there are no CUDA graph re-records.
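For illustration, such a no-re-record check could look like this in a test (a sketch, not the actual test code from the PR; the event-name matching is an assumption):

```python
from torch.profiler import ProfilerActivity, profile


def assert_no_cudagraph_rerecord(compiled_model, x, num_calls=3):
    # Run enough forward passes to get past warmup + record into replay.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_calls):
            compiled_model(x)
    records = [
        e for e in prof.events() if "CUDAGraph" in e.name and "record" in e.name.lower()
    ]
    assert len(records) <= 1, f"unexpected CUDA graph re-record(s): {len(records)}"
```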
Motivation:
In some cases, the recorded CUDA Graph gets silently invalidated due to data pointer changes; unless TORCH_LOGS="perf_hints" is set, Torch emits no warning about it.
This leads to an unexpected CUDA Graph re-record.
We can avoid this by copying the LoRA weights in place, without replacing the underlying tensors themselves. I didn't see any performance regressions with the in-place copy.
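A minimal sketch of the idea (illustrative, not the exact PEFT code; `param` and `new_weight` are placeholder names):

```python
import torch


@torch.no_grad()
def swap_lora_weight_inplace(param: torch.nn.Parameter, new_weight: torch.Tensor) -> None:
    # Rebinding the attribute, e.g. `module.lora_A.weight = Parameter(new_weight)`,
    # allocates fresh storage: the data pointer changes, and a recorded CUDA
    # graph that captured the old pointer is silently invalidated.
    #
    # Copying into the existing storage keeps the data pointer stable, so the
    # recorded graph keeps being replayed instead of being re-recorded.
    param.data.copy_(new_weight)
```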