FEAT Add hotswapping functionality #2120
Conversation
See also huggingface/diffusers#9453

The idea of hotswapping an adapter is the following: we can already load multiple adapters, e.g. two LoRAs, at the same time. But sometimes, we want to load one LoRA and then replace its weights in-place with the LoRA weights of another adapter. This is now possible with the hotswap_adapter function.

In general, this should be faster than deleting one adapter and loading the new adapter in its place, which would be the current way to achieve the same final outcome. Another advantage of hotswapping is that it prevents re-compilation in case the PEFT model is already compiled, which can save quite a lot of time.

There are some caveats for hotswapping:
- It only works for the same PEFT method, so no swapping LoRA and LoHa.
- Right now, only LoRA is properly supported.
- The adapters must be compatible (e.g. same LoRA alpha, same target modules).
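As a rough sketch of the intended workflow (the import location and exact signature of hotswap_adapter here are assumptions based on this description, and the model/adapter ids are placeholders):

```python
# Rough sketch of the hotswapping workflow described above. The import path and
# signature of hotswap_adapter are assumptions, and the model/adapter ids are
# placeholders, not tested references.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("some/base-model")
model = PeftModel.from_pretrained(base, "user/lora-adapter-1")  # load the first LoRA
model = torch.compile(model)  # optional: compile once; hotswapping avoids re-compilation

from peft.utils.hotswap import hotswap_adapter  # assumed import location

# Replace the loaded adapter's weights in-place with those of a second,
# compatible LoRA (same alpha, same target modules, same PEFT method).
hotswap_adapter(model, "user/lora-adapter-2", adapter_name="default")
```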
    return peft_model_state_dict, mismatched


def _insert_adapter_name_into_state_dict(
This is the same code as before, but factored out into a function so that it can be reused for hotswapping.
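To illustrate what that helper does, here is a hypothetical sketch (the actual implementation in this PR may differ): it scopes generic PEFT state dict keys to a concrete adapter name.

```python
# Hypothetical sketch of the idea behind _insert_adapter_name_into_state_dict;
# the real helper in this PR may differ. It maps e.g.
# "...lora_A.weight" -> "...lora_A.<adapter_name>.weight".
def insert_adapter_name(state_dict, adapter_name, parameter_prefix="lora_"):
    renamed = {}
    for key, value in state_dict.items():
        if parameter_prefix in key:
            if "." in key.split(parameter_prefix)[1]:
                prefix, _, rest = key.rpartition(".")
                renamed[f"{prefix}.{adapter_name}.{rest}"] = value
            else:
                # e.g. "...lora_embedding_A" -> "...lora_embedding_A.<adapter_name>"
                renamed[f"{key}.{adapter_name}"] = value
        else:
            renamed[key] = value
    return renamed
```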
else:
    state_dict = peft_model_state_dict

if config.peft_type in (
This change is unrelated but I wanted to clean this up.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul
left a comment
Very cool work! I have left a couple of comments. Let me know if they make sense.
# real check: model now behaves again like adapter 0
assert torch.allclose(output0, output_loaded_back0, atol=atol, rtol=rtol)


def test_hotswap_incompatible_config_params_raises(self, tmp_path):
@yiyixuxu has a very nice PoC supporting this to some extent:
huggingface/diffusers#9453 (comment)
Maybe we could leverage that?
Ah yes, sorry, I somehow missed this.
My plan would be to restrict this feature to require the same alphas and, when wanting to avoid recompilation, also the same rank. I would address those issues in a follow-up PR to keep this already big PR from growing even further. WDYT?
Alright. That works for me.
Then I guess we need to work on that follow-up PR first before making progress in the diffusers PR (huggingface/diffusers#9453).
I guess it depends. If you think that without these features, it's not useful enough, we should wait to create the right impact.
Regarding the different LoRA sizes, IIUC, it would only work with padding the weights to the largest size. This is not something we can automate, as we don't know the largest size ahead of time.
As for the alphas, we would need to ensure that converting to scalars has no adverse effects on other things, which is why I wanted to exclude this from the PR for now.
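To illustrate the padding idea mentioned above (purely hypothetical, not part of this PR): a smaller-rank LoRA matrix could be zero-padded to a larger target rank so that tensor shapes, and hence the compiled graph, stay constant across hotswaps.

```python
# Purely hypothetical illustration of the zero-padding idea, not code from this PR.
# Padding lora_A with zero rows (and, analogously, lora_B with zero columns) keeps
# the product lora_B @ lora_A unchanged while fixing the tensor shapes.
import torch

def pad_lora_A(lora_A: torch.Tensor, target_rank: int) -> torch.Tensor:
    rank, in_features = lora_A.shape
    if rank > target_rank:
        raise ValueError("target_rank must be >= the adapter's rank")
    padded = torch.zeros(target_rank, in_features, dtype=lora_A.dtype, device=lora_A.device)
    padded[:rank] = lora_A
    return padded

small_A = torch.randn(4, 64)       # rank-4 adapter weight
padded_A = pad_lora_A(small_A, 8)  # shaped like a rank-8 adapter, same effective weights
```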
Oh, okay, thanks for explaining. Yeah, without the support for varied rank LoRAs and alphas, this feature won't have much value in the diffusion world, sadly.
Perhaps we can ship this iteration first and work on supporting varied ranks and alphas afterward.
Yes, that would be the idea. For now, I've documented the limitations but as YiYi showed, we should hopefully be able to work around them.
Is there anything left to do in this PR?
# check that the recompilation message is not present
assert "__recompiles" not in stderr.decode()

# contingency check: without hotswapping, we *do* get recompilation
process = subprocess.Popen(
    [sys.executable, file_name, "0"], env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)

# communicate will read the output and error streams, preventing deadlock
stdout, stderr = process.communicate()
exit_code = process.returncode

# sanity check:
assert exit_code == 0

# check that the recompilation message *is* present this time
assert "__recompiles" in stderr.decode()
Not supported for now.
Equivalent to test in diffusers #9453
Marker needs to be removed when diffusers merges the hotswap feature.
sayakpaul
left a comment
Thanks!
My main comment is around https://github.com/huggingface/peft/pull/2120/files#r1804391193. LMK if that makes sense.
- It only works for the same PEFT method, so no swapping LoRA and LoHa, for example.
- Right now, only LoRA is properly supported.
- The adapters must be compatible (e.g. same LoRA alpha, same target modules).
Could add a note saying this is not limited to transformers and works with diffusers, too. But if we wanna wait until huggingface/diffusers#9453 is figured out and merged, I will understand.
I added a sentence. It should already work with diffusers models when users use the hotswap_adapter function; it's just not natively supported in diffusers yet, so I'm fine with adding it.
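For instance, something along these lines should already be possible (sketch only: the pipeline and adapter ids are placeholders, the hotswap_adapter import path is assumed, and the second adapter is assumed to be stored in PEFT's adapter format):

```python
# Sketch only: ids are placeholders, the import path is assumed, and the second
# adapter is assumed to be saved in PEFT's adapter format.
from diffusers import DiffusionPipeline
from peft.utils.hotswap import hotswap_adapter

pipe = DiffusionPipeline.from_pretrained("some/diffusion-model")
pipe.load_lora_weights("user/lora-style-1", adapter_name="default")

# Swap the UNet's LoRA weights in-place with a compatible second LoRA.
hotswap_adapter(pipe.unet, "user/lora-style-2", adapter_name="default")
```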
torch_device = "cuda" if torch.cuda.is_available() else "cpu"


def get_small_unet():
Could also add a note saying that currently, it does not work in the full pipeline context when compile is enabled.
sayakpaul
left a comment
Thanks for your patience! Excellent start!
When the diffusers hotswap tests were added to PEFT in #2120, the diffusers test was marked as xfail because hotswapping was not yet implemented in diffusers. This has long been achieved but the test was not updated. This PR now updates the diffusers test in PEFT and removes the xfail. The new test is basically a copy of the corresponding test in diffusers. Moreover, I enhanced the test according to #2611 to also ensure that there are no CUDA graph re-records.