Checklist
Describe the bug
In the `load_lora_weight_to_buffer` function, we zero out `A_buffer` when `uid == None` (code reference) to prevent leftover weights of previously evicted LoRA adapters from interfering with subsequent computations.
However, I suspect we should do the same even when `uid != None`, because in theory different adapters can target different modules (e.g., some adapters do not target `k_proj`). Our code might not handle this case correctly. For example, suppose we have two adapters: lora1 targets `k_proj` and lora2 does not. If lora2 reuses the memory buffer left by lora1 after its eviction, the `k_proj` weight of lora1 would remain in the buffer and could contaminate the computation of lora2. I discussed this with @Fridge003 and @Qiaolin-Yu offline, and they share the same suspicion.
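A minimal, standalone sketch of the suspected contamination (the buffer layout, slot index, and shapes below are illustrative assumptions, not the actual SGLang memory-pool code):

```python
import torch

# Hypothetical unified A_buffer for one module (e.g., k_proj):
# [max_loras_per_batch, rank, hidden_dim] -- shapes are made up for illustration.
max_loras_per_batch, rank, hidden = 1, 8, 16
A_buffer_k_proj = torch.zeros(max_loras_per_batch, rank, hidden)

# lora1 targets k_proj: its weights are written into slot 0.
lora1_k_proj = torch.randn(rank, hidden)
A_buffer_k_proj[0].copy_(lora1_k_proj)

# lora1 is evicted and lora2 (which does NOT target k_proj) reuses slot 0.
# Since lora2 provides no k_proj weights, nothing overwrites the slot,
# so lora1's stale k_proj weights are still there:
assert torch.equal(A_buffer_k_proj[0], lora1_k_proj)  # contamination

# Suspected fix: zero out the slot whenever a new adapter is loaded into it,
# not only when uid is None.
A_buffer_k_proj[0].zero_()
assert A_buffer_k_proj[0].abs().sum() == 0
```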
As this is a rare corner case, I have not yet had a chance to construct a test to verify it. I am creating this issue to track this potential bug. We need to:
- verify: construct a test case to reproduce the issue, e.g., set `max-loras-per-batch = 1` but load two adapters with different target modules.
- fix: always zero out the buffer during GPU buffer eviction.
- benchmark: measure the performance overhead introduced by the zero-out operation (see the timing sketch after this list).
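For the benchmark item, a rough way to estimate the cost of the extra zero-out; the buffer shape and device here are assumptions, and the real buffers live in SGLang's LoRA memory pool:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed shape for a single module's A_buffer; adjust to match the real
# pool dimensions (number of slots, rank, hidden size).
buf = torch.empty(64, 8, 4096, device=device)

# Warm up, then time repeated zero_() calls to estimate per-eviction overhead.
for _ in range(3):
    buf.zero_()
if device == "cuda":
    torch.cuda.synchronize()

start = time.perf_counter()
iters = 100
for _ in range(iters):
    buf.zero_()
if device == "cuda":
    torch.cuda.synchronize()
print(f"avg zero_() time: {(time.perf_counter() - start) / iters * 1e6:.1f} us")
```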
Reproduction
See first comment.
Environment
Bug is environment-agnostic.