Checklist
Motivation
Currently, the main overhead for LoRA performance comes from loading adapters from CPU to GPU memory. While we made several efforts to optimize this process in H1, the loading itself is still synchronous and significantly slows down LoRA requests.
One possible solution is to implement some form of zero-overhead scheduling for LoRA, so that the prefetch process can be hidden behind ongoing computation.
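To illustrate the idea, here is a minimal CPU-only sketch of overlap-based adapter prefetching (all names here are hypothetical, not the actual API): the scheduler kicks off the adapter transfer for an upcoming batch on a background worker while the current batch runs, so the load only blocks if it has not finished by the time the adapter is needed. A real implementation would instead use a dedicated CUDA stream with pinned host memory for the copy.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AdapterPrefetcher:
    """Toy sketch of hiding adapter loads behind computation (hypothetical names)."""

    def __init__(self):
        self.gpu_cache = {}   # adapter name -> weights resident "on GPU"
        self.pending = {}     # adapter name -> in-flight load Future
        self.pool = ThreadPoolExecutor(max_workers=1)

    def _load(self, name):
        time.sleep(0.01)      # stand-in for the CPU->GPU weight copy
        return f"weights:{name}"

    def prefetch(self, name):
        # Start the transfer without blocking the scheduling loop.
        if name not in self.gpu_cache and name not in self.pending:
            self.pending[name] = self.pool.submit(self._load, name)

    def acquire(self, name):
        # Blocks only if the prefetch has not completed (or was never issued).
        if name not in self.gpu_cache:
            self.prefetch(name)
            self.gpu_cache[name] = self.pending.pop(name).result()
        return self.gpu_cache[name]
```

In a zero-overhead-style scheduler, `prefetch` would be called when the next batch is planned (during the current batch's forward pass), and `acquire` at launch time, making the synchronous wait the exception rather than the rule.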
Related resources
No response