[Feature] Asynchronous LoRA prefetch #8712

@lifuhuang

Motivation

Currently, the main overhead in LoRA performance comes from loading adapters from CPU to GPU memory. Although we made several efforts to optimize this process in H1, the load itself is still synchronous and significantly slows down LoRA requests.

One possible solution is to implement some form of zero-overhead scheduling for LoRA, so that the latency of prefetching adapters is hidden behind ongoing computation.
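
To make the idea concrete, here is a minimal, stdlib-only sketch of how a prefetch could overlap with compute. It is not the project's API: `AdapterPrefetcher`, `run_batches`, `load_fn`, and `compute_fn` are hypothetical names invented for illustration, and a background thread stands in for what would presumably be an asynchronous host-to-device copy in a real implementation.

```python
from concurrent.futures import ThreadPoolExecutor


class AdapterPrefetcher:
    """Toy stand-in for an asynchronous adapter loader (hypothetical)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn  # blocking load, e.g. a CPU-to-GPU copy
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._inflight = {}  # adapter name -> Future of the loaded adapter

    def prefetch(self, name):
        # Start the load in the background unless one is already pending.
        if name not in self._inflight:
            self._inflight[name] = self._pool.submit(self._load_fn, name)

    def get(self, name):
        # Blocks only if the background load has not finished yet.
        self.prefetch(name)
        return self._inflight.pop(name).result()

    def close(self):
        self._pool.shutdown(wait=True)


def run_batches(adapter_names, load_fn, compute_fn):
    """Overlap loading the adapter for batch i+1 with computing batch i."""
    prefetcher = AdapterPrefetcher(load_fn)
    outputs = []
    try:
        if adapter_names:
            prefetcher.prefetch(adapter_names[0])
        for i, name in enumerate(adapter_names):
            adapter = prefetcher.get(name)
            if i + 1 < len(adapter_names):
                # Kick off the next load before starting this batch's compute.
                prefetcher.prefetch(adapter_names[i + 1])
            outputs.append(compute_fn(adapter))
    finally:
        prefetcher.close()
    return outputs
```

The sketch deliberately omits caching and eviction, so an adapter that appears in consecutive batches is loaded again. Its only purpose is to show the scheduling shape: the load for the next batch is issued before the current batch's compute starts, so the scheduler blocks in `get` only when a load takes longer than the compute it overlaps with.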

Related resources

No response
