Checklist
Motivation
Currently, the --max-loaded-loras parameter imposes a hard limit on both:
- The number of LoRAs that can be loaded into CPU memory
- The number of LoRAs a user is allowed to load
This creates a suboptimal user experience, because users may be unable to load as many LoRAs as they need.
We propose a more flexible approach where:
- The CPU memory constraint remains (to prevent OOM errors)
- User-facing limits on the number of LoRAs are removed
- LoRAs are dynamically loaded/unloaded from CPU memory based on user requests, within the bounds of available memory
This would allow users to request any number of LoRAs while the system automatically manages memory usage, resulting in a smoother experience without artificial restrictions.
Expected Behavior:
- Users can submit requests for any number of LoRAs without arbitrary limits
- --max-loaded-loras limits only the number of LoRAs that can be held in CPU memory at once
- When that limit is reached, the least recently used LoRAs (or those chosen by another smart eviction policy) are unloaded to make room for the newly requested one
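The expected behavior above could be sketched roughly as follows. This is only an illustration of LRU eviction, not the project's actual implementation: the class name LRULoRACache and the load_fn / unload_fn callbacks are hypothetical, and max_loaded_loras simply mirrors the --max-loaded-loras flag. It also assumes that no evicted adapter is still in use by an in-flight request; a real implementation would likely need to pin such adapters.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class LRULoRACache:
    """Hypothetical CPU-side LoRA cache with least-recently-used eviction."""

    def __init__(
        self,
        max_loaded_loras: int,
        load_fn: Callable[[str], Any],
        unload_fn: Callable[[str, Any], None],
    ) -> None:
        if max_loaded_loras < 1:
            raise ValueError("max_loaded_loras must be at least 1")
        self.max_loaded_loras = max_loaded_loras
        self._load_fn = load_fn      # assumed: loads adapter weights into CPU memory
        self._unload_fn = unload_fn  # assumed: releases an adapter's CPU memory
        # Insertion order doubles as recency order: oldest first, newest last.
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        """Return the adapter for `name`, loading it (and evicting) if needed."""
        if name in self._loaded:
            # Cache hit: mark as most recently used.
            self._loaded.move_to_end(name)
            return self._loaded[name]

        # Cache miss: evict least recently used adapters until there is room.
        while len(self._loaded) >= self.max_loaded_loras:
            evicted_name, evicted_adapter = self._loaded.popitem(last=False)
            self._unload_fn(evicted_name, evicted_adapter)

        adapter = self._load_fn(name)
        self._loaded[name] = adapter
        return adapter

    def loaded_names(self) -> List[str]:
        """Names currently resident in CPU memory, least recently used first."""
        return list(self._loaded)
```

With this shape, a user can request any adapter name at any time; the limit only bounds how many adapters are resident in CPU memory at once, and a request for an unloaded adapter triggers a reload instead of an error.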
Related resources
No response