
[Feature] Optimize LoRA Loading Mechanism to Decouple User Limits from CPU Memory Constraints #10266

@lw9527


Motivation

Currently, the --max-loaded-loras parameter imposes a hard limit on both:

  1. The number of LoRAs that can be loaded into CPU memory
  2. The number of LoRAs a user is allowed to load

This creates a suboptimal user experience: users can be blocked from loading additional LoRAs even though only some of them need to be resident in CPU memory at any given time.

We would like a more flexible approach in which:

  • The CPU memory constraint remains (to prevent OOM errors)
  • User-facing limits on the number of LoRAs are removed
  • LoRAs are dynamically loaded/unloaded from CPU memory based on user requests, within the bounds of available memory

This would allow users to request any number of LoRAs while the system automatically manages memory usage, resulting in a smoother experience without artificial restrictions.

Expected Behavior:

  • Users can submit requests for any number of LoRAs without arbitrary limits.
  • --max-loaded-loras limits only the number of LoRAs that can be loaded into CPU memory.
  • When the --max-loaded-loras limit is reached, the least recently used LoRAs (or those chosen by another smart eviction policy) are unloaded to make room for the new request.
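
To make the expected behavior concrete, here is a minimal sketch of the eviction logic in Python. It is not the project's actual implementation: the class `LoRACpuCache`, its methods, and the `loader` callback are all hypothetical names chosen for illustration. The sketch assumes that registering a LoRA is cheap and unbounded, while only the CPU-resident set is capped.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class LoRACpuCache:
    """Hypothetical LRU cache of LoRA adapters held in CPU memory."""

    def __init__(self, max_loaded_loras: int, loader: Callable[[str], object]):
        if max_loaded_loras < 1:
            raise ValueError("max_loaded_loras must be at least 1")
        self.max_loaded_loras = max_loaded_loras
        self._loader = loader  # assumed: reads adapter weights from a path
        # Registered adapters (name -> path); intentionally unbounded.
        self._registered: Dict[str, str] = {}
        # CPU-resident adapters, ordered from least to most recently used.
        self._resident: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []  # eviction log, for illustration only

    def register(self, name: str, path: str) -> None:
        """Record an adapter without loading it into CPU memory."""
        self._registered[name] = path

    def get(self, name: str) -> object:
        """Return the adapter, loading it and evicting LRU entries if needed."""
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        path = self._registered[name]  # raises KeyError for unknown adapters
        while len(self._resident) >= self.max_loaded_loras:
            victim, _ = self._resident.popitem(last=False)  # drop the LRU entry
            self.evicted.append(victim)
        weights = self._loader(path)
        self._resident[name] = weights
        return weights

    def resident_names(self) -> List[str]:
        """Names of CPU-resident adapters, least recently used first."""
        return list(self._resident)


cache = LoRACpuCache(max_loaded_loras=2, loader=lambda path: f"weights:{path}")
for name in ("a", "b", "c"):
    cache.register(name, f"/adapters/{name}")  # three registered, cap is two

cache.get("a")
cache.get("b")
cache.get("a")  # "a" becomes most recently used
cache.get("c")  # evicts "b", the least recently used
print(cache.resident_names())  # ['a', 'c']
```

A real implementation would also need to avoid evicting adapters that are still referenced by in-flight requests; this sketch leaves that out.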

