Proposal
mlx-swift uses the core mlx C library as a bridge to expose the underlying features of core to Swift. Those using the C/C++ library directly have access to the wired memory and residency set APIs, including calling set_wired_memory directly. It would be good to expose this functionality here so that downstream libraries can offer wired memory budgeting and concurrency APIs for performance and stability. Options other than simply re-exporting the function are listed below.
Why here?
I found that, unlike the Python mlx-lm package, mlx-swift-lm does not wire down memory automatically. This causes performance issues for models that barely fit in memory, as pointed out in my PR.
I benchmarked three scenarios: without setting wired memory, setting it to a calculated, fixed value, and setting it to the maximum. The calculated and maximum runs performed the same, so asking for maximum residency is not necessary to get decent performance. Intentionally budgeting wired memory would also encourage developers to think about how to balance performance and stability when using MLX on constrained systems such as iOS devices.
mlx-swift already exposes an API for temporarily setting wired memory within a closure, just as the Python libraries do, but the limitation is that the residency set is process-scoped. For Python this isn't an issue because it is single-threaded, and multiple inferences can be managed with multiple processes. With Swift, especially on iOS, we would have to manage this with multi-threading. Developers currently have these options:
- Don't set wired memory and incur severe performance penalties when running inference (or other workloads) with high memory requirements
- Only run one inference at a time with wired memory inside a closure
- Run multiple inferences concurrently and risk overlapping closures restoring a stale limit, so that the process no longer returns to the default
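The third option's failure mode can be illustrated with a small simulation. This is only a sketch: `FakeWiredLimit`, its `swap` method, and the default of 0 are hypothetical stand-ins for a process-wide limit with save/restore semantics, not the mlx-swift API, and the interleaving of two tasks is written out by hand so the result is deterministic.

```swift
// Stand-in for a process-wide wired limit; NOT the real MLX API.
final class FakeWiredLimit {
    // Placeholder default chosen for this simulation only.
    static let defaultLimit = 0
    private(set) var current = FakeWiredLimit.defaultLimit

    /// Sets a new limit and returns the previous one,
    /// mirroring what a save/restore closure does on entry and exit.
    func swap(_ newLimit: Int) -> Int {
        let previous = current
        current = newLimit
        return previous
    }
}

let limit = FakeWiredLimit()

// Task A enters its closure: saves the default (0) and sets 8 GB.
let savedByA = limit.swap(8_000_000_000)
// Task B enters while A is still running: saves A's value (8 GB) and sets 4 GB.
let savedByB = limit.swap(4_000_000_000)

// Task A finishes first and restores what it saved, although B is still running.
_ = limit.swap(savedByA)
let limitWhileBStillRuns = limit.current   // back at 0, B lost its wired memory

// Task B finishes and restores what it saved, which was A's value.
_ = limit.swap(savedByB)
let limitAfterBothFinish = limit.current   // 8 GB, not the default
```

In this interleaving the second task loses its wired limit halfway through, and the process ends up holding the first task's limit after both have finished.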
Exposing the knob without the closure would allow granular control. It belongs in this library because any change to the upstream C/C++ mlx would have to be adapted here anyway, and downstream libraries would not need to do any additional work such as linking directly to the set_wired_memory function.
Options for implementation
- Reexporting the function as public
  - Simple, limits responsibilities to downstream libraries or developers
  - Easy to maintain
  - Potentially leads to footgun if misused
- Exposing an atomic/reference counted setter
  - Multiple Swift tasks can set memory
  - As soon as there are no more references, reverts to default automatically
  - Limits responsibility to downstream libraries/devs, but does not expose the setter directly
- Completely implement task-oriented wired memory setting
  - Makes downstream libraries/devs use a carefully engineered system for managing memory
  - All downstream libraries/devs have to do is use the system (no repetitive reimplementations, standardized)
  - Limits responsibility for maintenance to here
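As a rough illustration of the second option, a reference-counted setter could look something like the sketch below. Everything in it is hypothetical: `WiredLimitCoordinator`, `acquire`, `release`, and the injected `apply` closure are names made up for this example, and "the largest active request wins" is just one possible policy (summing the requests against a budget would be another). A real integration would call the actual wired-limit setter inside `apply`.

```swift
import Foundation

/// Hypothetical reference-counted coordinator for a process-wide wired limit.
/// `apply` stands in for whatever call actually sets the limit.
final class WiredLimitCoordinator {
    private let lock = NSLock()
    private var requests: [UUID: Int] = [:]   // active requests, in bytes
    private var applied: Int
    private let defaultLimit: Int
    private let apply: (Int) -> Void

    init(defaultLimit: Int, apply: @escaping (Int) -> Void) {
        self.defaultLimit = defaultLimit
        self.applied = defaultLimit
        self.apply = apply
    }

    /// The limit most recently passed to `apply`.
    var currentLimit: Int {
        lock.lock(); defer { lock.unlock() }
        return applied
    }

    /// Registers a request and returns a token used to release it later.
    func acquire(bytes: Int) -> UUID {
        lock.lock(); defer { lock.unlock() }
        let token = UUID()
        requests[token] = bytes
        update()
        return token
    }

    /// Drops a request; once none remain, the default is applied again.
    func release(_ token: UUID) {
        lock.lock(); defer { lock.unlock() }
        guard requests.removeValue(forKey: token) != nil else { return }
        update()
    }

    // Policy assumption: the largest active request wins. Caller holds the lock.
    private func update() {
        applied = requests.values.max() ?? defaultLimit
        apply(applied)
    }
}

// Usage: the closure is a no-op here; a real one would set the wired limit.
let coordinator = WiredLimitCoordinator(defaultLimit: 0) { _ in }
let a = coordinator.acquire(bytes: 8_000_000_000)
let b = coordinator.acquire(bytes: 4_000_000_000)
coordinator.release(a)
let afterA = coordinator.currentLimit   // B is still active, so 4 GB stays applied
coordinator.release(b)
let afterB = coordinator.currentLimit   // no requests left, default restored
```

Because every change goes through one lock and the limit is recomputed from the set of active requests, the order in which tasks finish no longer matters. The third option would build task-aware budgeting on top of this kind of bookkeeping.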
Those are the options I came up with, but I'm curious where everyone else stands on this issue. I'm leaning towards option 3 and already have a lot of working code in my original PR at mlx-swift-lm that could easily be translated and reshaped to work here.