Wired Memory Management System #348

Merged
davidkoski merged 7 commits into ml-explore:main from robertmsale:wired-memory-helper on Feb 6, 2026
Conversation

@robertmsale (Contributor)

Proposed changes

Implements option 3 for #347 by moving the wired memory coordinator into mlx‑swift and exposing a generic manager/policy/ticket API.

  • Adds WiredMemoryManager, WiredMemoryTicket, WiredMemoryPolicy, and WiredMemoryEvent.
  • Introduces reservation vs active tickets and hysteresis (threshold + cooldown) to avoid shrink thrash while work is active.
  • Uses policy id (Identifiable) for grouping.
  • Adds tests covering policy stacking, reservations, hysteresis, cooldown, and admission behavior.
  • Adds doc comments and new DocC article (wired-memory.md) plus README/MLX doc updates.
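
The reservation vs. active distinction described above can be sketched with a toy model. All names here are local stand-ins, not the real MLX API, and the rule that reservations only elevate the limit while active work is in flight is an assumption based on the description in this PR:

```swift
// Toy model of reservation vs. active tickets (stand-in names only).
// Assumption: reservations keep their bytes accounted, but only
// elevate the wired limit while some active work is running.
enum TicketKind { case active, reservation }

struct MiniWiredModel {
    var baseline = 0
    var sizes: [TicketKind: [Int]] = [:]

    mutating func start(_ size: Int, kind: TicketKind) {
        sizes[kind, default: []].append(size)
    }

    // Sum-style aggregation of outstanding ticket sizes.
    var limit: Int {
        let active = sizes[.active, default: []].reduce(0, +)
        let reserved = sizes[.reservation, default: []].reduce(0, +)
        return active > 0 ? baseline + active + reserved : baseline
    }
}

var model = MiniWiredModel()
model.start(8, kind: .reservation)   // weights reserved while idle
assert(model.limit == 0)             // idle: limit stays at baseline
model.start(4, kind: .active)        // inference begins
assert(model.limit == 12)            // active work elevates reservations too
```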

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Robert Sale added 2 commits January 31, 2026 15:28
@robertmsale robertmsale marked this pull request as ready for review February 1, 2026 01:04
@robertmsale (Contributor, author)

Tested end-to-end against the most recent commits in mlx-swift-lm. The pipeline from here through inference over there is effectively wired together, with unit tests demonstrating the system, showing how to use it effectively, and preserving default behavior.

@robertmsale (Contributor, author)

[Screenshot: system overview]

@robertmsale (Contributor, author)

I added a policy-only mode for CPU and future CUDA workflows (there are open issues indicating CUDA support is planned), and made it so Apple Silicon devices running CPU-only can still use maximumRecommendedWorkingSetBytes as a reference cap 😄

MLX only provides the generic interfaces. MLXLMCommon (from mlx-swift-lm)
provides LLM-focused policies such as `WiredSumPolicy`, `WiredMaxPolicy`, and
`WiredFixedPolicy`. You can use `GPU.maxRecommendedWorkingSetBytes()` as a
portable upper bound when designing custom policies.
Collaborator:

I wonder if one or more of these should be in mlx-swift? WiredSumPolicy or WiredMaxPolicy for example might be generic enough and could be used in domains outside of llms. I do agree that the interesting policies will be domain specific.
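
As a rough illustration of how sum- and max-style policies might aggregate ticket sizes under a portable cap: only the policy names come from this PR; the signatures and aggregation rules below are assumptions, with `cap` standing in for `GPU.maxRecommendedWorkingSetBytes()`.

```swift
// Hypothetical aggregation for a sum-style policy: the limit is the
// baseline plus all outstanding ticket sizes, clamped to the cap.
func sumPolicyLimit(baseline: Int, ticketSizes: [Int], cap: Int) -> Int {
    min(baseline + ticketSizes.reduce(0, +), cap)
}

// Hypothetical aggregation for a max-style policy: the limit tracks
// the single largest outstanding ticket, clamped to the cap.
func maxPolicyLimit(baseline: Int, ticketSizes: [Int], cap: Int) -> Int {
    min(max(baseline, ticketSizes.max() ?? baseline), cap)
}
```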

```swift
try await ticket.withWiredLimit {
    // run inference
}
```
Collaborator:

Nice clear example!


```swift
// Reserve model weights without keeping the limit elevated while idle.
let weights = policy.ticket(size: weightsBytes, kind: .reservation)
_ = await weights.start()
```
Collaborator:

Should there be a way to cancel this? E.g. you might have a server that would load and unload models (weights). It might make sense to cancel the policy rather than this reservation, or if this returned a Cancellable you could cancel the specific reservation.

Collaborator:

I think if this was inside a Task you could cancel the task (I can see how that works), but this example looks like you might want to hold a ticket long term and wrapping it in a Task is unwieldy.
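
One possible shape for the Cancellable idea floated above; this is entirely hypothetical, not the PR's API:

```swift
// Hypothetical reservation token: cancel() releases the reservation
// exactly once; later calls are no-ops, so holding it long term and
// cancelling from any path (including deinit) is safe.
final class ReservationToken {
    private var release: (() -> Void)?
    init(release: @escaping () -> Void) { self.release = release }
    func cancel() {
        release?()
        release = nil
    }
}

var reservedBytes = 1024
let token = ReservationToken { reservedBytes = 0 }
token.cancel()
token.cancel()   // safe: second cancel does nothing
```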


```markdown
### Choosing a baseline

When wired memory is unsupported, the manager will use:
```
Collaborator:

This might be unclear -- if wired memory is unsupported what do these values actually do? I would think it would just be a NOP.

Comment on lines +25 to +28
```swift
let result = mlx_set_wired_limit(&previous, 0)
guard result == 0 else { return nil }
var tmp: size_t = 0
_ = mlx_set_wired_limit(&tmp, previous)
```
Collaborator:

I am concerned that temporarily setting it to 0 may cause trouble. Can we avoid using this? What if the manager could answer this and we just document that use of multiple managers outside a testing context is ill-advised or undefined. The manager sets it so it should surely have this answer.

```swift
///
/// These settings implement hysteresis to prevent small or frequent shrinks
/// while active work is running. Growing the limit is always allowed; shrinking
/// is gated by a minimum drop and a minimum time between changes.
```
Collaborator:

Is this something you observed being a problem? If so, awesome -- it seems like it could be useful to keep active memory hot.
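
The gating described in that doc comment can be sketched as a small predicate. Names like `minShrinkDelta` and `shrinkCooldown` are assumptions modeled on the configuration fields this PR mentions:

```swift
import Foundation

// Sketch of shrink hysteresis: grows always pass; shrinks must exceed
// a minimum drop and respect a cooldown since the last limit change.
func allowLimitChange(
    current: Int,
    proposed: Int,
    minShrinkDelta: Int,
    shrinkCooldown: TimeInterval,
    lastChange: Date?,
    now: Date = Date()
) -> Bool {
    if proposed >= current { return true }                   // growing: always ok
    if current - proposed < minShrinkDelta { return false }  // drop too small
    if let lastChange, now.timeIntervalSince(lastChange) < shrinkCooldown {
        return false                                         // still cooling down
    }
    return true
}
```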


```swift
if let lastLimitChange {
    let elapsed = Date().timeIntervalSince(lastLimitChange)
    if elapsed < configuration.shrinkCooldown {
```
Collaborator:

I have one concern here: the way I read this, we won't shrink unless another request comes through here?

Consider:

  • request for N bytes of wired memory
  • request completes; we try to reduce back to 0, but the cooldown blocks it
  • no more requests come in

Does the process still hold the wired memory? I think so, but perhaps there is a path I didn't see.

```swift
/// wired memory control is unsupported (e.g. CPU-only execution). The
/// manager will not attempt to change wired memory, but tickets can still
/// gate admission and emit events.
public var policyOnlyWhenUnsupported: Bool
```
Collaborator:

Is there a benefit to having the default be false? Or even having this config -- why not just always do this? It seems harmless (the compute cost is minimal even if the result is a NOP).

```swift
/// Debug label for the policy group, if applicable.
public let policy: String?
/// Baseline wired limit captured from the system.
public let baseline: Int?
```
Collaborator:

Maybe the term baseline needs an explicit definition? My concept is that this is the idle floor -- we don't go below this, even when there are no outstanding requests. But the unsupported case and in the code seem to tie this to the max supported value -- I am not sure what the intent is.

I think the baseline today (without this code) is 0.


```swift
/// Stable grouping key for policies.
private enum PolicyKey: Hashable {
    case identifier(AnyHashable)
```
Collaborator:

Do we need this enum? Why not just use AnyHashable as the key? This is internal to the actor so if the implementation requires adding another level it is safe to do so without callers being aware.
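
The suggestion works because `AnyHashable` is itself a valid dictionary key, mixing underlying types freely; a quick illustration:

```swift
// AnyHashable can key a dictionary directly, so the wrapper enum is
// only needed if a second key case appears later.
var groups: [AnyHashable: [String]] = [:]
groups[AnyHashable("llm-policy"), default: []].append("ticket-1")
groups[AnyHashable(7), default: []].append("ticket-2")
assert(groups.count == 2)
```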

```swift
public func end(id: UUID, policy: any WiredMemoryPolicy) async -> Int {
    if let waiter = waiters.removeValue(forKey: id) {
        waiter.resume()
    }
```
Collaborator:

There is a call to resumeWaiters() at the end that will awaken all the waiters. Why have a one off here? (not saying it is wrong, but it isn't clear to me -- if we need it I think maybe it needs a comment)

```swift
waiter.resume()
}

guard WiredMemoryBackend.isSupported || policyOnlyMode else {
```
@davidkoski (Collaborator), Feb 2, 2026:

I think isSupported will return false if this is a Task set for the cpu device. Two questions:

  • will returning a 0 here affect the wired memory for the GPU tasks? would a nil be better to indicate "no change"?
    • the return value is more informative -- policy is applied by applyCurrentLimit(). would the return from the end of the function make more sense? currentLimit ?? baseline ?? 0
  • is it ok that resumeWaiters() is not called (also consider this question if the one-off in a previous line isn't needed) -- I think maybe this is ok because the wired memory probably doesn't change here

```swift
}

guard let state = tickets.removeValue(forKey: id) else {
    emit(kind: .ticketEndIgnored, ticketID: id)
```
Collaborator:

What is the case where this would happen? I wonder if this should be a fatalError -- is some invariant violated here (like we lost a ticket)?

I think at one point it was mentioned that end should be idempotent. Should it? Or is that papering over programming errors? free() isn't idempotent and is a similar resource release idea.

```swift
continuation.onTermination = { _ in
    Task { await self.removeEventContinuation(id: id) }
}
}
```
Collaborator:

I think this is safe -- the call to AsyncStream is synchronous so the interior happens before it returns. It surprised me because I was used to this form:

    let (stream, continuation) = AsyncStream<Generation>.makeStream()

I don't have a strong preference - I think they are the same, just pointing out as I had to read to make sure I understood.

```swift
var baselineValue = ensureBaseline(refresh: baseline == nil || !hasActiveWork())
if tickets[id] != nil {
    emit(
        kind: .ticketStartIgnored,
```
Collaborator:

Similar question on the end -- a double start might be a programming error. Is there a use case for allowing this?

```swift
///
/// If this was the last ticket, the manager restores the baseline and
/// clears internal state.
public func end(id: UUID, policy: any WiredMemoryPolicy) async -> Int {
```
Collaborator:

This is marked as async but I do not see any await inside. Callers outside the actor have to await anyway, since that is what the actor requires, but this would also force callers inside the actor (if there were any) to 1) be async and 2) potentially have a suspension point, which requires a little more care.

I think this should remove the async.
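
A minimal illustration of the point: inside an actor, a method with no internal `await` need not be `async`; external callers still `await` the actor hop.

```swift
// Non-async actor method: callers outside the actor still write
// `await counter.increment()`, but in-actor callers get a plain,
// non-suspending call.
actor Counter {
    private var value = 0

    func increment() -> Int {   // no `async`: nothing suspends inside
        value += 1
        return value
    }
}
```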

Comment on lines +570 to +571
```swift
if WiredMemoryBackend.isSupported {
    return WiredMemoryBackend.readCurrentLimit() ?? 0
```
Collaborator:

Could we use currentLimit here? I know there is an issue if you have multiple managers but we can document that as undefined (it seems useless except for tests).

In fact, I wonder ... we could make the init() on the actor be non-public to enforce this. The tests could still create multiple. Or add an obvious path like static func newTestPolicy() that would be more noticeable if somebody tried multiple in a real program?

```swift
    return 0
}

private func ensureBaseline(refresh: Bool) -> Int {
```
@davidkoski (Collaborator), Feb 3, 2026:

This is not taking an action, it is computing the current baseline. ensure sounds like it is applying. I see it sits on top of resolve so we are already using that name :-). The key here is that it notices when the resolved baseline changes. resolveBaselineAndEmit? I am not great with the names (and it is private so ultimately just needed for clarity), see what you think.

README.md (outdated)

````markdown
popd
```

## Wired Memory Management
````
Collaborator:

These docs are well written but I don't think we need them at the top level of the README. I wonder if this would be better in Source/MLX/Documentation.docc/MLX.md (the top level of the MLX docs where there are pointers to some of the articles with more information). This README is much higher level.

```swift
size: Int,
policy: any WiredMemoryPolicy,
kind: WiredMemoryTicketKind
) async -> Int {
```
Collaborator:

Unlike the end this one must be async as it can suspend per the admit on the policy.

@@ -0,0 +1,121 @@

```markdown
# Wired Memory Management

Coordinate a process-wide wired memory limit for GPU workloads.
```
@davidkoski (Collaborator), Feb 3, 2026:

I think we need a few additions:

  • deprecate Memory.withWired and GPU.withWired
  • the sync version of that can be documented as being a NOP
  • the async version can use a static policy, e.g. WiredSumPolicy (document)

We don't want a back door that will let people bypass the mechanism.

@davidkoski (Collaborator)

Looks really good! Check out my comments & questions and see what you think.

- Added WiredSumPolicy and WiredMaxPolicy
- Removed backend probing for current limit and use caching instead
- Converted WiredMemoryManager to fully singleton with testability flags
- Made policy-only mode enabled by default
- Moved Wired Memory docs from README to correct location
@robertmsale (Contributor, author)

Looks really good! Check out my comments & questions and see what you think.

Alright I carried out some of your recommendations:

  1. WiredSumPolicy and WiredMaxPolicy are both included here
  2. Deprecated synchronous Memory.withWired and GPU.withWired with documentation and default policy for async
  3. Included clarification around baseline definition
  4. Made policy-only an opt-out operation (kept policyOnlyWhenUnsupported for debugging and telemetry access, defaults to true)
  5. Moved docs from README to appropriate docs location
  6. Fixed Linux/CUDA CI build errors by gating GPU.maximumRecommendedWorkingSet() with metal compilation flag
  7. Removed await requirement from Ticket.end(...)
  8. WiredMemoryManager is a singleton and supports mid-flight reconfiguration when no tickets are active (also test fixture tooling)

There are just a few items left:

  1. PolicyKey: We can reduce this down to AnyHashable, since you asked if we need it I thought a follow-up would be appropriate :) didn't know if adding groupings later could be helpful or not, but I have a habit of adding more when less would be appropriate so if you think it's best to flatten the key I'll do it.
  2. "Ticket start/end bookkeeping": This one you mentioned as being a programming error. I agree, but on the other hand I think it's a good safety guarantee for the reasons I mentioned in the review. Particularly with iOS apps entering the background while a Task is running (deinitializing reference-types that trigger the same events as task cancellations, something really hard to debug/test, often overlooked), or spurious tasks causing double start/end, where intentionally handling this might overcome future issues. My alternative would be to keep it, but have a debug assertion/preconditionFailure to capture whether a ticket ID was missing so devs can catch programming errors, while release mode keeps the same behavior to avoid crashing.

I learned a lot from this and appreciate your really thorough feedback!

```swift
    manager: .shared,
    kind: .active
)
return try await ticket.withWiredLimit(body)
```
Collaborator:

Looks great!

@davidkoski (Collaborator)

  1. PolicyKey: We can reduce this down to AnyHashable, since you asked if we need it I thought a follow-up would be appropriate :) didn't know if adding groupings later could be helpful or not, but I have a habit of adding more when less would be appropriate so if you think it's best to flatten the key I'll do it.

No worries, I usually go the opposite way -- less is more until you know you need it. In this case it is a private implementation detail and can be changed as needed in the future without clients knowing (meaning specifically that the details are not critical to it working). Let's leave it as is.

  1. "Ticket start/end bookkeeping": This one you mentioned as being a programming error. I agree, but on the other hand I think it's a good safety guarantee for the reasons I mentioned in the review. Particularly with iOS apps entering the background while a Task is running (deinitializing reference-types that trigger the same events as task cancellations, something really hard to debug/test, often overlooked), or spurious tasks causing double start/end, where intentionally handling this might overcome future issues. My alternative would be to keep it, but have a debug assertion/preconditionFailure to capture whether a ticket ID was missing so devs can catch programming errors, while release mode keeps the same behavior to avoid crashing.

The swift policy (I claim, though now that I try to find it, can't) is for contract violations (programming errors) to typically be fatal. E.g. array indexing errors are fatal -- the developer messed something up.

You may be right that there are use cases where this could happen. Swift does provide CheckedContinuation and UnsafeContinuation potentially for a use case like this. I don't think we need to go that far, but looking at the possibilities:

I wonder if assert() or assertionFailure() might be useful here? It would stop in the debugger / crash for debug builds and pass through to the messaging infrastructure you have here -- developers would see it, customers wouldn't. That might be a happy medium.
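
The happy medium suggested might look like this; a sketch, not the PR's actual code, with `endTicket` and its parameters invented for illustration:

```swift
import Foundation

// Debug builds stop at the assertion when an unknown ticket is ended;
// release builds (assertions compiled out) fall through and tolerate
// the double end() gracefully.
func endTicket(id: UUID, tickets: inout [UUID: Int]) -> Int? {
    guard let size = tickets.removeValue(forKey: id) else {
        assertionFailure("end() called for unknown ticket \(id)")
        return nil   // release: ignore the spurious end
    }
    return size
}
```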

I learned a lot from this and appreciate your really thorough feedback!

Thanks! This work has required a bit more thinking than the average port of an LLM, that is for sure!

@robertmsale (Contributor, author)

I wonder if assert() or assertionFailure() might be useful here?

Awesome! I agree this is the correct way to handle it, especially for the reason you mentioned (triggers a breakpoint in debug). I think if a dev never hits that breakpoint during integration testing or E2E, they very likely have no programming errors, and we still cover those sneaky edge cases! Assertions have been added. There's been a good amount of changes here since the first commit so my downstream PR in mlx-swift-lm is going to require some refactoring and testing, but I'm happy with how this one turned out 😁

@davidkoski (Collaborator) left a comment:

I think this is ready to go, thank you for all the work you put in on this!

@davidkoski davidkoski merged commit 30d0b9b into ml-explore:main Feb 6, 2026
7 checks passed
ronaldmannak pushed a commit to PicoMLX/mlx-swift that referenced this pull request Feb 6, 2026
* Wired Memory Management System
* Exposed `GPU.maxRecommendedWorkingSetBytes()`
* Added policy-only mode for CPU-only support (and future CUDA)
* Deprecated synchronous WiredMemory setting
