fix: reclaim hot cache memory retained after model unload #1713

Merged
jundot merged 3 commits into jundot:main from khsd6327:fix/hot-cache-retained-memory
Jun 7, 2026
@khsd6327 khsd6327 commented Jun 7, 2026

Contributor

After a model is unloaded the server can keep tens of GB of resident memory
belonging to a hot cache no longer attached to any model. The memory guard
counts that footprint against the load ceiling, so the next load is rejected
with 507, and POST /api/hot-cache/clear returns ok but frees nothing. The
only recovery is a process restart.

Reproduction

  1. Serve a model and send a large request so the hot cache fills (a prompt
    around 100k tokens).
  2. Let an abort and an unload interleave, as a multi-step workflow can, or
    otherwise unload through a path where engine teardown does not complete.
  3. /api/status shows loaded_count: 0 and current_model_memory: 0, yet the
    process still holds a large resident set (about 57 GB in my case).
  4. POST /api/hot-cache/clear returns ok but resident memory does not move.
  5. Loading a larger model is rejected with 507:
    507: Cannot load gemma-4-31B-it: projected memory 129.88GB would exceed the
    memory ceiling 107.52GB (current: 68.72GB, model: 61.16GB).

The guard computes current as max(mx.get_active_memory(), get_phys_footprint()),
so the retained footprint counts even with nothing loaded. A clean unload
reclaims fully, so this only appears when teardown does not complete.
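
The arithmetic in the 507 message can be reproduced with a minimal sketch. It assumes the guard projects memory as current plus the model size and compares that against the ceiling, which matches the numbers in the message (68.72 + 61.16 = 129.88) but is not taken from the actual guard code. The function name and signature are hypothetical.

```python
def guard_check(active_gb: float, footprint_gb: float,
                model_gb: float, ceiling_gb: float) -> tuple[bool, float]:
    """Hypothetical stand-in for the memory guard decision.

    Assumption: projected = current + model, where current is the larger of
    MLX active memory and the process physical footprint.
    """
    current = max(active_gb, footprint_gb)
    projected = current + model_gb
    return projected <= ceiling_gb, projected


# Nothing loaded (active memory is 0), but the orphaned hot cache keeps the
# physical footprint at 68.72 GB, so the load is rejected.
allowed, projected = guard_check(0.0, 68.72, 61.16, 107.52)
print(allowed, f"{projected:.2f}GB")

# Once the footprint is reclaimed, the same model fits under the ceiling.
allowed_after, projected_after = guard_check(0.0, 0.0, 61.16, 107.52)
print(allowed_after, f"{projected_after:.2f}GB")
```

Because current takes the maximum of the two readings, a zero current_model_memory does not help while the physical footprint stays high.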

Changes (three commits)

  1. fix(engine): always release SSD cache manager on engine close.
    EngineCore.close() only caught RuntimeError from scheduler.shutdown and
    deep_reset, so any other exception aborted close() before it reached
    PagedSSDCacheManager.close(). The manager's writer thread then kept it and
    its hot cache alive until restart. Swallow non-RuntimeError from
    shutdown/deep_reset, then close the SSD manager explicitly if it is still
    set. This removes the cause of the orphan on the teardown path.

  2. feat(cache): reclaim hot cache orphaned by an abnormal teardown.
    SharedHotCacheBudget keeps a strong reference to each manager through
    _HotCacheBudgetEntry.owner, so a manager orphaned by a skipped close() is
    unreachable through the loaded-only iterator. Add clear_all_owners(), which
    clears the hot cache of every manager the budget still references and logs
    per-owner failures instead of swallowing them. The strong owner reference was
    added in 6dcf197.

  3. fix(admin): reclaim hot cache memory on clear via the synchronized path.
    clear_hot_cache now reclaims MLX's buffer pool after clearing the hot cache
    dicts, and calls clear_all_owners() so orphans are reached through the
    budget. The reclaim runs scheduler._sync_and_clear_cache(stream) on each
    loaded engine's own executor, targeting that engine's _stream, which is the
    same _mx_buffer_access_lock and synchronize barrier generation uses
    (issues #300, #1106). With no model loaded it falls back to the global
    executor. It returns bytes_reclaimed, measured before the dicts are cleared
    so raw host-byte entries are counted.
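
The teardown change in commit 1 can be sketched roughly as follows. This is an illustration of the best-effort pattern only, not the code in EngineCore.close(): the classes, the ssd_manager attribute name, and the simulated failure are all hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("teardown_sketch")


class FakeSSDManager:
    """Hypothetical stand-in for PagedSSDCacheManager."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class FakeScheduler:
    """Hypothetical scheduler whose shutdown fails with a non-RuntimeError."""

    def __init__(self, ssd_manager: FakeSSDManager) -> None:
        self.ssd_manager = ssd_manager  # attribute name is an assumption

    def shutdown(self) -> None:
        raise ValueError("simulated teardown failure")

    def deep_reset(self) -> None:
        pass


def close_engine(scheduler: FakeScheduler) -> None:
    # Best-effort teardown: log and continue on any exception, so a failing
    # step cannot abort the rest of close().
    for step in (scheduler.shutdown, scheduler.deep_reset):
        try:
            step()
        except Exception:
            logger.warning("teardown step %s failed", step.__name__, exc_info=True)
    # After the loop, close the SSD manager explicitly if it is still set.
    manager = scheduler.ssd_manager
    if manager is not None:
        manager.close()
        scheduler.ssd_manager = None


scheduler = FakeScheduler(FakeSSDManager())
manager = scheduler.ssd_manager
close_engine(scheduler)
print(manager.closed, scheduler.ssd_manager)
```

If only RuntimeError were caught, the ValueError above would propagate out of close_engine and the manager would never be closed, which is the orphan case described in the first item.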

Safety

The reclaim no longer calls bare mx.clear_cache() on the global executor.
_sync_and_clear_cache holds _mx_buffer_access_lock and synchronizes the
engine stream and the default stream before clear_cache(), so it does not
release buffers still referenced by an active engine stream or read by the async
store-cache worker (issues #300, #1106, #435). Running it on each engine's own
executor mirrors how scheduler reclaims internally.

Test plan

  1. pytest tests/test_admin_hot_cache_clear.py tests/test_paged_ssd_cache.py tests/test_engine_core.py
  2. Full suite.
  3. Real MLX smoke: an orphaned manager left in the budget with no model loaded
    is reclaimed through the synchronized path, and mx.get_cache_memory() drops
    to zero.

Related

Same symptom, different trigger: #1322 (RAM not released after a cancelled
download, loads blocked until restart) and #1060 (memory stuck after a VLM
unload, subsequent loads fail with 507). PR #1593 (merged) fixed the adjacent
teardown case where streaming generators held the engine ref after stop; the
close() gap fixed here is in the same teardown path. The strong owner
reference that pins an orphaned manager was introduced in 6dcf197.
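
How a budget-held strong reference keeps an orphaned manager reachable, and how a clear_all_owners() pass might use it, can be sketched as below. Only the names SharedHotCacheBudget, _HotCacheBudgetEntry.owner and clear_all_owners come from this PR; the classes here are simplified hypothetical stand-ins, and returning a byte count from clear_all_owners is an assumption made for the illustration.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("hot_cache_sketch")


@dataclass
class Manager:
    """Hypothetical stand-in for a cache manager with a hot cache dict."""

    name: str
    hot_cache: dict[str, bytes] = field(default_factory=dict)
    broken: bool = False

    def hot_cache_bytes(self) -> int:
        return sum(len(value) for value in self.hot_cache.values())

    def clear_hot_cache(self) -> None:
        if self.broken:
            raise OSError(f"simulated failure in {self.name}")
        self.hot_cache.clear()


@dataclass
class BudgetEntry:
    owner: Manager  # strong reference: keeps the manager alive and reachable


class Budget:
    """Hypothetical stand-in for the shared hot cache budget."""

    def __init__(self) -> None:
        self.entries: list[BudgetEntry] = []

    def register(self, manager: Manager) -> None:
        self.entries.append(BudgetEntry(owner=manager))

    def clear_all_owners(self) -> int:
        reclaimed = 0
        for entry in list(self.entries):
            owner = entry.owner
            try:
                size = owner.hot_cache_bytes()  # measure before clearing
                owner.clear_hot_cache()
            except Exception:
                # Log per-owner failures instead of swallowing them.
                logger.warning("clear failed for %s", owner.name, exc_info=True)
                continue
            reclaimed += size
        return reclaimed


budget = Budget()
orphan = Manager("orphan", {"block-0": b"x" * 1024})
failing = Manager("failing", {"block-0": b"y" * 10}, broken=True)
budget.register(orphan)
budget.register(failing)
print(budget.clear_all_owners())
```

The orphan is not attached to any loaded model, so an iterator over loaded engines would never visit it; walking the budget entries is what reaches it. A failure in one owner is logged and does not stop the remaining owners from being cleared.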

khsd6327 added 3 commits June 6, 2026 23:13
EngineCore.close() only caught RuntimeError from scheduler.shutdown and
deep_reset, so any other exception aborted close() before scheduler.shutdown
reached PagedSSDCacheManager.close(). The manager's writer thread then kept the
manager and its hot cache alive until process restart.

Swallow non-RuntimeError from shutdown/deep_reset (best-effort teardown) and,
after the loop, close the SSD cache manager explicitly if it is still set.

SharedHotCacheBudget keeps a strong reference to each owning
PagedSSDCacheManager via _HotCacheBudgetEntry.owner, so a manager orphaned by a
teardown that skipped close() stays resident and is unreachable through the
loaded-only scheduler iterator. Add clear_all_owners(), which clears the hot
cache of every manager the budget still references and logs per-owner failures
instead of swallowing them.

POST /api/hot-cache/clear returned ok but freed no resident memory after models
were unloaded, so the memory guard kept rejecting loads with 507. Clear the
per-model hot cache dicts, call clear_all_owners() for orphans, then reclaim
MLX's buffer pool with scheduler._sync_and_clear_cache() run on each loaded
engine's own executor for its stream (the same _mx_buffer_access_lock and
synchronize barrier generation uses, issues jundot#300/jundot#1106), falling back to the
global executor when no model is loaded. Return bytes_reclaimed, measured before
the dicts are cleared.
@khsd6327 khsd6327 force-pushed the fix/hot-cache-retained-memory branch from aa98320 to e1deb8d Compare June 7, 2026 03:18
jundot commented Jun 7, 2026

Owner

Thanks for the careful fix and for documenting the failure mode so clearly. I checked the teardown path and the admin hot-cache clear path, and this closes the gap where retained hot-cache memory could survive after unload and block the next load.

The synchronized reclaim path and the shared-budget orphan cleanup fit the existing cache safety constraints well. This looks good to me, and I'm going to merge it. I'll handle one small cleanup separately after merge.

@jundot jundot merged commit e5281c2 into jundot:main Jun 7, 2026
4 checks passed
@khsd6327 khsd6327 deleted the fix/hot-cache-retained-memory branch June 7, 2026 05:49
khsd6327 added a commit to khsd6327/omlx that referenced this pull request Jun 7, 2026
Lessons from PR jundot#1713: branch PRs off origin/main (not fork main) to avoid
fork-drift; PRs cannot be deleted; never bare mx.clear_cache(), use
scheduler._sync_and_clear_cache on the per-engine executor/stream.