fix: reclaim hot cache memory retained after model unload by khsd6327 · Pull Request #1713 · jundot/omlx

khsd6327 · 2026-06-07T02:40:13Z

After a model is unloaded the server can keep tens of GB of resident memory
belonging to a hot cache no longer attached to any model. The memory guard
counts that footprint against the load ceiling, so the next load is rejected
with 507, and POST /api/hot-cache/clear returns ok but frees nothing. The
only recovery is a process restart.

Reproduction

1. Serve a model and send a large request so the hot cache fills (a prompt
   of around 100k tokens).
2. Let an abort and an unload interleave, as a multi-step workflow produces, or
   otherwise unload through a path where engine teardown does not complete.
3. /api/status shows loaded_count: 0 and current_model_memory: 0, yet the
   process still holds a large resident set (about 57 GB in my case).
4. POST /api/hot-cache/clear returns ok but resident memory does not move.
5. Loading a larger model is rejected with 507:

507: Cannot load gemma-4-31B-it: projected memory 129.88GB would exceed the
memory ceiling 107.52GB (current: 68.72GB, model: 61.16GB).

current in the guard is max(mx.get_active_memory(), get_phys_footprint()),
so the retained footprint counts even with nothing loaded. A clean unload
reclaims fully, so this only appears when teardown does not complete.
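
The arithmetic in the 507 message can be checked with a minimal sketch. It assumes the guard projects `current + model` and compares that against the ceiling, which matches the numbers in the error but is not confirmed against the source. The function name is hypothetical, and the split between active memory and physical footprint is assumed.

```python
def projected_load(active_gb, footprint_gb, model_gb, ceiling_gb):
    """Hypothetical guard check: returns (current, projected, allowed)."""
    # "current" is the larger of MLX active memory and the process footprint,
    # so a retained hot cache counts even when no model is loaded.
    current = max(active_gb, footprint_gb)
    projected = current + model_gb
    return current, projected, projected <= ceiling_gb


# Figures from the 507 above; active memory of 0.0 is an assumption.
current, projected, allowed = projected_load(
    active_gb=0.0, footprint_gb=68.72, model_gb=61.16, ceiling_gb=107.52
)
print(f"current={current:.2f}GB projected={projected:.2f}GB allowed={allowed}")
```

With the footprint reclaimed, the same model would project to 61.16GB and fit under the 107.52GB ceiling.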

Changes (three commits)

1. fix(engine): always release SSD cache manager on engine close.
   EngineCore.close() only caught RuntimeError from scheduler.shutdown and
   deep_reset, so any other exception aborted close() before it reached
   PagedSSDCacheManager.close(). The manager's writer thread then kept it and
   its hot cache alive until restart. Swallow non-RuntimeError from
   shutdown/deep_reset, then close the SSD manager explicitly if it is still
   set. This removes the cause of the orphan on the teardown path.
2. feat(cache): reclaim hot cache orphaned by an abnormal teardown.
   SharedHotCacheBudget keeps a strong reference to each manager through
   _HotCacheBudgetEntry.owner, so a manager orphaned by a skipped close() is
   unreachable through the loaded-only iterator. Add clear_all_owners(), which
   clears the hot cache of every manager the budget still references and logs
   per-owner failures instead of swallowing them. The strong owner reference was
   added in 6dcf197.
3. fix(admin): reclaim hot cache memory on clear via the synchronized path.
   clear_hot_cache now reclaims MLX's buffer pool after clearing the hot cache
   dicts, and calls clear_all_owners() so orphans are reached through the
   budget. The reclaim runs scheduler._sync_and_clear_cache(stream) on each
   loaded engine's own executor, targeting that engine's _stream, which is the
   same _mx_buffer_access_lock and synchronize barrier generation uses
   (issues #300, #1106). With no model loaded it falls back to the global
   executor. It returns bytes_reclaimed, measured before the dicts are cleared
   so raw host-byte entries are counted.
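
The best-effort teardown in the first commit can be illustrated with a small self-contained sketch. The classes below are stand-ins, not the project's EngineCore or PagedSSDCacheManager, and the attribute names are assumptions. The point is only the shape: each teardown step is isolated, and the manager is closed afterwards regardless.

```python
import logging

logger = logging.getLogger("teardown_sketch")


class FakeSSDManager:
    """Stand-in for the SSD cache manager; only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeScheduler:
    """Stand-in scheduler whose shutdown fails with a non-RuntimeError."""

    def shutdown(self):
        raise ValueError("simulated failure during shutdown")

    def deep_reset(self):
        pass


class EngineSketch:
    def __init__(self, scheduler, ssd_manager):
        self.scheduler = scheduler
        self.ssd_manager = ssd_manager  # hypothetical attribute name

    def close(self):
        # Best-effort teardown: one failing step must not skip the rest.
        for step in (self.scheduler.shutdown, self.scheduler.deep_reset):
            try:
                step()
            except Exception as exc:
                logger.warning("teardown step %s failed: %r", step.__name__, exc)
        # Close the manager explicitly if it is still set.
        manager, self.ssd_manager = self.ssd_manager, None
        if manager is not None:
            manager.close()


manager = FakeSSDManager()
engine = EngineSketch(FakeScheduler(), manager)
engine.close()
```

Catching only RuntimeError in the loop would let the ValueError propagate and leave `manager.closed` false, which is the orphan case described above.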

Safety

The reclaim no longer calls bare mx.clear_cache() on the global executor.
_sync_and_clear_cache holds _mx_buffer_access_lock and synchronizes the
engine stream and the default stream before clear_cache(), so it does not
release buffers still referenced by an active engine stream or read by the async
store-cache worker (issues #300, #1106, #435). Running it on each engine's own
executor mirrors how the scheduler reclaims internally.
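
The ordering described here (take the lock, synchronize the engine stream and the default stream, then clear) can be sketched without MLX. Everything below is a stand-in: the lock plays the role of _mx_buffer_access_lock, StreamSketch.synchronize stands in for a stream barrier, and the `clear_cache` callable stands in for the pool release. It is not the project's _sync_and_clear_cache.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

buffer_lock = threading.Lock()  # stand-in for _mx_buffer_access_lock
events = []  # records the order of operations for inspection


class StreamSketch:
    """Stand-in for a compute stream with a blocking synchronize()."""

    def __init__(self, name):
        self.name = name

    def synchronize(self):
        events.append(f"sync:{self.name}")


def sync_and_clear(engine_stream, default_stream, clear_cache):
    # Hold the lock so no other thread touches buffers during the reclaim,
    # and drain both streams before releasing pooled buffers.
    with buffer_lock:
        engine_stream.synchronize()
        default_stream.synchronize()
        clear_cache()


# Run the reclaim on the engine's own single-thread executor (hypothetical),
# so it is serialized with that engine's other work.
engine_executor = ThreadPoolExecutor(max_workers=1)
future = engine_executor.submit(
    sync_and_clear,
    StreamSketch("engine"),
    StreamSketch("default"),
    lambda: events.append("clear"),
)
future.result()
engine_executor.shutdown()
print(events)
```

A single-worker executor is used in the sketch so that the reclaim cannot overlap other tasks submitted to the same engine.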

Test plan

pytest tests/test_admin_hot_cache_clear.py tests/test_paged_ssd_cache.py tests/test_engine_core.py
Full suite.
Real MLX smoke: an orphaned manager left in the budget with no model loaded
is reclaimed through the synchronized path, and mx.get_cache_memory() drops
to zero.

Related

Same symptom, different trigger: #1322 (RAM not released after a cancelled
download, loads blocked until restart) and #1060 (memory stuck after a VLM
unload, subsequent loads fail with 507). PR #1593 (merged) fixed the adjacent
teardown case where streaming generators held the engine ref after stop; the
close() gap fixed here is in the same teardown path. The strong owner
reference that pins an orphaned manager was introduced in 6dcf197.

EngineCore.close() only caught RuntimeError from scheduler.shutdown and deep_reset, so any other exception aborted close() before scheduler.shutdown reached PagedSSDCacheManager.close(). The manager's writer thread then kept the manager and its hot cache alive until process restart. Swallow non-RuntimeError from shutdown/deep_reset (best-effort teardown) and, after the loop, close the SSD cache manager explicitly if it is still set.

SharedHotCacheBudget keeps a strong reference to each owning PagedSSDCacheManager via _HotCacheBudgetEntry.owner, so a manager orphaned by a teardown that skipped close() stays resident and is unreachable through the loaded-only scheduler iterator. Add clear_all_owners(), which clears the hot cache of every manager the budget still references and logs per-owner failures instead of swallowing them.
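
A minimal sketch of the clear_all_owners() idea, assuming a budget that holds one entry per owner. The classes and the `clear_hot_cache` method name are illustrative stand-ins, not the project's SharedHotCacheBudget. It shows per-owner failures being logged while the remaining owners are still cleared.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("budget_sketch")


class ManagerSketch:
    """Stand-in cache manager holding a small hot cache dict."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.hot_cache = {"block": b"\0" * 1024}

    def clear_hot_cache(self):
        if self.fail:
            raise OSError(f"simulated failure clearing {self.name}")
        self.hot_cache.clear()


@dataclass
class BudgetEntrySketch:
    owner: ManagerSketch  # strong reference, so orphans stay reachable here


class BudgetSketch:
    def __init__(self):
        self._entries = []

    def register(self, owner):
        self._entries.append(BudgetEntrySketch(owner))

    def clear_all_owners(self):
        """Clear every referenced owner; log failures instead of hiding them."""
        cleared = failed = 0
        for entry in list(self._entries):
            try:
                entry.owner.clear_hot_cache()
                cleared += 1
            except Exception:
                logger.exception("hot cache clear failed for %s", entry.owner.name)
                failed += 1
        return cleared, failed


budget = BudgetSketch()
healthy, broken = ManagerSketch("healthy"), ManagerSketch("broken", fail=True)
budget.register(healthy)
budget.register(broken)
result = budget.clear_all_owners()
```

Because the budget itself is the iteration source, an owner that no loaded engine references is still visited.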

POST /api/hot-cache/clear returned ok but freed no resident memory after models were unloaded, so the memory guard kept rejecting loads with 507. Clear the per-model hot cache dicts, call clear_all_owners() for orphans, then reclaim MLX's buffer pool with scheduler._sync_and_clear_cache() run on each loaded engine's own executor for its stream (the same _mx_buffer_access_lock and synchronize barrier generation uses, issues jundot#300/jundot#1106), falling back to the global executor when no model is loaded. Return bytes_reclaimed, measured before the dicts are cleared.
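
Why bytes_reclaimed is measured before the dicts are cleared can be shown with a short sketch. The function name, the dict layout, and the response shape are assumptions for illustration only.

```python
def clear_and_measure(hot_caches):
    """Sum raw host-byte entries first, then clear; returns a response dict."""
    # Measuring after clear() would always report 0, since the entries are gone.
    bytes_reclaimed = sum(
        len(blob) for cache in hot_caches for blob in cache.values()
    )
    for cache in hot_caches:
        cache.clear()
    return {"status": "ok", "bytes_reclaimed": bytes_reclaimed}


# Two hypothetical per-model hot caches holding raw byte blobs.
hot_caches = [
    {"a": b"\0" * 1024, "b": b"\0" * 1024},
    {"c": b"\0" * 1024},
]
response = clear_and_measure(hot_caches)
print(response)
```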

jundot · 2026-06-07T05:22:28Z

Thanks for the careful fix and for documenting the failure mode so clearly. I checked the teardown path and the admin hot-cache clear path, and this closes the gap where retained hot-cache memory could survive after unload and block the next load.

The synchronized reclaim path and the shared-budget orphan cleanup fit the existing cache safety constraints well. This looks good to me, and I'm going to merge it. I'll handle one small cleanup separately after merge.

Lessons from PR jundot#1713: branch PRs off origin/main (not fork main) to avoid fork-drift; PRs cannot be deleted; never bare mx.clear_cache(), use scheduler._sync_and_clear_cache on the per-engine executor/stream.

khsd6327 added 3 commits June 6, 2026 23:13

khsd6327 force-pushed the fix/hot-cache-retained-memory branch from aa98320 to e1deb8d Compare June 7, 2026 03:18

jundot merged commit e5281c2 into jundot:main Jun 7, 2026
4 checks passed

khsd6327 deleted the fix/hot-cache-retained-memory branch June 7, 2026 05:49
