fix: reclaim hot cache memory retained after model unload #1713
Merged
EngineCore.close() only caught RuntimeError from scheduler.shutdown and deep_reset, so any other exception aborted close() before scheduler.shutdown reached PagedSSDCacheManager.close(). The manager's writer thread then kept the manager and its hot cache alive until process restart. Swallow non-RuntimeError from shutdown/deep_reset (best-effort teardown) and, after the loop, close the SSD cache manager explicitly if it is still set.
SharedHotCacheBudget keeps a strong reference to each owning PagedSSDCacheManager via _HotCacheBudgetEntry.owner, so a manager orphaned by a teardown that skipped close() stays resident and is unreachable through the loaded-only scheduler iterator. Add clear_all_owners(), which clears the hot cache of every manager the budget still references and logs per-owner failures instead of swallowing them.
POST /api/hot-cache/clear returned ok but freed no resident memory after models were unloaded, so the memory guard kept rejecting loads with 507. Clear the per-model hot cache dicts, call clear_all_owners() for orphans, then reclaim MLX's buffer pool with scheduler._sync_and_clear_cache() run on each loaded engine's own executor for its stream (the same _mx_buffer_access_lock and synchronize barrier generation uses, issues jundot#300/jundot#1106), falling back to the global executor when no model is loaded. Return bytes_reclaimed, measured before the dicts are cleared.
aa98320 to e1deb8d
Owner
Thanks for the careful fix and for documenting the failure mode so clearly. I checked the teardown path and the admin hot-cache clear path, and this closes the gap where retained hot-cache memory could survive after unload and block the next load. The synchronized reclaim path and the shared-budget orphan cleanup fit the existing cache safety constraints well. This looks good to me, and I'm going to merge it. I'll handle one small cleanup separately after merge.
khsd6327 added a commit to khsd6327/omlx that referenced this pull request Jun 7, 2026
Lessons from PR jundot#1713: branch PRs off origin/main (not fork main) to avoid fork-drift; PRs cannot be deleted; never bare mx.clear_cache(), use scheduler._sync_and_clear_cache on the per-engine executor/stream.
After a model is unloaded the server can keep tens of GB of resident memory belonging to a hot cache no longer attached to any model. The memory guard counts that footprint against the load ceiling, so the next load is rejected with 507, and POST /api/hot-cache/clear returns ok but frees nothing. The only recovery is a process restart.
Reproduction
around 100k tokens).
otherwise unload through a path where engine teardown does not complete.
/api/status shows loaded_count: 0 and current_model_memory: 0, yet the process still holds a large resident set (about 57 GB in my case).
POST /api/hot-cache/clear returns ok but resident memory does not move.
507: current in the guard is max(mx.get_active_memory(), get_phys_footprint()), so the retained footprint counts even with nothing loaded. A clean unload reclaims fully, so this only appears when teardown does not complete.
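The guard arithmetic behind the 507 can be illustrated with a small sketch. Only the max(...) of active memory and physical footprint comes from the description above; the function name, the ceiling comparison, and every number except the 57 GB footprint are hypothetical.

```python
GB = 1024 ** 3


def would_reject_load(active_bytes, phys_footprint_bytes, model_bytes, ceiling_bytes):
    """Hypothetical guard check: reject when current usage plus the model exceeds the ceiling."""
    # From the PR text: "current" is the larger of MLX active memory and the
    # process physical footprint, so a retained footprint counts even when
    # MLX reports nothing active.
    current = max(active_bytes, phys_footprint_bytes)
    # Assumed comparison; the real guard may apply margins or other terms.
    return current + model_bytes > ceiling_bytes


# Nothing loaded (active == 0) but 57 GB still resident: the load is rejected.
rejected_with_orphan = would_reject_load(0, 57 * GB, 20 * GB, 64 * GB)
# Same load after the footprint has actually been released: accepted.
rejected_after_reclaim = would_reject_load(0, 0, 20 * GB, 64 * GB)
```

This is why a clear that empties dicts without lowering the resident footprint does not unblock the next load.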
Changes (three commits)
fix(engine): always release SSD cache manager on engine close. EngineCore.close() only caught RuntimeError from scheduler.shutdown and deep_reset, so any other exception aborted close() before it reached PagedSSDCacheManager.close(). The manager's writer thread then kept it and its hot cache alive until restart. Swallow non-RuntimeError from shutdown/deep_reset, then close the SSD manager explicitly if it is still set. This removes the cause of the orphan on the teardown path.
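The best-effort teardown in this first commit follows a common pattern, sketched below. This is not the project's EngineCore: the class, the teardown_steps list, and the fake manager are hypothetical stand-ins, and the real close() only closes the manager when an earlier step has not already done so.

```python
import logging

logger = logging.getLogger("teardown_sketch")


class EngineCoreSketch:
    """Hypothetical stand-in; not the project's EngineCore."""

    def __init__(self, teardown_steps, ssd_cache_manager):
        self.teardown_steps = list(teardown_steps)
        self.ssd_cache_manager = ssd_cache_manager

    def close(self):
        # Best-effort: a failing step is logged and must not abort the
        # remaining teardown, whatever exception type it raises.
        for step in self.teardown_steps:
            try:
                step()
            except Exception:
                name = getattr(step, "__name__", repr(step))
                logger.exception("teardown step %s failed; continuing", name)
        # After the loop, release the cache manager if it is still set.
        manager, self.ssd_cache_manager = self.ssd_cache_manager, None
        if manager is not None:
            manager.close()


class FakeManager:
    """Records whether close() was reached."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def failing_shutdown():
    raise ValueError("not a RuntimeError")


manager = FakeManager()
engine = EngineCoreSketch([failing_shutdown], manager)
engine.close()  # the ValueError is logged, and the manager is still closed
```

Catching only RuntimeError in the loop would let the ValueError escape here and leave the manager open, which is the orphan this commit removes.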
feat(cache): reclaim hot cache orphaned by an abnormal teardown. SharedHotCacheBudget keeps a strong reference to each manager through _HotCacheBudgetEntry.owner, so a manager orphaned by a skipped close() is unreachable through the loaded-only iterator. Add clear_all_owners(), which clears the hot cache of every manager the budget still references and logs per-owner failures instead of swallowing them. The strong owner reference was added in 6dcf197.
fix(admin): reclaim hot cache memory on clear via the synchronized path. clear_hot_cache now reclaims MLX's buffer pool after clearing the hot cache dicts, and calls clear_all_owners() so orphans are reached through the budget. The reclaim runs scheduler._sync_and_clear_cache(stream) on each loaded engine's own executor, targeting that engine's _stream, which is the same _mx_buffer_access_lock and synchronize barrier generation uses (issues #300, #1106). With no model loaded it falls back to the global executor. It returns bytes_reclaimed, measured before the dicts are cleared so raw host-byte entries are counted.
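The ordering in the last two commits (measure, clear every owner, then reclaim on the right executor) can be sketched without MLX. Every class and function below is a hypothetical stand-in, and the sketch simplifies by treating the budget as the single source of owners; sync_and_clear only records the call where the real path would synchronize streams and clear the buffer pool.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("hot_cache_sketch")
buffer_lock = threading.Lock()  # stand-in for _mx_buffer_access_lock


class ManagerSketch:
    """Hypothetical cache manager holding a hot cache dict."""

    def __init__(self, entries):
        self.hot_cache = dict(entries)  # block id -> raw host bytes

    def clear_hot_cache(self):
        self.hot_cache.clear()


class BudgetSketch:
    """Hypothetical budget that keeps strong references to its owners."""

    def __init__(self, owners):
        self.owners = list(owners)

    def hot_bytes(self):
        return sum(len(v) for o in self.owners for v in o.hot_cache.values())

    def clear_all_owners(self):
        failures = 0
        for owner in self.owners:
            try:
                owner.clear_hot_cache()
            except Exception:
                failures += 1  # log per-owner failures, keep going
                logger.exception("hot cache clear failed for %r", owner)
        return failures


def sync_and_clear(stream, log):
    """Stand-in for the synchronized reclaim: take the lock, then clear."""
    with buffer_lock:
        log.append(stream)  # the real path would synchronize and clear here


def clear_hot_cache(budget, engines, global_executor, log):
    bytes_reclaimed = budget.hot_bytes()  # measured before anything is cleared
    budget.clear_all_owners()
    if engines:
        for executor, stream in engines:  # each engine's own executor and stream
            executor.submit(sync_and_clear, stream, log).result()
    else:
        global_executor.submit(sync_and_clear, None, log).result()
    return {"ok": True, "bytes_reclaimed": bytes_reclaimed}


orphan = ManagerSketch({"a": b"x" * 10})
budget = BudgetSketch([orphan, ManagerSketch({"b": b"y" * 5})])
reclaim_log = []
with ThreadPoolExecutor(max_workers=1) as pool:
    result = clear_hot_cache(budget, [], pool, reclaim_log)
```

Measuring after the clear would report zero bytes, which is why the count is taken first.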
Safety
The reclaim no longer calls bare mx.clear_cache() on the global executor. _sync_and_clear_cache holds _mx_buffer_access_lock and synchronizes the engine stream and the default stream before clear_cache(), so it does not release buffers still referenced by an active engine stream or read by the async store-cache worker (issues #300, #1106, #435). Running it on each engine's own executor mirrors how scheduler reclaims internally.
Test plan
pytest tests/test_admin_hot_cache_clear.py tests/test_paged_ssd_cache.py tests/test_engine_core.py
is reclaimed through the synchronized path, and mx.get_cache_memory() drops to zero.
Related
Same symptom, different trigger: #1322 (RAM not released after a cancelled download, loads blocked until restart) and #1060 (memory stuck after a VLM unload, subsequent loads fail with 507). PR #1593 (merged) fixed the adjacent teardown case where streaming generators held the engine ref after stop; the close() gap fixed here is in the same teardown path. The strong owner reference that pins an orphaned manager was introduced in 6dcf197.