fix: count tracked model memory in pre-load admission (#1623) #1766
Merged
jundot merged 1 commit on Jun 10, 2026
The pre-load admission check in get_engine() gated on
current = max(mx.get_active_memory(), get_phys_footprint())
and never consulted the tracked accumulator _current_model_memory. After a
model settles or idles, both live-memory reads can fall well below the model's
true resident size, so loading a second large model projected under the ceiling
and was admitted without evicting the first, over-committing to 57+ GB and
failing the next prefill with a memory-guard error (v0.4.0 regression; 0.3.7's
removed _ensure_memory_available() used the accumulator).
Add self._current_model_memory to the max(), making pre-load admission
consistent with the request-time prefill-eviction path, which already counts
the accumulator.
Adds a regression test where live memory under-reports (active=phys=0) but the
accumulator shows the pair over-commits: the second model must still evict the
first.
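The arithmetic behind the regression can be sketched as follows. This is an illustration only, not the project's code: `needs_eviction`, the 48 GiB ceiling, and the per-model sizes are hypothetical figures chosen so the pair adds up to the 57 GB mentioned above.

```python
GIB = 1024 ** 3


def needs_eviction(active, phys, tracked, incoming, ceiling, use_tracked):
    """Return True when loading `incoming` bytes would exceed `ceiling`.

    Hypothetical helper: `use_tracked` switches between the live-only gate
    and the gate that also counts the tracked accumulator.
    """
    if use_tracked:
        current = max(active, phys, tracked)
    else:
        current = max(active, phys)
    return current + incoming > ceiling


# Hypothetical figures, not measurements from the issue.
ceiling = 48 * GIB        # assumed memory ceiling
tracked = 30 * GIB        # accumulator: first model's committed size
active = phys = 4 * GIB   # live reads after the first model has idled
incoming = 27 * GIB       # estimated size of the second model

before_fix = needs_eviction(active, phys, tracked, incoming, ceiling, use_tracked=False)
after_fix = needs_eviction(active, phys, tracked, incoming, ceiling, use_tracked=True)

print(before_fix)  # False: 4 + 27 = 31 GiB projects under the ceiling, so no eviction
print(after_fix)   # True: 30 + 27 = 57 GiB exceeds the ceiling, so eviction is required
```

With the live-only gate the second model is admitted next to the first; once the accumulator is in the `max()`, the same inputs trigger eviction.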
240722e to 2d5b524
Owner
Thanks for the fix. I checked the pre-load admission path against the prefill eviction path, and counting [...] This looks good to me, and I'm going to merge it.
Summary
Fixes #1623 (v0.4.0 regression: a second large model loads side-by-side instead of switching, over-committing memory and failing the next prefill).

The pre-load admission check in EnginePool.get_engine() gated on live memory only:
current = max(mx.get_active_memory(), get_phys_footprint())
It never consulted the tracked accumulator _current_model_memory. After a model settles or idles, both mx.get_active_memory() and the process footprint can read well below the model's true resident size (Metal releases pooled memory; the SSD hot cache isn't counted), while _current_model_memory still reflects the committed total. So loading a second large model projected under the ceiling and was admitted without evicting the first, pushing memory to 57+ GB and failing the next request with a prefill memory-guard error.

This is a v0.4.0 regression: 0.3.7's _ensure_memory_available() (removed along with max_model_memory in the memory_guard_tier refactor) evicted using the accumulator.

Fix
Add self._current_model_memory to the max(). The accumulator is a floor on committed model memory that live reads can dip below. This also makes pre-load admission consistent with the request-time prefill-eviction path (_evict_idle_lru_for_prefill), which already takes max(active, phys, _current_model_memory); the pre-load gate was the lone path missing it.

Test plan
- test_eviction_when_live_memory_undercounts: with active = phys = 0 but the accumulator showing the pair over-commits, the second model must still evict the first. Fails on main (model-a not evicted), passes with the fix.
- pytest tests/test_engine_pool.py: 83 passed (incl. the existing test_eviction_before_load, unaffected: its fixture already proxies phys to the accumulator, so max(0, acc, acc) == acc).
- pytest tests/test_process_memory_enforcer.py tests/test_scheduler_admission.py tests/test_scheduler_prefill_memory_guard.py tests/test_memory_monitor.py tests/test_server_prefill_memory_handler.py: 177 passed.
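
For readers without the repository at hand, the shape of the regression test can be approximated with a toy pool. Everything here is an assumption made for illustration: `ToyPool`, its constructor arguments, and the unit-less sizes are hypothetical and do not mirror the real `EnginePool` or its fixtures. Only the scenario (live reads of zero, accumulator showing an over-commit) comes from the test plan above.

```python
from collections import OrderedDict


class ToyPool:
    """Hypothetical stand-in for an engine pool; not the project's EnginePool."""

    def __init__(self, ceiling, live_memory):
        self.ceiling = ceiling
        self.live_memory = live_memory   # callable returning (active, phys)
        self.loaded = OrderedDict()      # model name -> size, least recently used first
        self.tracked = 0                 # accumulator of committed model memory

    def _current(self):
        active, phys = self.live_memory()
        # Count the accumulator alongside the live reads, as the fix does.
        return max(active, phys, self.tracked)

    def get_engine(self, name, size):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return name
        # Evict least recently used models until the projected total fits.
        while self.loaded and self._current() + size > self.ceiling:
            _, freed = self.loaded.popitem(last=False)
            self.tracked -= freed
        self.loaded[name] = size
        self.tracked += size
        return name


def test_eviction_when_live_memory_undercounts():
    # Live reads under-report (active = phys = 0); only the accumulator
    # shows that the two models together exceed the ceiling.
    pool = ToyPool(ceiling=48, live_memory=lambda: (0, 0))
    pool.get_engine("model-a", 30)
    pool.get_engine("model-b", 27)
    assert list(pool.loaded) == ["model-b"]
    assert pool.tracked == 27


test_eviction_when_live_memory_undercounts()
```

Dropping `self.tracked` from the `max()` in `_current` makes the assertion fail with both models loaded, which is the behaviour the test plan reports on main.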