Skip to content

fix: count tracked model memory in pre-load admission (#1623)#1766

Merged
jundot merged 1 commit into
jundot:mainfrom
popfido:fix/preload-admission-accumulator-1623
Jun 10, 2026
Merged

fix: count tracked model memory in pre-load admission (#1623)#1766
jundot merged 1 commit into
jundot:mainfrom
popfido:fix/preload-admission-accumulator-1623

Conversation

@popfido

@popfido popfido commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #1623 (v0.4.0 regression: a second large model loads side-by-side instead of switching, over-committing memory and failing the next prefill).

The pre-load admission check in EnginePool.get_engine() gated on live memory only:

current = max(mx.get_active_memory(), get_phys_footprint())
projected = current + entry.estimated_size
if projected <= ceiling: break   # admit, no eviction

It never consulted the tracked accumulator _current_model_memory. After a model settles or idles, both mx.get_active_memory() and the process footprint can read well below the model's true resident size (Metal releases pooled memory; the SSD hot cache isn't counted), while _current_model_memory still reflects the committed total. So loading a second large model projected under the ceiling and was admitted without evicting the first — pushing memory to 57+ GB and failing the next request with a prefill memory-guard error.

This is a v0.4.0 regression: 0.3.7's _ensure_memory_available() (removed along with max_model_memory in the memory_guard_tier refactor) evicted using the accumulator.

Fix

Add self._current_model_memory to the max():

current = max(
    mx.get_active_memory(),
    get_phys_footprint(),
    self._current_model_memory,
)

The accumulator is a floor on committed model memory that live reads can dip below. This also makes pre-load admission consistent with the request-time prefill-eviction path (_evict_idle_lru_for_prefill), which already takes max(active, phys, _current_model_memory) — the pre-load gate was the lone path missing it.

Test plan

  • New regression test test_eviction_when_live_memory_undercounts: with active = phys = 0 but the accumulator showing the pair over-commits, the second model must still evict the first. Fails on main (model-a not evicted), passes with the fix.
  • pytest tests/test_engine_pool.py — 83 passed (incl. the existing test_eviction_before_load, unaffected: its fixture already proxies phys to the accumulator, so max(0, acc, acc) == acc).
  • pytest tests/test_process_memory_enforcer.py tests/test_scheduler_admission.py tests/test_scheduler_prefill_memory_guard.py tests/test_memory_monitor.py tests/test_server_prefill_memory_handler.py — 177 passed.
  • No new ruff findings.

The pre-load admission check in get_engine() gated on
    current = max(mx.get_active_memory(), get_phys_footprint())
and never consulted the tracked accumulator _current_model_memory. After a
model settles or idles, both live-memory reads can fall well below the model's
true resident size, so loading a second large model projected under the ceiling
and admitted without evicting the first — over-committing to 57+ GB and failing
the next prefill with a memory-guard error (v0.4.0 regression; 0.3.7's removed
_ensure_memory_available() used the accumulator).

Add self._current_model_memory to the max(), making pre-load admission
consistent with the request-time prefill-eviction path, which already counts
the accumulator.

Adds a regression test where live memory under-reports (active=phys=0) but the
accumulator shows the pair over-commits: the second model must still evict the
first.
@popfido popfido force-pushed the fix/preload-admission-accumulator-1623 branch from 240722e to 2d5b524 Compare June 9, 2026 10:05
@jundot

jundot commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Thanks for the fix. I checked the pre-load admission path against the prefill eviction path, and counting _current_model_memory here makes the decision consistent when live memory under-reports. The regression test covers the missing case cleanly, and the related memory/scheduler tests pass on my side.

This looks good to me, and I'm going to merge it.

@jundot jundot merged commit be18e07 into jundot:main Jun 10, 2026
4 checks passed
@popfido popfido deleted the fix/preload-admission-accumulator-1623 branch June 10, 2026 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v0.4.0 regression: Pre-load eviction fails when loading a second large model

2 participants