fix: yield event loop after engine stop so streaming generators release engine ref before gc #1593

Merged: jundot merged 1 commit into jundot:main from fqx:fix/gc-yield-after-engine-stop on Jun 2, 2026

@fqx fqx commented Jun 2, 2026

Problem

When a model is evicted under memory pressure, its ~20 GB of MLX weight tensors are not freed in time, the settle barrier times out, and the next load attempt fails with 507 even though the dashboard shows no loaded models.

Root cause

The sequence that triggers this is:

  1. process_memory_enforcer._check_and_enforce() detects memory pressure and calls abort_all_requests() on the victim model, which sets asyncio Event objects for every active request — scheduling server-side streaming generators in the asyncio ready queue.

  2. _unload_engine() is then called. Inside entry.engine.stop(), EngineCore.close() runs synchronously (blocking the event loop) to submit scheduler.shutdown() and scheduler.deep_reset() to the single-threaded MLX executor via .result().

  3. While the event loop is blocked, the streaming generators scheduled in step 1 cannot run. They remain suspended, each holding a local engine variable that keeps the BatchedEngine's Python refcount above zero.

  4. stop() returns. We set entry.engine = None and call gc.collect() immediately — but the generators are still alive. BatchedEngine refcount is still > 0. The model's MLX weight tensors stay active in Metal memory.

  5. The settle barrier polls mx.get_active_memory() for up to 5 s and finds ~19–20 GB still pinned. It times out. Emergency reclaim also fails. A subsequent load attempt computes current + model_size > ceiling and returns 507.
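The stale reference in steps 1 to 4 can be reproduced without MLX. The following is a minimal sketch, not the project's code: FakeEngine, stream_tokens, consume, and stale_reference_demo are hypothetical names, and the missing await between abort.set() and gc.collect() stands in for the synchronous EngineCore.close() call.

```python
import asyncio
import gc
import weakref


class FakeEngine:
    """Hypothetical stand-in for BatchedEngine and its model weights."""


async def stream_tokens(engine: FakeEngine, abort: asyncio.Event):
    # `engine` is a local of this generator frame, so the object stays
    # referenced for as long as the frame is suspended.
    await abort.wait()
    yield "aborted"


async def consume(gen) -> None:
    async for _ in gen:
        pass


async def stale_reference_demo() -> bool:
    engine = FakeEngine()
    probe = weakref.ref(engine)
    abort = asyncio.Event()
    consumer = asyncio.create_task(consume(stream_tokens(engine, abort)))
    await asyncio.sleep(0)   # let the generator start and park on abort.wait()

    abort.set()              # step 1: the wake-up is queued, nothing has run yet
    # Steps 2-3: there is no await here, so the event loop never gets a turn.
    engine = None            # step 4: drop the pool's reference
    gc.collect()
    still_alive = probe() is not None   # the suspended frame still holds it

    await consumer           # let the generator finish before returning
    return still_alive
```

Running asyncio.run(stale_reference_demo()) reports that the object is still alive after gc.collect(), because the only code that could release it has been scheduled but not yet executed.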

Reproducer (from real log)

Evicting 'HY-MT1.5-1.8B-4bit' to fit 'Qwen3.6-35B-A3B-oQ4' under memory ceiling (42.98GB > 37.44GB)
...
Unloaded model: HY-MT1.5-1.8B-4bit, freed=1001.26MB (expected>=0.00B), active_memory: 19.66GB (settled)
POST /v1/chat/completions → 507: Cannot load Qwen3.6-35B-A3B-oQ4: projected memory 42.00GB would exceed the memory ceiling 37.44GB (current: 21.35GB, model: 20.64GB)

The 19.66 GB of "active" memory is Qwen's weights held alive by a suspended streaming generator.
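The rejection follows directly from the admission check described in step 5. As a rough sketch using only the figures in the log above (would_exceed is a hypothetical helper, not a function from the codebase, and subtracting the stale 19.66 GB from the current footprint is an assumption made for illustration):

```python
def would_exceed(current_gb: float, model_gb: float, ceiling_gb: float) -> bool:
    """Hypothetical form of the check: current + model_size > ceiling."""
    return current_gb + model_gb > ceiling_gb


CEILING_GB = 37.44
MODEL_GB = 20.64          # Qwen3.6-35B-A3B-oQ4, from the 507 message
CURRENT_GB = 21.35        # footprint reported at request time
STALE_GB = 19.66          # active memory left behind after the unload

# With the stale weights still resident, the load is rejected.
rejected = would_exceed(CURRENT_GB, MODEL_GB, CEILING_GB)

# Assumption: if the stale weights had been released first, the same
# request would fit under the ceiling.
rejected_after_release = would_exceed(CURRENT_GB - STALE_GB, MODEL_GB, CEILING_GB)
```

The first projection comes to about 42 GB against a 37.44 GB ceiling, which matches the logged 507; the second stays well under the ceiling.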

Fix

Add five await asyncio.sleep(0) calls between entry.engine.stop() and entry.engine = None in _unload_engine().

asyncio.sleep(0) does not sleep; each call yields control to the event loop for one iteration so that callbacks already in the ready queue can run. Server streaming generators are at most a few frames deep, so five iterations are sufficient for them to process the abort error, enter their finally blocks, call _cleanup_request(), close, and drop their engine reference. At that point entry.engine holds the last remaining reference, so setting entry.engine = None drops the BatchedEngine refcount to zero and CPython's reference counter frees it immediately. gc.collect() then finds nothing left to hold the model in active memory, the MLX arrays are freed, Metal buffers are returned to the cache, and mx.clear_cache() releases them. The settle barrier succeeds.
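Why more than one yield is needed can be seen in a small two-layer sketch. This is an illustration under assumptions, not the server's real call chain: core_stream and server_stream are hypothetical stand-ins for an inner and an outer streaming layer, and the sketch drops its own reference before yielding (the reverse of the merged ordering) so that a weak reference can observe the exact moment of release.

```python
import asyncio
import gc
import weakref


class FakeEngine:
    """Hypothetical stand-in for BatchedEngine."""


async def core_stream(abort: asyncio.Event, out: asyncio.Queue) -> None:
    # Inner layer: reacts to the abort and forwards it downstream.
    await abort.wait()
    out.put_nowait("aborted")


async def server_stream(engine: FakeEngine, out: asyncio.Queue) -> None:
    # Outer layer: its frame holds `engine` until it has seen the abort.
    await out.get()


async def hops_until_released(max_hops: int = 5) -> tuple[bool, int]:
    engine = FakeEngine()
    probe = weakref.ref(engine)
    abort, out = asyncio.Event(), asyncio.Queue()
    tasks = [
        asyncio.create_task(core_stream(abort, out)),
        asyncio.create_task(server_stream(engine, out)),
    ]
    await asyncio.sleep(0)      # let both layers start and park

    abort.set()                 # abort is queued but not yet processed
    engine = None               # drop our own reference
    gc.collect()
    alive_without_yield = probe() is not None

    hops = 0
    while probe() is not None and hops < max_hops:
        await asyncio.sleep(0)  # one event loop iteration per call
        hops += 1

    await asyncio.gather(*tasks)
    return alive_without_yield, hops
```

In this sketch the first yield lets the inner layer forward the abort and the second lets the outer layer finish and drop the engine, so each additional layer costs one more loop iteration. A fixed count of five leaves headroom for a few such layers.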

# _unload_engine(), after entry.engine.stop() returns:

for _ in range(5):
    await asyncio.sleep(0)   # let pending generators release their engine ref

entry.engine = None          # refcount now 0 -> BatchedEngine freed immediately
gc.collect()                 # model tensors freed -> Metal buffers enter cache
await loop.run_in_executor(
    get_mlx_executor(), lambda: (mx.synchronize(), mx.clear_cache())
)
# settle barrier now sees active_memory drop as expected

Files changed

| File | Change |
| --- | --- |
| omlx/engine_pool.py | Add 5x await asyncio.sleep(0) in _unload_engine() between stop() and entry.engine = None |

Notes

  • The fix adds negligible wall time: asyncio.sleep(0) never actually sleeps, it only hands control to the event loop for one scheduling pass per call.
  • It does not change the settle barrier logic, the emergency reclaim path, or any other behaviour.
  • The underlying cause of EngineCore.close() blocking the event loop during Metal shutdown is a separate issue; this fix works around it at the call site without requiring changes to the Metal/MLX shutdown path.
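For context on that separate issue, the difference between blocking on .result() and awaiting the same work can be sketched as follows. This is only an illustration and not a proposed change to EngineCore.close(): slow_shutdown, ticker, and compare are hypothetical names, and a plain ThreadPoolExecutor stands in for the single-threaded MLX executor.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_shutdown() -> str:
    time.sleep(0.05)   # stands in for scheduler.shutdown() / deep_reset()
    return "closed"


async def ticker(ticks: list) -> None:
    # Records one tick each time the event loop lets it run.
    while True:
        ticks.append(1)
        await asyncio.sleep(0.005)


async def compare() -> tuple[int, int]:
    executor = ThreadPoolExecutor(max_workers=1)
    ticks: list = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)                      # let the ticker start

    before = len(ticks)
    executor.submit(slow_shutdown).result()     # blocks the event loop thread
    ticks_while_blocking = len(ticks) - before

    before = len(ticks)
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(executor, slow_shutdown)   # loop keeps running
    ticks_while_awaiting = len(ticks) - before

    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    executor.shutdown()
    return ticks_while_blocking, ticks_while_awaiting
```

While the loop thread waits on .result(), no other coroutine makes progress; while it awaits the executor future, the ticker continues to run. Pending streaming generators are in the same position as the ticker during the blocking call.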

…se engine ref before gc

When _unload_engine() calls entry.engine.stop(), EngineCore.close()
blocks the asyncio event loop while submitting scheduler.shutdown() and
scheduler.deep_reset() to the single-threaded MLX executor via .result().

During that blocking period, server-side streaming generators that hold a
local reference to the BatchedEngine cannot run — they are suspended at
their next yield point, waiting to process the abort signal sent by
abort_all_requests(). By the time stop() returns, these generator frames
are still alive and their 'engine' local variable keeps the BatchedEngine's
refcount above zero.

Consequently, when entry.engine = None is set and gc.collect() fires
immediately after, the BatchedEngine (and its self._model reference) cannot
be collected. The model's ~20 GB of MLX weight tensors remain "active" in
Metal memory, the settle barrier times out, and subsequent load attempts
fail with 507 because the active footprint still exceeds the ceiling.

Fix: after stop() returns, yield to the asyncio event loop a few times
before clearing entry.engine. This drains the ready queue, allowing
pending generator tear-down coroutines to run and drop their engine
references. With refcount at zero, entry.engine = None triggers immediate
CPython deallocation and gc.collect() finds nothing left to hold the model
in active memory.

jundot commented Jun 2, 2026

Thanks, this matches the unload path I checked: the memory enforcer aborts active requests before eviction, and the streaming cleanup runs from EngineCore.stream_outputs() once the generator gets a chance to resume.

Yielding before clearing the pool reference is a small, low-risk mitigation for the stale engine reference held by suspended streaming frames. CI is green, and this looks good to me. One thing I may fold into a follow-up is a regression test or stronger drain around slow StreamingResponse consumers, since the fixed sleep(0) count is still a heuristic.

I am going to merge it.
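One possible shape for the stronger drain mentioned above is to yield until the engine is observably released rather than a fixed number of times. The following is a sketch under assumptions and not part of the merged change: drain_until_released, slow_consumer, and demo are hypothetical names, and the approach requires the caller to hold only a weak reference while draining, which differs from the merged ordering where entry.engine is cleared after the yields.

```python
import asyncio
import weakref


class FakeEngine:
    """Hypothetical stand-in for BatchedEngine."""


async def drain_until_released(
    probe: weakref.ref,
    timeout: float = 1.0,
    fast_hops: int = 5,
    poll: float = 0.005,
) -> bool:
    """Yield to the loop until `probe` is dead; False if `timeout` expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    hops = 0
    while probe() is not None:
        if loop.time() >= deadline:
            return False
        # Start with bare yields, then back off to short sleeps so a slow
        # consumer does not turn the drain into a busy loop.
        await asyncio.sleep(0 if hops < fast_hops else poll)
        hops += 1
    return True


async def slow_consumer(engine: FakeEngine, delay: float) -> None:
    # Holds `engine` in its frame for longer than a few loop iterations.
    await asyncio.sleep(delay)


async def demo(delay: float, timeout: float) -> bool:
    engine = FakeEngine()
    probe = weakref.ref(engine)
    task = asyncio.create_task(slow_consumer(engine, delay))
    await asyncio.sleep(0)      # let the consumer start
    engine = None               # keep only the weak reference
    released = await drain_until_released(probe, timeout=timeout)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return released
```

The same helper could back a regression test: a consumer that finishes within the timeout should be reported as released, and one that outlives the timeout should be reported as not released so the caller can fall back to the existing settle barrier.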

@jundot jundot merged commit 6eaba85 into jundot:main Jun 2, 2026
4 checks passed
@fqx fqx deleted the fix/gc-yield-after-engine-stop branch June 2, 2026 03:36