fix: yield event loop after engine stop so streaming generators release engine ref before gc by fqx · Pull Request #1593 · jundot/omlx

fqx · 2026-06-02T02:37:45Z

Problem

When a model is evicted under memory pressure, its ~20 GB of MLX weight tensors are not freed in time, the settle barrier times out, and the next load attempt fails with 507 even though the dashboard shows no loaded models.

Root cause

The sequence that triggers this is:

1. process_memory_enforcer._check_and_enforce() detects memory pressure and calls abort_all_requests() on the victim model, which sets asyncio Event objects for every active request. This schedules the server-side streaming generators in the asyncio ready queue.
2. _unload_engine() is then called. Inside entry.engine.stop(), EngineCore.close() runs synchronously (blocking the event loop) to submit scheduler.shutdown() and scheduler.deep_reset() to the single-threaded MLX executor via .result().
3. While the event loop is blocked, the streaming generators scheduled in step 1 cannot run. They remain suspended, each holding a local engine variable that keeps the BatchedEngine's Python refcount above zero.
4. stop() returns. We set entry.engine = None and call gc.collect() immediately, but the generators are still alive. BatchedEngine refcount is still > 0. The model's MLX weight tensors stay active in Metal memory.
5. The settle barrier polls mx.get_active_memory() for up to 5 s and finds ~19–20 GB still pinned. It times out. Emergency reclaim also fails. A subsequent load attempt computes current + model_size > ceiling and returns 507.
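
The sequence above can be reproduced in miniature without MLX. The sketch below is only an illustration under stated assumptions: FakeEngine, stream() and reproduce() are hypothetical names, time.sleep() stands in for the blocking stop() call, and a weakref stands in for the active-memory reading.

```python
import asyncio
import gc
import time
import weakref


class FakeEngine:
    """Hypothetical stand-in for BatchedEngine; holds no real weights."""


async def stream(engine: FakeEngine, abort: asyncio.Event) -> None:
    # `engine` is a frame local, so it keeps the object alive for as long
    # as this coroutine stays suspended.
    try:
        await abort.wait()
    finally:
        pass  # per-request cleanup would run here


async def reproduce() -> tuple[bool, bool]:
    engine = FakeEngine()
    probe = weakref.ref(engine)
    abort = asyncio.Event()
    task = asyncio.create_task(stream(engine, abort))
    await asyncio.sleep(0)   # let the task start and park on abort.wait()

    abort.set()              # wake-up is scheduled, not yet executed
    time.sleep(0.01)         # blocking call: the event loop cannot run
    engine = None            # drop the pool-side reference
    gc.collect()
    alive_without_yield = probe() is not None

    await task               # give the suspended coroutine a chance to finish
    gc.collect()
    alive_after_yield = probe() is not None
    return alive_without_yield, alive_after_yield


if __name__ == "__main__":
    print(asyncio.run(reproduce()))
```

The first flag reports whether the object survived gc.collect() while the coroutine was still suspended, the second whether it survived once the coroutine had been allowed to run to completion.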

Reproducer (from real log)

Evicting 'HY-MT1.5-1.8B-4bit' to fit 'Qwen3.6-35B-A3B-oQ4' under memory ceiling (42.98GB > 37.44GB)
...
Unloaded model: HY-MT1.5-1.8B-4bit, freed=1001.26MB (expected>=0.00B), active_memory: 19.66GB (settled)
POST /v1/chat/completions → 507: Cannot load Qwen3.6-35B-A3B-oQ4: projected memory 42.00GB would exceed the memory ceiling 37.44GB (current: 21.35GB, model: 20.64GB)

The 19.66 GB of "active" memory is Qwen's weights held alive by a suspended streaming generator.
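
As a rough check of the numbers in this log, the admission test from the root-cause list (current + model_size > ceiling) can be replayed by hand. The function name below is hypothetical and the counterfactual assumes the full 19.66 GB would have been released; the values are copied from the log lines above.

```python
def would_exceed_ceiling(current_gb: float, model_gb: float, ceiling_gb: float) -> bool:
    """Hypothetical admission check: projected footprint versus the ceiling."""
    return current_gb + model_gb > ceiling_gb


CEILING_GB = 37.44   # memory ceiling from the log
MODEL_GB = 20.64     # Qwen3.6-35B-A3B-oQ4 size from the 507 response
CURRENT_GB = 21.35   # "current" value from the 507 response
STALE_GB = 19.66     # active memory reported right after the unload

# As logged: 21.35 + 20.64 = 41.99 GB (the log prints 42.00 GB, presumably
# computed from unrounded values), which is above the 37.44 GB ceiling.
rejected = would_exceed_ceiling(CURRENT_GB, MODEL_GB, CEILING_GB)

# Counterfactual: the same check if the stale 19.66 GB had been released.
# 1.69 + 20.64 = 22.33 GB, which fits under the ceiling.
accepted = not would_exceed_ceiling(CURRENT_GB - STALE_GB, MODEL_GB, CEILING_GB)

if __name__ == "__main__":
    print(rejected, accepted)
```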

Fix

Add five await asyncio.sleep(0) calls between entry.engine.stop() and entry.engine = None in _unload_engine().

asyncio.sleep(0) yields control to the event loop without sleeping, so each call gives the callbacks already in the ready queue a chance to run. Server streaming generators are at most a few frames deep, so five iterations are sufficient for them to process the abort error, enter their finally blocks, call _cleanup_request(), close, and drop their engine reference. At that point entry.engine is the only remaining reference, so setting entry.engine = None drops the BatchedEngine refcount to zero and CPython's reference counting frees it immediately. gc.collect() then finds nothing left holding the model, the MLX arrays are freed, Metal buffers are returned to the cache, and mx.clear_cache() releases them. The settle barrier succeeds.

# _unload_engine(), after entry.engine.stop() returns:

for _ in range(5):
    await asyncio.sleep(0)   # let pending generators release their engine ref

entry.engine = None          # refcount now 0 -> BatchedEngine freed immediately
gc.collect()                 # model tensors freed -> Metal buffers enter cache
await loop.run_in_executor(
    get_mlx_executor(), lambda: (mx.synchronize(), mx.clear_cache())
)
# settle barrier now sees active_memory drop as expected
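
To illustrate why more than one yield can be needed, the sketch below counts how many asyncio.sleep(0) passes it takes before a chain of nested tasks lets go of a shared object. It is a toy model, not the omlx code: FakeEngine, leaf(), wrap() and passes_until_released() are hypothetical names, and real streaming generators may unwind differently.

```python
import asyncio
import weakref


class FakeEngine:
    """Hypothetical stand-in for BatchedEngine."""


async def leaf(engine: FakeEngine, abort: asyncio.Event) -> None:
    await abort.wait()       # innermost frame, woken directly by the abort


async def wrap(inner: asyncio.Task, engine: FakeEngine) -> None:
    await inner              # outer frame, woken only after `inner` finishes


async def passes_until_released(depth: int, max_passes: int = 10) -> int:
    engine = FakeEngine()
    probe = weakref.ref(engine)
    abort = asyncio.Event()
    task = asyncio.create_task(leaf(engine, abort))
    for _ in range(depth - 1):
        task = asyncio.create_task(wrap(task, engine))
    await asyncio.sleep(0)   # let every task start and suspend

    abort.set()              # schedules only the innermost wake-up
    engine = None            # drop this function's own reference
    passes = 0
    while probe() is not None and passes < max_passes:
        await asyncio.sleep(0)   # one pass over the ready queue
        passes += 1
    return passes


if __name__ == "__main__":
    for depth in (1, 3, 5):
        print(depth, asyncio.run(passes_until_released(depth)))
```

In this toy model each pass unwinds one level of nesting, which is consistent with picking a small fixed count when the generators are only a few frames deep.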

Files changed

| File | Change |
| --- | --- |
| `omlx/engine_pool.py` | Add 5x `await asyncio.sleep(0)` in `_unload_engine()` between `stop()` and `entry.engine = None` |

Notes

The fix adds negligible wall time: asyncio.sleep(0) does not actually sleep, it only hands control back to the event loop for one scheduling pass.
It does not change the settle barrier logic, the emergency reclaim path, or any other behaviour.
The underlying cause of EngineCore.close() blocking the event loop during Metal shutdown is a separate issue; this fix works around it at the call site without requiring changes to the Metal/MLX shutdown path.
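
For context on that separate issue, the sketch below shows in general terms why a synchronous .result() on an executor future stalls other coroutines, while awaiting the same future does not. It is not a proposed change to EngineCore.close(): the executor, slow_shutdown(), ticker() and measure() are hypothetical stand-ins, and a ThreadPoolExecutor with one worker is assumed to approximate the single-threaded MLX executor.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the single-threaded MLX executor.
executor = ThreadPoolExecutor(max_workers=1)


def slow_shutdown() -> None:
    time.sleep(0.05)         # pretend shutdown work on the executor thread


async def ticker(counter: list, stop: asyncio.Event) -> None:
    # Counts how often the event loop gets to run this coroutine.
    while not stop.is_set():
        counter[0] += 1
        await asyncio.sleep(0)


async def measure(blocking: bool) -> int:
    counter = [0]
    stop = asyncio.Event()
    task = asyncio.create_task(ticker(counter, stop))
    await asyncio.sleep(0)   # let the ticker start
    before = counter[0]

    fut = executor.submit(slow_shutdown)
    if blocking:
        fut.result()         # blocks the loop thread until the work is done
    else:
        await asyncio.wrap_future(fut)   # suspends only this coroutine

    ticks = counter[0] - before
    stop.set()
    await task
    return ticks


if __name__ == "__main__":
    print("blocking:", asyncio.run(measure(True)))
    print("awaited:", asyncio.run(measure(False)))
```

With the blocking call the ticker makes no progress while the executor works, which is the same starvation the streaming generators experience during stop().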

…se engine ref before gc

When _unload_engine() calls entry.engine.stop(), EngineCore.close() blocks the asyncio event loop while submitting scheduler.shutdown() and scheduler.deep_reset() to the single-threaded MLX executor via .result().

During that blocking period, server-side streaming generators that hold a local reference to the BatchedEngine cannot run: they are suspended at their next yield point, waiting to process the abort signal sent by abort_all_requests(). By the time stop() returns, these generator frames are still alive and their 'engine' local variable keeps the BatchedEngine's refcount above zero.

Consequently, when entry.engine = None is set and gc.collect() fires immediately after, the BatchedEngine (and its self._model reference) cannot be collected. The model's ~20 GB of MLX weight tensors remain "active" in Metal memory, the settle barrier times out, and subsequent load attempts fail with 507 because the active footprint still exceeds the ceiling.

Fix: after stop() returns, yield to the asyncio event loop a few times before clearing entry.engine. This drains the ready queue, allowing pending generator tear-down coroutines to run and drop their engine references. With those references gone, entry.engine = None triggers immediate CPython deallocation and gc.collect() finds nothing left to hold the model in active memory.

jundot · 2026-06-02T03:34:36Z

Thanks, this matches the unload path I checked: the memory enforcer aborts active requests before eviction, and the streaming cleanup runs from EngineCore.stream_outputs() once the generator gets a chance to resume.

Yielding before clearing the pool reference is a small, low-risk mitigation for the stale engine reference held by suspended streaming frames. CI is green, and this looks good to me. One thing I may fold into a follow-up is a regression test or stronger drain around slow StreamingResponse consumers, since the fixed sleep(0) count is still a heuristic.

This looks good to me, and I am going to merge it.

jundot merged commit 6eaba85 into jundot:main Jun 2, 2026
4 checks passed

fqx deleted the fix/gc-yield-after-engine-stop branch June 2, 2026 03:36

This was referenced Jun 7, 2026

fix: reclaim hot cache memory retained after model unload #1712

Closed

fix: reclaim hot cache memory retained after model unload #1713

Merged
