fix: yield event loop after engine stop so streaming generators release engine ref before gc#1593
Merged
Merged
Conversation
…se engine ref before gc When _unload_engine() calls entry.engine.stop(), EngineCore.close() blocks the asyncio event loop while submitting scheduler.shutdown() and scheduler.deep_reset() to the single-threaded MLX executor via .result(). During that blocking period, server-side streaming generators that hold a local reference to the BatchedEngine cannot run — they are suspended at their next yield point, waiting to process the abort signal sent by abort_all_requests(). By the time stop() returns, these generator frames are still alive and their 'engine' local variable keeps the BatchedEngine's refcount above zero. Consequently, when entry.engine = None is set and gc.collect() fires immediately after, the BatchedEngine (and its self._model reference) cannot be collected. The model's ~20 GB of MLX weight tensors remain "active" in Metal memory, the settle barrier times out, and subsequent load attempts fail with 507 because the active footprint still exceeds the ceiling. Fix: after stop() returns, yield to the asyncio event loop a few times before clearing entry.engine. This drains the ready queue, allowing pending generator tear-down coroutines to run and drop their engine references. With refcount at zero, entry.engine = None triggers immediate CPython deallocation and gc.collect() finds nothing left to hold the model in active memory.
Owner
|
Thanks, this matches the unload path I checked: the memory enforcer aborts active requests before eviction, and the streaming cleanup runs from EngineCore.stream_outputs() once the generator gets a chance to resume. Yielding before clearing the pool reference is a small, low-risk mitigation for the stale engine reference held by suspended streaming frames. CI is green, and this looks good to me. One thing I may fold into a follow-up is a regression test or stronger drain around slow StreamingResponse consumers, since the fixed sleep(0) count is still a heuristic. This looks good to me, and I am going to merge it. |
This was referenced Jun 7, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a model is evicted under memory pressure, its ~20 GB of MLX weight tensors are not freed in time, the settle barrier times out, and the next load attempt fails with 507 even though the dashboard shows no loaded models.
Root cause
The sequence that triggers this is:
process_memory_enforcer._check_and_enforce()detects memory pressure and callsabort_all_requests()on the victim model, which sets asyncioEventobjects for every active request — scheduling server-side streaming generators in the asyncio ready queue._unload_engine()is then called. Insideentry.engine.stop(),EngineCore.close()runs synchronously (blocking the event loop) to submitscheduler.shutdown()andscheduler.deep_reset()to the single-threaded MLX executor via.result().While the event loop is blocked, the streaming generators scheduled in step 1 cannot run. They remain suspended, each holding a local
enginevariable that keeps theBatchedEngine's Python refcount above zero.stop()returns. We setentry.engine = Noneand callgc.collect()immediately — but the generators are still alive.BatchedEnginerefcount is still > 0. The model's MLX weight tensors stay active in Metal memory.The settle barrier polls
mx.get_active_memory()for up to 5 s and finds ~19–20 GB still pinned. It times out. Emergency reclaim also fails. A subsequent load attempt computescurrent + model_size > ceilingand returns 507.Reproducer (from real log)
The 19.66 GB of "active" memory is Qwen's weights held alive by a suspended streaming generator.
Fix
Add five
await asyncio.sleep(0)calls betweenentry.engine.stop()andentry.engine = Nonein_unload_engine().asyncio.sleep(0)yields control to the event loop without sleeping, draining the ready queue. Server streaming generators are at most a few frames deep, so five iterations are sufficient for them to process the abort error, enter theirfinallyblocks, call_cleanup_request(), close, and drop theirenginereference. By the time we setentry.engine = None, theBatchedEnginerefcount is zero and CPython's reference counter frees it immediately.gc.collect()then finds no remaining live references, the MLX arrays are freed, Metal buffers are returned to the cache, andmx.clear_cache()releases them. The settle barrier succeeds.Files changed
omlx/engine_pool.pyawait asyncio.sleep(0)in_unload_engine()betweenstop()andentry.engine = NoneNotes
EngineCore.close()blocking the event loop during Metal shutdown is a separate issue; this fix works around it at the call site without requiring changes to the Metal/MLX shutdown path.