Fix engine-pool acquire-vs-use eviction race (#1667) + active-request counter leak (#1595)#1668
Conversation
…equest counter leak (jundot#1595) Two related memory-enforcer bugs that intermittently 500 /v1/embeddings and /v1/rerank with `RuntimeError: Engine not started. Call start() first.` jundot#1667 (acquire-vs-use race): EnginePool.get_engine() returns the engine then releases the pool lock. For non-streaming engines (embedding/reranker) the in-use signal (_active_count via _begin_activity) is registered only AFTER the started guard inside embed()/rerank(), so has_active_requests() reads False for the whole acquire->use window. The ~1s memory enforcer poll can take the same lock and evict the idle-looking engine mid-request. Fix: an atomic in-use lease taken under the pool lock at acquire time. - EngineEntry.in_use refcount; get_engine(_lease=True) increments at both return points; release_engine() decrements (floored at 0); acquire() asynccontextmanager leases and always releases in finally (exception-safe). - All three eviction paths skip in_use > 0: _find_lru_victim, _is_idle_for_prefill_eviction, check_ttl_expirations. - server.py threads _lease/_leased_out so the EXACT pool-resolved model_id (incl. default-model fallback) is captured; releases on validation failure; /v1/embeddings and /v1/rerank wrap the real embed()/rerank() in `async with pool.acquire(model_id)`. LLM/STT/TTS/STS handlers unchanged. jundot#1595 (counter leak): the immediate-abort eviction stops a non-streaming engine without the per-request completion callbacks, leaking _active_count so a torn-down engine looks permanently busy. Fix: BaseNonStreamingEngine._reset_activity_tracking() zeros the counter + clears activities, called (getattr-guarded) from EnginePool._unload_engine() after stop(). Idle-engine eviction is fully preserved (only genuinely in-use/leased engines are protected). Adds regression tests in tests/test_engine_pool.py and updates the integration MockEnginePool to mirror the new lease API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Quick context for review, in case it helps prioritize: This was the root cause behind recurring "embedder degraded under load" behavior — the small embedding engine consistently loses the eviction lottery during the acquire→use window, so it surfaces as the embedder even though it's a general engine-pool race. It pairs with #1595 (the counter-leak sibling): both come from eviction being gated only by A few things that should make it low-risk to review:
Happy to split this into two PRs (#1667 race vs #1595 counter), rename anything to match your conventions, or add cases you'd like covered. |
|
Thanks for the extra context. The acquire-to-use race you described matches what I see in the code: non-streaming engines only register activity after the started guard, so has_active_requests() cannot protect an engine handle that was just returned by get_engine(). I checked the lease path against the memory-enforcer eviction paths and the embeddings/rerank server wrappers. Taking the lease under the pool lock and skipping leased entries in LRU, prefill, and TTL eviction closes that window while preserving idle eviction. I also verified the teardown reset for non-streaming activity counters. The added regression coverage targets the right invariants, and my focused local runs passed. This looks good to me. |
Summary
Fixes two related memory-enforcer bugs that intermittently 500
/v1/embeddingsand/v1/rerankwith:Both stem from the same gap: eviction is gated only by
has_active_requests(), which is blind during the window between acquiring an engine and registering activity on it.Fixes #1667. Closely related to #1595 (this PR also defuses it).
Bug A — acquire-vs-use eviction race (#1667)
EnginePool.get_engine()returns the engine handle, then releases the pool lock. The request handler then runs outside the lock (engine = await get_embedding_engine(...)... laterawait engine.embed(...)). For the non-streaming engines (embedding/reranker), the in-use signal (_active_countvia_begin_activity) is registered only after theif self._model is None: raise RuntimeError("Engine not started...")guard insideembed()/rerank(). Sohas_active_requests()returnsFalsefor the entire span from acquire until_begin_activity.Meanwhile the
ProcessMemoryEnforcer's ~1s poll takes the same pool lock and_find_lru_victim()/_is_idle_for_prefill_eviction()/check_ttl_expirations()see the engine as idle, so they evict it. The suspended request resumes against a torn-down engine and raises. The embedding engine (smallest, usually oldestlast_access) is the usual victim — the recurring cause of "embedder degraded" episodes for downstream consumers.Bug B — active-request counter leak (#1595)
The immediate-abort eviction stops a non-streaming engine without the normal per-request completion callbacks (
_end_activity), so_active_countcan be left non-zero — a phantom "busy" count that corrupts the status API and (viahas_active_requests()) can make a stale engine look permanently non-evictable.Fix — atomic in-use lease + teardown reset
Take the in-use signal at acquire time, under the pool lock, so eviction can never tear down an engine a request holds or is acquiring:
EngineEntry.in_use: int = 0refcount.get_engine(..., _lease: bool = False)incrementsentry.in_useunderself._lockat both return points (already-loaded + post-load).release_engine(model_id)decrements under the lock (floored at 0);acquire(model_id)@asynccontextmanagerleases on enter and always releases infinally(so a request error can't leak the lease).in_use > 0:_find_lru_victim,_is_idle_for_prefill_eviction,check_ttl_expirations(in addition tohas_active_requests()).server.pythreads_lease/_leased_outso the exact pool-resolved model_id (incl. the default-model fallback) is captured, and releases the lease on validation failure./v1/embeddingsand/v1/rerankwrap the realembed()/rerank()call inasync with pool.acquire(model_id)(the embeddings lease is taken inside the deferredStreamingResponsecoroutine, which is where the engine is actually used). LLM/STT/TTS/STS handlers are unchanged. The non-lease path keeps the originalpool.get_engine(model_id)call contract.BaseNonStreamingEngine._reset_activity_tracking()zeros_active_countand clears_activitiesunder_active_lock, called (getattr-guarded) fromEnginePool._unload_engine()afterstop().This closes both sub-windows (acquire→activity, and the guard→increment gap inside
embed()) and preserves eviction of genuinely idle engines — only engines that are leased/in-use are protected.Testing
Tests run with the repo's
pytestsuites (Python 3.11, mlx available):tests/test_engine_pool.py+tests/test_embedding.py: 154 passed (incl. 10 new regression tests).TestEnginePoolInUseLease,TestResetActivityTracking): the three eviction paths skipin_use > 0;acquire()leases then releases on success and on exception;release_enginefloors at 0;_reset_activity_trackingclears a leaked counter;_unload_engineinvokes the reset on teardown. Verified these fail against the pre-fix code and pass with the fix (non-vacuous).tests/test_server.py,tests/test_rerank_models.py, integration endpoint tests): green. The integrationMockEnginePoolwas updated to mirror the new lease API (_leasekwarg + no-oprelease_engine).-m "not slow and not integration"plus the explicit integration endpoint tests): green except two pre-existingtest_install_detection.pyfailures that are environment artifacts of running under a Homebrew-installed interpreter (is_homebrew()detects the Cellar path); they fail identically on unmodifiedmainand are unrelated to this change.🤖 Generated with Claude Code