Skip to content

Fix engine-pool acquire-vs-use eviction race (#1667) + active-request counter leak (#1595)#1668

Merged
jundot merged 1 commit into
jundot:mainfrom
Cmerrill1713:fix/engine-pool-acquire-use-lease
Jun 5, 2026
Merged

Fix engine-pool acquire-vs-use eviction race (#1667) + active-request counter leak (#1595)#1668
jundot merged 1 commit into
jundot:mainfrom
Cmerrill1713:fix/engine-pool-acquire-use-lease

Conversation

@Cmerrill1713

Copy link
Copy Markdown
Contributor

Summary

Fixes two related memory-enforcer bugs that intermittently 500 /v1/embeddings and /v1/rerank with:

RuntimeError: Engine not started. Call start() first.

Both stem from the same gap: eviction is gated only by has_active_requests(), which is blind during the window between acquiring an engine and registering activity on it.

Fixes #1667. Closely related to #1595 (this PR also defuses it).

Bug A — acquire-vs-use eviction race (#1667)

EnginePool.get_engine() returns the engine handle, then releases the pool lock. The request handler then runs outside the lock (engine = await get_embedding_engine(...) ... later await engine.embed(...)). For the non-streaming engines (embedding/reranker), the in-use signal (_active_count via _begin_activity) is registered only after the if self._model is None: raise RuntimeError("Engine not started...") guard inside embed()/rerank(). So has_active_requests() returns False for the entire span from acquire until _begin_activity.

Meanwhile the ProcessMemoryEnforcer's ~1s poll takes the same pool lock and _find_lru_victim() / _is_idle_for_prefill_eviction() / check_ttl_expirations() see the engine as idle, so they evict it. The suspended request resumes against a torn-down engine and raises. The embedding engine (smallest, usually oldest last_access) is the usual victim — the recurring cause of "embedder degraded" episodes for downstream consumers.

Bug B — active-request counter leak (#1595)

The immediate-abort eviction stops a non-streaming engine without the normal per-request completion callbacks (_end_activity), so _active_count can be left non-zero — a phantom "busy" count that corrupts the status API and (via has_active_requests()) can make a stale engine look permanently non-evictable.

Fix — atomic in-use lease + teardown reset

Take the in-use signal at acquire time, under the pool lock, so eviction can never tear down an engine a request holds or is acquiring:

  • EngineEntry.in_use: int = 0 refcount.
  • get_engine(..., _lease: bool = False) increments entry.in_use under self._lock at both return points (already-loaded + post-load).
  • release_engine(model_id) decrements under the lock (floored at 0); acquire(model_id) @asynccontextmanager leases on enter and always releases in finally (so a request error can't leak the lease).
  • All three eviction paths skip in_use > 0: _find_lru_victim, _is_idle_for_prefill_eviction, check_ttl_expirations (in addition to has_active_requests()).
  • server.py threads _lease / _leased_out so the exact pool-resolved model_id (incl. the default-model fallback) is captured, and releases the lease on validation failure. /v1/embeddings and /v1/rerank wrap the real embed()/rerank() call in async with pool.acquire(model_id) (the embeddings lease is taken inside the deferred StreamingResponse coroutine, which is where the engine is actually used). LLM/STT/TTS/STS handlers are unchanged. The non-lease path keeps the original pool.get_engine(model_id) call contract.
  • Memory enforcer eviction path leaks active_requests counter #1595: BaseNonStreamingEngine._reset_activity_tracking() zeros _active_count and clears _activities under _active_lock, called (getattr-guarded) from EnginePool._unload_engine() after stop().

This closes both sub-windows (acquire→activity, and the guard→increment gap inside embed()) and preserves eviction of genuinely idle engines — only engines that are leased/in-use are protected.

Testing

Tests run with the repo's pytest suites (Python 3.11, mlx available):

  • tests/test_engine_pool.py + tests/test_embedding.py: 154 passed (incl. 10 new regression tests).
  • New regression tests (TestEnginePoolInUseLease, TestResetActivityTracking): the three eviction paths skip in_use > 0; acquire() leases then releases on success and on exception; release_engine floors at 0; _reset_activity_tracking clears a leaked counter; _unload_engine invokes the reset on teardown. Verified these fail against the pre-fix code and pass with the fix (non-vacuous).
  • Server/rerank/embeddings suites (tests/test_server.py, tests/test_rerank_models.py, integration endpoint tests): green. The integration MockEnginePool was updated to mirror the new lease API (_lease kwarg + no-op release_engine).
  • Full default suite (-m "not slow and not integration" plus the explicit integration endpoint tests): green except two pre-existing test_install_detection.py failures that are environment artifacts of running under a Homebrew-installed interpreter (is_homebrew() detects the Cellar path); they fail identically on unmodified main and are unrelated to this change.

🤖 Generated with Claude Code

…equest counter leak (jundot#1595)

Two related memory-enforcer bugs that intermittently 500 /v1/embeddings and
/v1/rerank with `RuntimeError: Engine not started. Call start() first.`

jundot#1667 (acquire-vs-use race): EnginePool.get_engine() returns the engine then
releases the pool lock. For non-streaming engines (embedding/reranker) the
in-use signal (_active_count via _begin_activity) is registered only AFTER the
started guard inside embed()/rerank(), so has_active_requests() reads False for
the whole acquire->use window. The ~1s memory enforcer poll can take the same
lock and evict the idle-looking engine mid-request.

Fix: an atomic in-use lease taken under the pool lock at acquire time.
- EngineEntry.in_use refcount; get_engine(_lease=True) increments at both
  return points; release_engine() decrements (floored at 0); acquire()
  asynccontextmanager leases and always releases in finally (exception-safe).
- All three eviction paths skip in_use > 0: _find_lru_victim,
  _is_idle_for_prefill_eviction, check_ttl_expirations.
- server.py threads _lease/_leased_out so the EXACT pool-resolved model_id
  (incl. default-model fallback) is captured; releases on validation failure;
  /v1/embeddings and /v1/rerank wrap the real embed()/rerank() in
  `async with pool.acquire(model_id)`. LLM/STT/TTS/STS handlers unchanged.

jundot#1595 (counter leak): the immediate-abort eviction stops a non-streaming
engine without the per-request completion callbacks, leaking _active_count so a
torn-down engine looks permanently busy. Fix:
BaseNonStreamingEngine._reset_activity_tracking() zeros the counter + clears
activities, called (getattr-guarded) from EnginePool._unload_engine() after
stop().

Idle-engine eviction is fully preserved (only genuinely in-use/leased engines
are protected). Adds regression tests in tests/test_engine_pool.py and updates
the integration MockEnginePool to mirror the new lease API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Cmerrill1713

Copy link
Copy Markdown
Contributor Author

Quick context for review, in case it helps prioritize:

This was the root cause behind recurring "embedder degraded under load" behavior — the small embedding engine consistently loses the eviction lottery during the acquire→use window, so it surfaces as the embedder even though it's a general engine-pool race. It pairs with #1595 (the counter-leak sibling): both come from eviction being gated only by has_active_requests(), which is blind between acquiring an engine and registering its activity. This PR closes both with one mechanism — an atomic in_use lease taken under the pool lock at acquire and released in finally, plus a counter reset on teardown.

A few things that should make it low-risk to review:

  • Ported clean to current HEAD (0.4.2.dev2) — the three source files were byte-identical to 0.4.1.
  • Full suite green: 4,967 passed / 37 skipped (the only 2 failures are pre-existing test_install_detection env artifacts that fail on unmodified main too).
  • 10 new regression tests are non-vacuous — they fail on pristine source and pass with the patch (eviction skips a leased entry across all three paths; acquire() releases on success and exception; the counter resets on stop()).
  • Backward-compatible: _lease is only threaded when actually leasing, so existing get_engine callers and test mocks are unaffected; idle-engine eviction and the memory-guard watermarks are unchanged.

Happy to split this into two PRs (#1667 race vs #1595 counter), rename anything to match your conventions, or add cases you'd like covered.

@jundot

jundot commented Jun 5, 2026

Copy link
Copy Markdown
Owner

Thanks for the extra context. The acquire-to-use race you described matches what I see in the code: non-streaming engines only register activity after the started guard, so has_active_requests() cannot protect an engine handle that was just returned by get_engine().

I checked the lease path against the memory-enforcer eviction paths and the embeddings/rerank server wrappers. Taking the lease under the pool lock and skipping leased entries in LRU, prefill, and TTL eviction closes that window while preserving idle eviction. I also verified the teardown reset for non-streaming activity counters.

The added regression coverage targets the right invariants, and my focused local runs passed. This looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Engine pool evicts an acquired-but-not-yet-active engine → RuntimeError: Engine not started (acquire→use race; sibling of #1595)

2 participants