fix(model-tools): reuse persistent async bridge loop#16573
Open
chezzdev wants to merge 6 commits into
Open
Conversation
382152c to
ea06a63
Compare
c26200d to
d5f448c
Compare
d5f448c to
41e0889
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fixes the
model_tools._run_async()running-loop branch so gateway/async callers reuse a persistent bridge event loop instead of creating and closing a freshasyncio.run()loop per call.The old running-loop branch created a disposable worker thread and event loop for every async-context tool call. Cached async clients such as
AsyncOpenAI/httpx can outlive those short-lived loops, which creates stale-loop cleanup hazards and descriptor churn in long-lived gateway processes.Fixes #16570.
Architectural note
This PR keeps the architectural direction aligned with the rest of
model_tools._run_async(): main-thread and worker-thread paths already use persistent loops, and this change gives running-loop callers the same stable-loop behavior viaasyncio.run_coroutine_threadsafe().It also preserves the important timeout-cancellation behavior added by #17428. That PR fixed the "timed-out coroutine keeps running in a worker thread" bug, but kept the per-call worker-loop bridge. This PR keeps cancellation semantics while removing the per-call loop churn that #16570 reports.
e7feb84independently attempted the same dedicated bridge-loop shape, which is useful signal that the architecture matches the issue. This implementation hardens that idea with locking, startup failure cleanup, explicit CLI/gateway shutdown, nested-call protection, health-checked bridge retirement, cancellation-safe one-off fallback, retired-loop closure, and regression coverage.Related work checked
e7feb84: referenced from [Bug]: model_tools async bridge recreates loops in running-loop contexts #16570; same dedicated bridge-loop direction, but without the lifecycle and cancellation hardening in this PR.Changes Made
model_tools.py: add a dedicated persistent async bridge loop for callers already inside a running event loop.model_tools.py: add startup failure handling so a dead bridge thread is not published as usable state.model_tools.py: avoid self-deadlock when_run_async()is called from code already executing on the bridge loop by using a cancellation-safe one-off worker loop.model_tools.py: on timeout, cancel the scheduled bridge future and retire the shared bridge only if a health probe shows the bridge loop is stuck.model_tools.py: close retired bridge loops after their bridge thread exits, and keep normal shutdown from double-closing thread-owned loops.cli.pyandgateway/run.py: callshutdown_async_bridge_loop()from existing cleanup paths after cached auxiliary clients are closed.How to Test
Checklist
Code
pytest tests/ -qand all tests passDocumentation & Housekeeping
cli-config.yaml.exampleif I added/changed config keys -- N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows -- N/A