fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure#17428
Merged
Merged
Conversation
…+ log full traceback
_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).
Before this change, a coroutine that ran past 300s leaked its worker
thread:
- future.cancel() is a no-op on a running ThreadPoolExecutor future
(cancel only works on not-yet-started work).
- pool.shutdown(wait=False, cancel_futures=True) let the caller
proceed but the worker kept running the coroutine until it
returned on its own.
Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.
The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.
Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.
Related: #17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.
Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
11 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tool timeouts (300s default) no longer leak worker threads, and tool failures now land in errors.log with a full traceback.
Root cause of the leak:
_run_async()'s running-loop branch bridges sync tool handlers by spawning a worker thread and doingfuture.result(timeout=300). When a coroutine ran past the timeout,future.cancel()was called — butThreadPoolExecutor.cancel()is a no-op on a running future (it only cancels not-yet-started work).pool.shutdown(wait=False, cancel_futures=True)let the caller proceed, but the worker thread kept running the coroutine until it returned on its own. Every tool timeout leaked one thread; cumulative in long-lived gateway / RL sessions.Changes
model_tools.py_run_async(): replace bareasyncio.runsubmit with a worker wrapper that owns its event loop. On timeout, scheduletask.cancel()on that loop viacall_soon_threadsafe, thenpool.shutdown(wait=False)so the caller returns immediately. Coroutine observesCancelledErrorat its nextawaitand the worker exits cleanly.model_tools.pyhandle_function_call():logger.error→logger.exceptionso tool failures produce a full traceback in errors.log.tests/test_model_tools_async_bridge.py: updatetest_timeout_uses_nonblocking_executor_shutdownto match the new shutdown semantics (must staywait=False; must submit a worker wrapper not bareasyncio.run). Addtest_timeout_cancels_coroutine_in_worker_loopthat asserts the coroutine actually seesCancelledErroron timeout and the caller returns in under 3s.scripts/release.py: add AUTHOR_MAP entry for @0z1-ghb.Why not the original PR's approach
#17420 proposed
pool.shutdown(wait=True)in both the TimeoutError branch and thefinally. That converts the leak into a hang: the caller would block indefinitely waiting for the same stuck coroutine that just timed out. E2E-verified: with a coroutine that runs a 30stime.sleep,wait=Trueblocks the caller for the full 30s;wait=Falsewith in-loop cancel returns in ~1s.Validation
Targeted test pass (12/12):
E2E exercised outside the test suite:
time.sleep)Closes #17420 (issue identified correctly; original fix would have introduced a user-visible hang).
Co-authored-by: 0z! 162235745+0z1-ghb@users.noreply.github.com