Skip to content

fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure#17428

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-d8f0307e
Apr 29, 2026
Merged

fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure#17428
teknium1 merged 1 commit into
mainfrom
hermes/hermes-d8f0307e

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Tool timeouts (300s default) no longer leak worker threads, and tool failures now land in errors.log with a full traceback.

Root cause of the leak: _run_async()'s running-loop branch bridges sync tool handlers by spawning a worker thread and doing future.result(timeout=300). When a coroutine ran past the timeout, future.cancel() was called — but ThreadPoolExecutor.cancel() is a no-op on a running future (it only cancels not-yet-started work). pool.shutdown(wait=False, cancel_futures=True) let the caller proceed, but the worker thread kept running the coroutine until it returned on its own. Every tool timeout leaked one thread; cumulative in long-lived gateway / RL sessions.

Changes

  • model_tools.py _run_async(): replace bare asyncio.run submit with a worker wrapper that owns its event loop. On timeout, schedule task.cancel() on that loop via call_soon_threadsafe, then pool.shutdown(wait=False) so the caller returns immediately. Coroutine observes CancelledError at its next await and the worker exits cleanly.
  • model_tools.py handle_function_call(): logger.errorlogger.exception so tool failures produce a full traceback in errors.log.
  • tests/test_model_tools_async_bridge.py: update test_timeout_uses_nonblocking_executor_shutdown to match the new shutdown semantics (must stay wait=False; must submit a worker wrapper not bare asyncio.run). Add test_timeout_cancels_coroutine_in_worker_loop that asserts the coroutine actually sees CancelledError on timeout and the caller returns in under 3s.
  • scripts/release.py: add AUTHOR_MAP entry for @0z1-ghb.

Why not the original PR's approach

#17420 proposed pool.shutdown(wait=True) in both the TimeoutError branch and the finally. That converts the leak into a hang: the caller would block indefinitely waiting for the same stuck coroutine that just timed out. E2E-verified: with a coroutine that runs a 30s time.sleep, wait=True blocks the caller for the full 30s; wait=False with in-loop cancel returns in ~1s.

Validation

Targeted test pass (12/12):

scripts/run_tests.sh tests/test_model_tools_async_bridge.py
========================  12 passed, 1 warning in 1.86s  ========================

E2E exercised outside the test suite:

Scenario Before (leak) PR #17420 (hang) This fix
Cancellable coro past timeout caller returns; worker thread leaks until coro finishes caller hangs ~= duration of stuck coro caller returns ~= timeout; coro sees CancelledError; worker exits
Uncancellable blocking coro (raw time.sleep) caller returns; worker leaks caller hangs caller returns; worker runs to completion (no worse than before, but bounded to that one coro)

Closes #17420 (issue identified correctly; original fix would have introduced a user-visible hang).

Co-authored-by: 0z! 162235745+0z1-ghb@users.noreply.github.com

…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: #17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
@teknium1 teknium1 merged commit b0435cc into main Apr 29, 2026
11 of 12 checks passed
@teknium1 teknium1 deleted the hermes/hermes-d8f0307e branch April 29, 2026 12:00
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/tools Tool registry, model_tools, toolsets labels Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants