fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure by teknium1 · Pull Request #17428 · NousResearch/hermes-agent

teknium1 · 2026-04-29T11:56:53Z

Tool timeouts (300s default) no longer leak worker threads, and tool failures now land in errors.log with a full traceback.

Root cause of the leak: _run_async()'s running-loop branch bridges sync tool handlers by spawning a worker thread and doing future.result(timeout=300). When a coroutine ran past the timeout, future.cancel() was called — but ThreadPoolExecutor.cancel() is a no-op on a running future (it only cancels not-yet-started work). pool.shutdown(wait=False, cancel_futures=True) let the caller proceed, but the worker thread kept running the coroutine until it returned on its own. Every tool timeout leaked one thread; cumulative in long-lived gateway / RL sessions.

Changes

model_tools.py _run_async(): replace bare asyncio.run submit with a worker wrapper that owns its event loop. On timeout, schedule task.cancel() on that loop via call_soon_threadsafe, then pool.shutdown(wait=False) so the caller returns immediately. Coroutine observes CancelledError at its next await and the worker exits cleanly.
model_tools.py handle_function_call(): logger.error → logger.exception so tool failures produce a full traceback in errors.log.
tests/test_model_tools_async_bridge.py: update test_timeout_uses_nonblocking_executor_shutdown to match the new shutdown semantics (must stay wait=False; must submit a worker wrapper not bare asyncio.run). Add test_timeout_cancels_coroutine_in_worker_loop that asserts the coroutine actually sees CancelledError on timeout and the caller returns in under 3s.
scripts/release.py: add AUTHOR_MAP entry for @0z1-ghb.

Why not the original PR's approach

#17420 proposed pool.shutdown(wait=True) in both the TimeoutError branch and the finally. That converts the leak into a hang: the caller would block indefinitely waiting for the same stuck coroutine that just timed out. E2E-verified: with a coroutine that runs a 30s time.sleep, wait=True blocks the caller for the full 30s; wait=False with in-loop cancel returns in ~1s.

Validation

Targeted test pass (12/12):

scripts/run_tests.sh tests/test_model_tools_async_bridge.py
========================  12 passed, 1 warning in 1.86s  ========================

E2E exercised outside the test suite:

Scenario	Before (leak)	PR #17420 (hang)	This fix
Cancellable coro past timeout	caller returns; worker thread leaks until coro finishes	caller hangs ~= duration of stuck coro	caller returns ~= timeout; coro sees CancelledError; worker exits
Uncancellable blocking coro (raw `time.sleep`)	caller returns; worker leaks	caller hangs	caller returns; worker runs to completion (no worse than before, but bounded to that one coro)

Closes #17420 (issue identified correctly; original fix would have introduced a user-visible hang).

Co-authored-by: 0z! 162235745+0z1-ghb@users.noreply.github.com

…+ log full traceback _run_async() bridges sync tool handlers to async code. When the handler is invoked from inside a running event loop (gateway / nested async), it spawns a worker thread and blocks on future.result(timeout=300). Before this change, a coroutine that ran past 300s leaked its worker thread: - future.cancel() is a no-op on a running ThreadPoolExecutor future (cancel only works on not-yet-started work). - pool.shutdown(wait=False, cancel_futures=True) let the caller proceed but the worker kept running the coroutine until it returned on its own. Every tool timeout leaked one thread. In long-lived gateway / RL sessions this is cumulative. The fix replaces bare asyncio.run() with a worker wrapper that creates its own event loop. On timeout, _run_async schedules task.cancel() on that loop via call_soon_threadsafe, then shuts the pool down with wait=False so the caller returns immediately. The coroutine observes CancelledError at its next await and the worker thread exits cleanly. Also switches logger.error() to logger.exception() in the top-level handle_function_call() except block so tool failures produce full stack traces in errors.log instead of just the message. Related: #17420 (contributor flagged the leak; the original fix used pool.shutdown(wait=True) which would have converted the leak into a hang — caller blocks forever on the same stuck coroutine). Credit for identifying the leak goes to the contributor. Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>

teknium1 merged commit b0435cc into main Apr 29, 2026
11 of 12 checks passed

teknium1 deleted the hermes/hermes-d8f0307e branch April 29, 2026 12:00

teknium1 mentioned this pull request Apr 29, 2026

fix(core): resolve thread leak, lost traceback, and type violation #17420

Closed

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/tools Tool registry, model_tools, toolsets labels Apr 29, 2026

chezzdev mentioned this pull request Apr 30, 2026

fix(model-tools): reuse persistent async bridge loop #16573

Open

11 tasks

github-actions Bot mentioned this pull request May 1, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.23 to v2026.4.30 Docker-Hub-sirmark/docker-hermes-agent#4

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure#17428

fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure#17428
teknium1 merged 1 commit into
mainfrom
hermes/hermes-d8f0307e

teknium1 commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

teknium1 commented Apr 29, 2026

Changes

Why not the original PR's approach

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants