fix(core): resolve thread leak, lost traceback, and type violation by 0z1-ghb · Pull Request #17420 · NousResearch/hermes-agent

0z1-ghb · 2026-04-29T11:43:17Z

Overview

This PR addresses three distinct bugs in the core tool execution and redaction paths:

A ThreadPoolExecutor thread leak in _run_coro_in_fresh_thread where timed-out coroutines continued running in orphaned threads.
A lost traceback in handle_function_call where logger.error() discarded the full stack trace.
A type contract violation in redact_sensitive_text where None was returned despite a -> str annotation.

Key Changes

model_tools.py — Changed pool.shutdown(wait=False, cancel_futures=True) to pool.shutdown(wait=True). The previous wait=False caused worker threads to be orphaned on timeout, leaking threads on every timeout event.
model_tools.py — Replaced logger.error() with logger.exception(). Ensures full traceback is written to errors.log when a tool handler raises an unexpected exception.
agent/redact.py — Changed return None to return "". The function signature declares -> str, so returning None violated the type contract and could cause AttributeError in callers.
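To illustrate the logging change, here is a minimal sketch comparing the two calls inside an except block. The logger names, the message text, and the `capture_log` helper are made up for this example and are not taken from model_tools.py.

```python
import io
import logging


def capture_log(use_exception: bool) -> str:
    """Log a caught ZeroDivisionError and return what the handler wrote."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{use_exception}")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.StreamHandler(stream))
    try:
        1 / 0
    except Exception as exc:
        if use_exception:
            # logger.exception logs at ERROR level and appends the active traceback.
            logger.exception("Tool call failed: %s", exc)
        else:
            # logger.error records only the formatted message.
            logger.error("Tool call failed: %s", exc)
    return stream.getvalue()


without_tb = capture_log(use_exception=False)
with_tb = capture_log(use_exception=True)
```

Only `with_tb` contains a `Traceback (most recent call last):` section; `without_tb` holds just the one-line message.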

Impact

Resource stability: Eliminates cumulative thread leaks in long-running gateway/RL sessions.
Debuggability: Tool failures now produce actionable stack traces.
Type safety: redact_sensitive_text now honors its type contract.

All changes are backward-compatible.

…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler is invoked from inside a running event loop (gateway / nested async), it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker thread:

- future.cancel() is a no-op on a running ThreadPoolExecutor future (cancel only works on not-yet-started work).
- pool.shutdown(wait=False, cancel_futures=True) let the caller proceed but the worker kept running the coroutine until it returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that creates its own event loop. On timeout, _run_async schedules task.cancel() on that loop via call_soon_threadsafe, then shuts the pool down with wait=False so the caller returns immediately. The coroutine observes CancelledError at its next await and the worker thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level handle_function_call() except block so tool failures produce full stack traces in errors.log instead of just the message.

Related: #17420 (contributor flagged the leak; the original fix used pool.shutdown(wait=True), which would have converted the leak into a hang: the caller blocks forever on the same stuck coroutine). Credit for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
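The cancel-on-timeout approach described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in #17428: `run_coro_with_timeout`, `slow_tool`, the `state` dict, and the short timeouts are all hypothetical names and values chosen for the example.

```python
import asyncio
import concurrent.futures
import threading


def run_coro_with_timeout(coro, timeout: float):
    """Run `coro` on a worker thread that owns its event loop; cancel it on timeout."""
    state = {}                 # filled in by the worker: its loop and task
    ready = threading.Event()  # set once the loop and task exist

    def worker():
        loop = asyncio.new_event_loop()
        try:
            asyncio.set_event_loop(loop)
            task = loop.create_task(coro)
            state["loop"], state["task"] = loop, task
            ready.set()
            return loop.run_until_complete(task)
        finally:
            loop.close()

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(worker)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        if ready.wait(timeout=1):
            try:
                # Ask the worker's own loop to cancel the task; this is thread-safe.
                state["loop"].call_soon_threadsafe(state["task"].cancel)
            except RuntimeError:
                pass  # the loop already closed: the coroutine finished on its own
        raise
    finally:
        pool.shutdown(wait=False)  # never block the caller on the worker


# Hypothetical slow tool used to exercise the timeout path.
cancelled = threading.Event()


async def slow_tool():
    try:
        await asyncio.sleep(30)
    except asyncio.CancelledError:
        cancelled.set()  # the coroutine saw the cancellation at its await
        raise


try:
    run_coro_with_timeout(slow_tool(), timeout=0.2)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True

worker_saw_cancel = cancelled.wait(timeout=2)
```

In this sketch the caller gets its TimeoutError right away, and the worker thread ends shortly afterwards because the cancelled task makes `run_until_complete` return.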

teknium1 · 2026-04-29T12:01:03Z

Thanks for catching the thread leak — that diagnosis was correct. A reworked fix is in #17428 (merged b0435cc) with you as co-author.

Why I didn't take the patch as-submitted: pool.shutdown(wait=True) in the TimeoutError branch (and in finally) would have blocked the caller indefinitely on the same stuck coroutine that just timed out, converting the thread leak into a user-visible hang. future.cancel() is a no-op on a running ThreadPoolExecutor future, so there was no cancellation to wait on — the caller would just wait for the coro to finish naturally.
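The two behaviours described above can be reproduced with the standard library alone. The sketch below uses a blocking function as a stand-in for the stuck coroutine; `stuck_work`, the events, and the timings are assumptions made for the demonstration.

```python
import concurrent.futures
import threading
import time

started = threading.Event()
release = threading.Event()


def stuck_work():
    started.set()
    release.wait()  # stand-in for a coroutine that does not finish by itself
    return "done"


pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(stuck_work)
started.wait()  # make sure the work is already running

try:
    future.result(timeout=0.1)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True

# cancel() only succeeds for work that has not started yet.
cancel_succeeded = future.cancel()
still_running = future.running()

# shutdown(wait=True) blocks until the worker returns. A timer releases the
# work after 0.3s so this demonstration terminates; a truly stuck coroutine
# would keep the caller blocked indefinitely.
threading.Timer(0.3, release.set).start()
start = time.monotonic()
pool.shutdown(wait=True)
blocked_for = time.monotonic() - start
```

Here `cancel_succeeded` is False and `still_running` is True, and `shutdown(wait=True)` only returns once the timer lets the worker finish.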

The new fix addresses the underlying issue: _run_async() now submits a worker wrapper that owns its event loop, so on timeout we schedule task.cancel() on that loop via call_soon_threadsafe and shut down with wait=False. The coroutine observes CancelledError at its next await and the worker exits cleanly: no leak, no hang. Your logger.error → logger.exception change is in there too. Your commit is referenced via a Co-authored-by: trailer.

Skipped the return None → return "" change in redact.py; no caller passes None in the current tree and returning an empty string silently drops the absence signal.
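One way to keep the absence signal while making the annotation honest is to declare the None case in the signature. The sketch below is only an illustration of that trade-off: `redact_text_sketch` and the token pattern are hypothetical, and the real patterns and signature in agent/redact.py are not shown in this thread.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration only.
_TOKEN_RE = re.compile(r"sk-[A-Za-z0-9]{8,}")


def redact_text_sketch(text: Optional[str]) -> Optional[str]:
    """Redact token-like substrings; pass None through unchanged."""
    if text is None:
        # Preserve "no text at all" as distinct from "empty text".
        return None
    return _TOKEN_RE.sub("[REDACTED]", text)
```

With this shape, `redact_text_sketch(None)` stays None and `redact_text_sketch("")` stays an empty string, so callers can still tell the two apart.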

0z1-ghb · 2026-04-29T12:13:23Z

Thanks for the detailed explanation and the fix. Appreciate the co-author credit and including the logger change.

alt-glitch · 2026-04-29T12:44:56Z

Superseded by #17428 (merged): same thread-leak fix, but with the correct cancel-coroutine approach instead of wait=True, which would hang.

alt-glitch · 2026-04-29T12:46:28Z

Superseded by #17428


0z1-ghb added 2 commits April 29, 2026 14:28

fix(core): resolve thread leak, lost traceback, and type violation

262315d

Update redact.py

f152263

teknium1 mentioned this pull request Apr 29, 2026

fix(model_tools): cancel coroutine on timeout so worker thread exits + full traceback on tool failure #17428

Merged

teknium1 closed this in #17428 Apr 29, 2026

alt-glitch added labels Apr 29, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/tools (Tool registry, model_tools, toolsets)

teknium1 mentioned this pull request May 7, 2026

fix(model_tools): log async task cleanup exceptions instead of silent… #21156

Closed

23 tasks

0z1-ghb wants to merge 2 commits into NousResearch:main from 0z1-ghb:fix/thread-leak-and-traceback


3 participants
