Skip to content

fix(core): resolve thread leak, lost traceback, and type violation#17420

Closed
0z1-ghb wants to merge 2 commits into
NousResearch:mainfrom
0z1-ghb:fix/thread-leak-and-traceback
Closed

fix(core): resolve thread leak, lost traceback, and type violation#17420
0z1-ghb wants to merge 2 commits into
NousResearch:mainfrom
0z1-ghb:fix/thread-leak-and-traceback

Conversation

@0z1-ghb

@0z1-ghb 0z1-ghb commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Overview

This PR addresses three distinct bugs in the core tool execution and redaction paths:

  1. A ThreadPoolExecutor thread leak in _run_coro_in_fresh_thread where timed-out coroutines continued running in orphaned threads.
  2. A lost traceback in handle_function_call where logger.error() discarded the full stack trace.
  3. A type contract violation in redact_sensitive_text where None was returned despite a -> str annotation.

Key Changes

  • model_tools.py — Changed pool.shutdown(wait=False, cancel_futures=True) to pool.shutdown(wait=True). The previous wait=False caused worker threads to be orphaned on timeout, leaking threads on every timeout event.

  • model_tools.py — Replaced logger.error() with logger.exception(). Ensures full traceback is written to errors.log when a tool handler raises an unexpected exception.

  • agent/redact.py — Changed return None to return "". The function signature declares -> str, so returning None violated the type contract and could cause AttributeError in callers.

Impact

  • Resource stability: Eliminates cumulative thread leaks in long-running gateway/RL sessions.
  • Debuggability: Tool failures now produce actionable stack traces.
  • Type safety: redact_sensitive_text now honors its type contract.

All changes are backward-compatible.

teknium1 added a commit that referenced this pull request Apr 29, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: #17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
@teknium1

Copy link
Copy Markdown
Contributor

Thanks for catching the thread leak — that diagnosis was correct. A reworked fix is in #17428 (merged b0435cc) with you as co-author.

Why I didn't take the patch as-submitted: pool.shutdown(wait=True) in the TimeoutError branch (and in finally) would have blocked the caller indefinitely on the same stuck coroutine that just timed out, converting the thread leak into a user-visible hang. future.cancel() is a no-op on a running ThreadPoolExecutor future, so there was no cancellation to wait on — the caller would just wait for the coro to finish naturally.

The new fix addresses the underlying issue: _run_async() now submits a worker wrapper that owns its event loop, so on timeout we schedule task.cancel() on that loop via call_soon_threadsafe and shut down with wait=False. The coroutine observes CancelledError at its next await and the worker exits cleanly — no leak, no hang. Your logger.error → logger.exception change is in there too. Your commit is referenced via Co-authored-by: trailer.

Skipped the return None → return "" change in redact.py; no caller passes None in the current tree and returning an empty string silently drops the absence signal.

@0z1-ghb

0z1-ghb commented Apr 29, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed explanation and the fix. Appreciate the co-author credit and including the logger change

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/tools Tool registry, model_tools, toolsets labels Apr 29, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Superseded by #17428 (merged) — same thread-leak fix but with correct cancel-coroutine approach instead of wait=True which would hang.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Superseded by #17428

donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
Seven74AI pushed a commit to Seven74AI/hermes-agent that referenced this pull request Jun 13, 2026
…+ log full traceback

_run_async() bridges sync tool handlers to async code. When the handler
is invoked from inside a running event loop (gateway / nested async),
it spawns a worker thread and blocks on future.result(timeout=300).

Before this change, a coroutine that ran past 300s leaked its worker
thread:

  - future.cancel() is a no-op on a running ThreadPoolExecutor future
    (cancel only works on not-yet-started work).
  - pool.shutdown(wait=False, cancel_futures=True) let the caller
    proceed but the worker kept running the coroutine until it
    returned on its own.

Every tool timeout leaked one thread. In long-lived gateway / RL
sessions this is cumulative.

The fix replaces bare asyncio.run() with a worker wrapper that
creates its own event loop. On timeout, _run_async schedules
task.cancel() on that loop via call_soon_threadsafe, then shuts the
pool down with wait=False so the caller returns immediately. The
coroutine observes CancelledError at its next await and the worker
thread exits cleanly.

Also switches logger.error() to logger.exception() in the top-level
handle_function_call() except block so tool failures produce full
stack traces in errors.log instead of just the message.

Related: NousResearch#17420 (contributor flagged the leak; the original fix used
pool.shutdown(wait=True) which would have converted the leak into a
hang — caller blocks forever on the same stuck coroutine). Credit
for identifying the leak goes to the contributor.

Co-authored-by: 0z! <162235745+0z1-ghb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants