fix: add atexit cleanup and timeout for event loop in model_tools by tomqiaozc · Pull Request #8073 · NousResearch/hermes-agent

tomqiaozc · 2026-04-12T01:41:00Z

Summary

The persistent _tool_loop was never closed on interpreter exit, leaking resources
_run_async() had no timeout — leaked tasks could hang indefinitely
Add atexit handler to close the loop and 300s timeout with future.cancel() on expiry

Test plan

5 new tests: 3 for atexit cleanup, 2 for timeout behavior
All 12 existing model_tools tests pass
All 15 existing async bridge tests pass

🤖 Generated with Claude Code

…usResearch#8043) The persistent _tool_loop was never closed on interpreter exit, and _run_async() had no timeout, allowing leaked tasks. Add atexit handler to close the loop and wrap future.result() with a 300s timeout that cancels the future on expiry. Closes NousResearch#8043 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

iRonin

Review: mostly good but 4 concerns

1. `loop.stop()` via `call_soon_threadsafe` is ineffective at exit

_cleanup_tool_loop calls loop.call_soon_threadsafe(loop.stop) but nothing ever run()s the loop again to process the stop signal. At interpreter shutdown the loop gets garbage collected anyway, so this is effectively a no-op. If the intent is to release resources, loop.close() directly would be more honest — or just acknowledge this is a signal, not actual cleanup.

2. `future.cancel()` does NOT cancel the asyncio task

The comment in the PR acknowledges this ("we can't reach that task directly") but the docstring and error message imply the coroutine gets cancelled. future.cancel() only cancels the concurrent.futures.Future wrapper. The thread running asyncio.run(coro) continues executing until the coroutine finishes or the process exits. The 300s timeout only prevents the caller from blocking — it does not free the underlying task's resources. Consider making the error message more honest: e.g. "coroutine did not complete within 300s (background thread continues running)".

3. Missing `_worker_thread_local` cleanup

Issue #8043 explicitly calls out _worker_thread_local loops as a second leak vector. This PR only addresses _tool_loop. When the thread pool shrinks or threads are recycled, the thread-local loops and their bound httpx connections are never closed. An atexit handler won't help here since thread-local storage is per-thread. Consider a weak-value dict mapping threads to their loops, with cleanup on thread exit.

4. `_cleanup_tool_loop` fires even when no loop was created

The atexit.register() at module import time means the handler fires even if model_tools is imported just to read schemas (no loop ever created). The if _tool_loop is not None guard prevents a crash, but it's still noise. Consider registering lazily — only call atexit.register(_cleanup_tool_loop) the first time _get_tool_loop() actually creates a loop.

- Add _cleanup_all_loops() with loop.close() (not just loop.stop()) - Track worker thread loops in _worker_loops dict for atexit cleanup - Lazy atexit registration — only fires when a loop is actually created - Honest timeout error message: (background thread continues running) Improves on upstream PR NousResearch#8073 by: - Using loop.close() instead of call_soon_threadsafe(stop) which is ineffective at exit since nothing runs the loop again - Tracking worker thread loops (missing from the original PR) - Registering atexit handler lazily (not at import time)

teknium1 · 2026-06-10T08:34:09Z

Thanks for the contribution @tomqiaozc — identifying the _tool_loop resource leak and missing timeout in _run_async() was genuinely useful, and the reviewer discussion surfaced important correctness issues.

This is an automated hermes-sweeper review.

The underlying bugs have since been fixed on main with a more correct implementation:

76c454914 (v2026.4.23) — added non-blocking executor shutdown on async timeout, fixing the thread-leak path you identified.
b0435cc16 (v2026.4.30) — rewrote _run_async() to use a worker-owned event loop, cancel tasks via call_soon_threadsafe(t.cancel) on timeout, and shut down the pool with wait=False. The commit message explicitly credits your PR for identifying the leak. This directly addresses all four concerns raised by @iRonin: future.cancel() being a no-op on running futures, loop.stop() ineffectiveness, worker-thread-local loop cleanup (handled in the finally block), and eager atexit registration (not needed with structural cleanup).

Current main has no atexit handler in model_tools.py — lifecycle is managed structurally inside _run_in_worker's finally block instead, which is the approach @iRonin recommended.

Closing as implemented on main.

iRonin reviewed Apr 14, 2026

View reviewed changes

iRonin mentioned this pull request Apr 14, 2026

fix: close leaked cron and gateway resources #8331

Open

3 tasks

morningto mentioned this pull request Apr 15, 2026

Bug: auxiliary_client._client_cache accumulates AsyncOpenAI clients indefinitely, causing fd exhaustion in long-running gateway #10200

Closed

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/tools Tool registry, model_tools, toolsets labels Apr 28, 2026

teknium1 closed this Jun 10, 2026

teknium1 added the sweeper:implemented-on-main Sweeper: behavior already present on current main label Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add atexit cleanup and timeout for event loop in model_tools#8073

fix: add atexit cleanup and timeout for event loop in model_tools#8073
tomqiaozc wants to merge 1 commit into
NousResearch:mainfrom
tomqiaozc:fix/8043-event-loop-cleanup

tomqiaozc commented Apr 12, 2026

Uh oh!

iRonin left a comment

Uh oh!

teknium1 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tomqiaozc commented Apr 12, 2026

Summary

Test plan

Uh oh!

iRonin left a comment

Choose a reason for hiding this comment

Review: mostly good but 4 concerns

1. loop.stop() via call_soon_threadsafe is ineffective at exit

2. future.cancel() does NOT cancel the asyncio task

3. Missing _worker_thread_local cleanup

4. _cleanup_tool_loop fires even when no loop was created

Uh oh!

teknium1 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. `loop.stop()` via `call_soon_threadsafe` is ineffective at exit

2. `future.cancel()` does NOT cancel the asyncio task

3. Missing `_worker_thread_local` cleanup

4. `_cleanup_tool_loop` fires even when no loop was created