fix: add atexit cleanup and timeout for event loop in model_tools#8073
fix: add atexit cleanup and timeout for event loop in model_tools#8073tomqiaozc wants to merge 1 commit into
Conversation
…usResearch#8043) The persistent _tool_loop was never closed on interpreter exit, and _run_async() had no timeout, allowing leaked tasks. Add atexit handler to close the loop and wrap future.result() with a 300s timeout that cancels the future on expiry. Closes NousResearch#8043 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
iRonin
left a comment
There was a problem hiding this comment.
Review: mostly good but 4 concerns
1. loop.stop() via call_soon_threadsafe is ineffective at exit
_cleanup_tool_loop calls loop.call_soon_threadsafe(loop.stop) but nothing ever run()s the loop again to process the stop signal. At interpreter shutdown the loop gets garbage collected anyway, so this is effectively a no-op. If the intent is to release resources, loop.close() directly would be more honest — or just acknowledge this is a signal, not actual cleanup.
2. future.cancel() does NOT cancel the asyncio task
The comment in the PR acknowledges this ("we can't reach that task directly") but the docstring and error message imply the coroutine gets cancelled. future.cancel() only cancels the concurrent.futures.Future wrapper. The thread running asyncio.run(coro) continues executing until the coroutine finishes or the process exits. The 300s timeout only prevents the caller from blocking — it does not free the underlying task's resources. Consider making the error message more honest: e.g. "coroutine did not complete within 300s (background thread continues running)".
3. Missing _worker_thread_local cleanup
Issue #8043 explicitly calls out _worker_thread_local loops as a second leak vector. This PR only addresses _tool_loop. When the thread pool shrinks or threads are recycled, the thread-local loops and their bound httpx connections are never closed. An atexit handler won't help here since thread-local storage is per-thread. Consider a weak-value dict mapping threads to their loops, with cleanup on thread exit.
4. _cleanup_tool_loop fires even when no loop was created
The atexit.register() at module import time means the handler fires even if model_tools is imported just to read schemas (no loop ever created). The if _tool_loop is not None guard prevents a crash, but it's still noise. Consider registering lazily — only call atexit.register(_cleanup_tool_loop) the first time _get_tool_loop() actually creates a loop.
- Add _cleanup_all_loops() with loop.close() (not just loop.stop()) - Track worker thread loops in _worker_loops dict for atexit cleanup - Lazy atexit registration — only fires when a loop is actually created - Honest timeout error message: (background thread continues running) Improves on upstream PR NousResearch#8073 by: - Using loop.close() instead of call_soon_threadsafe(stop) which is ineffective at exit since nothing runs the loop again - Tracking worker thread loops (missing from the original PR) - Registering atexit handler lazily (not at import time)
|
Thanks for the contribution @tomqiaozc — identifying the This is an automated hermes-sweeper review. The underlying bugs have since been fixed on
Current Closing as implemented on main. |
Summary
_tool_loopwas never closed on interpreter exit, leaking resources_run_async()had no timeout — leaked tasks could hang indefinitelyatexithandler to close the loop and 300s timeout withfuture.cancel()on expiryTest plan
Closes #8043
🤖 Generated with Claude Code