Skip to content

fix(model-tools): reuse persistent async bridge loop#16573

Open
chezzdev wants to merge 6 commits into
NousResearch:mainfrom
chezzdev:codex/fix-model-tools-async-bridge
Open

fix(model-tools): reuse persistent async bridge loop#16573
chezzdev wants to merge 6 commits into
NousResearch:mainfrom
chezzdev:codex/fix-model-tools-async-bridge

Conversation

@chezzdev

@chezzdev chezzdev commented Apr 27, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes the model_tools._run_async() running-loop branch so gateway/async callers reuse a persistent bridge event loop instead of creating and closing a fresh asyncio.run() loop per call.

The old running-loop branch created a disposable worker thread and event loop for every async-context tool call. Cached async clients such as AsyncOpenAI/httpx can outlive those short-lived loops, which creates stale-loop cleanup hazards and descriptor churn in long-lived gateway processes.

Fixes #16570.

Architectural note

This PR keeps the architectural direction aligned with the rest of model_tools._run_async(): main-thread and worker-thread paths already use persistent loops, and this change gives running-loop callers the same stable-loop behavior via asyncio.run_coroutine_threadsafe().

It also preserves the important timeout-cancellation behavior added by #17428. That PR fixed the "timed-out coroutine keeps running in a worker thread" bug, but kept the per-call worker-loop bridge. This PR keeps cancellation semantics while removing the per-call loop churn that #16570 reports.

e7feb84 independently attempted the same dedicated bridge-loop shape, which is useful signal that the architecture matches the issue. This implementation hardens that idea with locking, startup failure cleanup, explicit CLI/gateway shutdown, nested-call protection, health-checked bridge retirement, cancellation-safe one-off fallback, retired-loop closure, and regression coverage.

Related work checked

Changes Made

  • model_tools.py: add a dedicated persistent async bridge loop for callers already inside a running event loop.
  • model_tools.py: add startup failure handling so a dead bridge thread is not published as usable state.
  • model_tools.py: avoid self-deadlock when _run_async() is called from code already executing on the bridge loop by using a cancellation-safe one-off worker loop.
  • model_tools.py: on timeout, cancel the scheduled bridge future and retire the shared bridge only if a health probe shows the bridge loop is stuck.
  • model_tools.py: close retired bridge loops after their bridge thread exits, and keep normal shutdown from double-closing thread-owned loops.
  • cli.py and gateway/run.py: call shutdown_async_bridge_loop() from existing cleanup paths after cached auxiliary clients are closed.
  • Tests: cover bridge reuse, startup failure cleanup, timeout cancellation, nested bridge-loop calls, healthy in-flight work during another timeout, retired-loop closure, and CLI/gateway shutdown integration.

How to Test

uv run --frozen --extra dev python -m pytest -o addopts= --ignore=tests/integration --ignore=tests/e2e -m "not integration" tests/test_model_tools_async_bridge.py tests/test_model_tools.py tests/cli/test_session_boundary_hooks.py tests/gateway/test_gateway_shutdown.py -q
58 passed, 1 warning in 22.47s
python3 -m py_compile model_tools.py gateway/run.py tests/test_model_tools_async_bridge.py tests/gateway/test_gateway_shutdown.py tests/cli/test_session_boundary_hooks.py cli.py
exit 0; emits the existing cli.py SyntaxWarning about `return` in `finally`
git diff --check origin/main...HEAD
exit 0

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS, Python 3.13.12

Documentation & Housekeeping

  • I've updated relevant documentation -- N/A
  • I've updated cli-config.yaml.example if I added/changed config keys -- N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows -- N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide
  • I've updated tool descriptions/schemas if I changed tool behavior -- N/A

@chezzdev chezzdev force-pushed the codex/fix-model-tools-async-bridge branch from 382152c to ea06a63 Compare April 27, 2026 13:42
@chezzdev chezzdev marked this pull request as ready for review April 27, 2026 13:51
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery comp/cli CLI entry point, hermes_cli/, setup wizard labels Apr 27, 2026
@chezzdev chezzdev force-pushed the codex/fix-model-tools-async-bridge branch from c26200d to d5f448c Compare April 30, 2026 20:55
@chezzdev chezzdev force-pushed the codex/fix-model-tools-async-bridge branch from d5f448c to 41e0889 Compare June 5, 2026 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: model_tools async bridge recreates loops in running-loop contexts

2 participants