Bug: "RuntimeError: Event loop is closed" crashes CLI mid-session with "Press ENTER to continue..." #3436

@123mikeyd

Summary

During normal CLI usage, stale AsyncOpenAI / AsyncAnthropic client objects that get garbage-collected mid-session trigger an unhandled RuntimeError: Event loop is closed exception. This activates prompt_toolkit's exception handler, which prints the traceback and halts the session with "Press ENTER to continue..." — forcing the user to intervene manually.

This can happen repeatedly in a single session if multiple stale clients accumulate (e.g., during heavy tool use), making the CLI unusable until the user presses ENTER each time.

Root Cause

The crash chain (7 components):

  1. Tool execution creates async clients on the tool loop. _run_async() in model_tools.py runs coroutines on a persistent _tool_loop via run_until_complete(). Some tools (e.g., mixture_of_agents via openrouter_client.py, or auxiliary_client.py for summarization/compression) create AsyncOpenAI clients during execution. These clients internally create httpx.AsyncClient instances bound to _tool_loop.

  2. Clients escape the cache. openrouter_client.py stores a global _client outside auxiliary_client._client_cache. Any AsyncOpenAI client created by resolve_provider_client() called directly (not through _get_cached_client()) is also untracked. These clients are invisible to shutdown_cached_clients().

  3. Stale clients get garbage-collected. When a client reference is dropped (module reload, replacement, scope exit), Python's GC eventually collects it. The AsyncHttpxClientWrapper.__del__ method (in both openai/_base_client.py:1429 and anthropic/_base_client.py:1537) fires.

  4. __del__ schedules aclose() on the wrong loop. The __del__ method calls:

    asyncio.get_running_loop().create_task(self.aclose())

    If GC runs while prompt_toolkit's event loop is active (which it almost always is during a CLI session), get_running_loop() returns prompt_toolkit's loop, not the _tool_loop the client was created on.

  5. aclose() cascades into dead transport. The aclose() task runs on prompt_toolkit's loop and walks the chain: httpx.AsyncClient.aclose() → httpcore.AsyncConnectionPool.aclose() → AsyncHTTP11Connection.aclose() → anyio.TLSStream.aclose() → asyncio._SelectorSocketTransport.close(). At this final step, the transport calls self._loop.call_soon(self._call_connection_lost, None), but self._loop is the loop the connection was originally opened on, which may already be closed or otherwise stale.

  6. call_soon hits the closed/dead loop. base_events.py:795 calls self._check_closed() which raises RuntimeError('Event loop is closed').

  7. prompt_toolkit catches and halts. Since prompt_toolkit installed _handle_exception as the event loop's exception handler (line 830 of application.py), asyncio calls it with the unhandled exception. This prints the traceback and awaits _do_wait_for_enter("Press ENTER to continue...") — blocking the entire CLI until the user presses ENTER.
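
Steps 4 to 7 can be reproduced without httpx or prompt_toolkit. The following is a minimal sketch, not the project's code: stale_loop stands in for the loop a stale client's transport is still bound to, fake_aclose for the real aclose() chain, and the lambda handler for prompt_toolkit's _handle_exception. All three names are hypothetical.

```python
import asyncio
import gc


def reproduce_handler_call():
    """Return the contexts passed to the running loop's exception handler."""
    captured = []

    # Stand-in for the loop a stale client's transport is still bound to.
    stale_loop = asyncio.new_event_loop()
    stale_loop.close()

    async def fake_aclose():
        # Mirrors the transport's self._loop.call_soon(...) on the stale loop.
        stale_loop.call_soon(lambda: None)

    async def main():
        loop = asyncio.get_running_loop()
        # Stand-in for prompt_toolkit's _handle_exception.
        loop.set_exception_handler(lambda _loop, context: captured.append(context))
        task = loop.create_task(fake_aclose())  # what __del__ schedules
        await asyncio.sleep(0)  # let the task run and fail
        del task                # nobody awaits or inspects the task
        gc.collect()
        await asyncio.sleep(0)

    asyncio.run(main())
    return captured


contexts = reproduce_handler_call()
for context in contexts:
    print(context["message"], "->", repr(context["exception"]))
```

Because nothing ever awaits the task, asyncio reports its exception to the loop's exception handler once the task object is finalized, which is the point where prompt_toolkit would print the traceback and wait for ENTER.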

Full traceback

Unhandled exception in event loop:
  File "...site-packages/httpx/_client.py", line 1985, in aclose
    await self._transport.aclose()
  File "...site-packages/httpx/_transports/default.py", line 406, in aclose
    await self._pool.aclose()
  File "...site-packages/httpcore/_async/connection_pool.py", line 353, in aclose
    await self._close_connections(closing_connections)
  File "...site-packages/httpcore/_async/connection_pool.py", line 345, in _close_connections
    await connection.aclose()
  File "...site-packages/httpcore/_async/connection.py", line 173, in aclose
    await self._connection.aclose()
  File "...site-packages/httpcore/_async/http11.py", line 258, in aclose
    await self._network_stream.aclose()
  File "...site-packages/httpcore/_backends/anyio.py", line 53, in aclose
    await self._stream.aclose()
  File "...site-packages/anyio/streams/tls.py", line 241, in aclose
    await self.transport_stream.aclose()
  File "...site-packages/anyio/_backends/_asyncio.py", line 1329, in aclose
    self._transport.close()
  File ".../asyncio/selector_events.py", line 1211, in close
    super().close()
  File ".../asyncio/selector_events.py", line 875, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File ".../asyncio/base_events.py", line 795, in call_soon
    self._check_closed()
  File ".../asyncio/base_events.py", line 541, in _check_closed
    raise RuntimeError('Event loop is closed')
Exception Event loop is closed
Press ENTER to continue...

Versions

  • hermes-agent 0.4.0
  • httpx 0.28.1
  • httpcore 1.0.9
  • openai (has AsyncHttpxClientWrapper.__del__)
  • anthropic (has AsyncHttpxClientWrapper.__del__)
  • prompt_toolkit 3.0.52
  • Python 3.12

Existing Mitigations (and why they're insufficient)

The codebase already has two defenses:

  1. _force_close_async_httpx() + shutdown_cached_clients() in auxiliary_client.py — marks cached AsyncOpenAI clients as CLOSED before __del__ can fire. But this only covers clients in _client_cache and only runs at shutdown. Clients outside the cache (e.g., openrouter_client._client) and mid-session GC events are unprotected.

  2. Persistent _tool_loop in model_tools.py — keeps the event loop alive so cached clients don't reference a dead loop. But the __del__ scheduling path runs on prompt_toolkit's loop, not the tool loop, so the transport's internal self._loop reference still points to a potentially stale loop.

Proposed Fix (Option B — root cause)

1. Register ALL async clients for cleanup

Create a central registry in auxiliary_client.py that tracks every AsyncOpenAI/AsyncAnthropic client created anywhere, not just cached ones:

# auxiliary_client.py — add near _client_cache

# Module-level imports (skip any that auxiliary_client.py already has)
import threading
import weakref
from typing import Any

# Weak references to every async client created, cached or not
_all_async_clients: list = []
_all_async_clients_lock = threading.Lock()

def _track_async_client(client: Any) -> None:
    """Register an async client for cleanup on shutdown."""
    with _all_async_clients_lock:
        _all_async_clients.append(weakref.ref(client))

def _force_close_all_async_clients() -> None:
    """Mark ALL tracked async clients as closed to prevent __del__ crashes."""
    with _all_async_clients_lock:
        for ref in _all_async_clients:
            client = ref()
            if client is not None:
                _force_close_async_httpx(client)
        _all_async_clients.clear()

Update shutdown_cached_clients() to also call _force_close_all_async_clients().

2. Track clients at creation points

In auxiliary_client.py resolve_provider_client(), after creating an async client:

if async_mode and client is not None:
    _track_async_client(client)

In openrouter_client.py:

def get_async_client():
    global _client
    if _client is None:
        from agent.auxiliary_client import resolve_provider_client, _track_async_client
        client, _model = resolve_provider_client("openrouter", async_mode=True)
        if client is None:
            raise ValueError("OPENROUTER_API_KEY environment variable not set")
        _track_async_client(client)
        _client = client
    return _client

3. Install a custom exception handler on prompt_toolkit's loop (defense-in-depth)

Even with perfect client tracking, third-party code could create untracked async clients. Install a wrapper around prompt_toolkit's exception handler that suppresses RuntimeError: Event loop is closed during aclose():

# cli.py — during startup, after prompt_toolkit app is created

def _make_safe_exception_handler(original_handler):
    """Wrap prompt_toolkit's exception handler to suppress aclose() Event loop crashes."""
    def safe_handler(loop, context):
        exception = context.get("exception")
        if isinstance(exception, RuntimeError) and "Event loop is closed" in str(exception):
            # Suppress — this is a harmless GC cleanup failure from a stale
            # httpx/openai/anthropic async client. The connections will be
            # dropped by the OS. Logging at debug level for diagnostics.
            import logging
            logging.debug(
                "Suppressed 'Event loop is closed' from async client GC cleanup: %s",
                context.get("message", ""),
            )
            return
        # All other exceptions: delegate to prompt_toolkit's handler
        if original_handler is not None:
            original_handler(loop, context)
        else:
            loop.default_exception_handler(context)
    return safe_handler

Install it right after the prompt_toolkit Application is created:

loop = asyncio.get_event_loop()
original = loop.get_exception_handler()
loop.set_exception_handler(_make_safe_exception_handler(original))
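
The handler above matches on the message alone, so it would also hide an "Event loop is closed" error coming from unrelated code. If that turns out to be too broad, one option is to suppress only errors whose traceback passes through a function named aclose. This is a sketch under that assumption, not part of the proposal; is_stale_aclose_error and _error_from_aclose are hypothetical names.

```python
import asyncio
import traceback


def is_stale_aclose_error(exc: object) -> bool:
    """True only for 'Event loop is closed' raised beneath a function named aclose."""
    if not isinstance(exc, RuntimeError) or "Event loop is closed" not in str(exc):
        return False
    frames = traceback.extract_tb(exc.__traceback__)
    return any(frame.name == "aclose" for frame in frames)


def _error_from_aclose() -> RuntimeError:
    """Build a sample error shaped like the traceback in this issue."""
    closed_loop = asyncio.new_event_loop()
    closed_loop.close()

    def aclose() -> None:
        # Same failing call as the transport: call_soon on a closed loop.
        closed_loop.call_soon(lambda: None)

    try:
        aclose()
    except RuntimeError as exc:
        return exc
    raise AssertionError("call_soon on a closed loop should have raised")


from_cleanup = is_stale_aclose_error(_error_from_aclose())
unrelated = is_stale_aclose_error(RuntimeError("Event loop is closed"))
print(from_cleanup, unrelated)
```

Inside safe_handler, the isinstance/message check could then be replaced by a call to this predicate on context.get("exception"), leaving every other exception to prompt_toolkit's handler as before.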

4. Neuter async clients at loop shutdown time

Update _run_cleanup() to neuter async clients before the event loop closes:

def _run_cleanup():
    global _cleanup_done
    if _cleanup_done:
        return
    _cleanup_done = True
    # ... existing cleanup ...
    
    # Close ALL async clients (cached + tracked) to prevent __del__ crashes
    try:
        from agent.auxiliary_client import shutdown_cached_clients, _force_close_all_async_clients
        shutdown_cached_clients()
        _force_close_all_async_clients()
    except Exception:
        pass

Impact

  • User-facing: CLI halts mid-session requiring manual ENTER press. Can happen repeatedly. Particularly triggered by heavy tool use (image generation, vision analysis, mixture-of-agents) that creates async HTTP connections.
  • Triggered by: Any operation that creates and later discards an AsyncOpenAI/AsyncAnthropic client while prompt_toolkit's event loop is running. Also triggered by httpx.AsyncClient instances whose underlying transports reference a closed/different event loop.
  • Frequency: Intermittent — depends on GC timing. More likely during sessions with many tool calls.

Reproduction

  1. Start CLI session
  2. Use tools heavily that trigger async client creation (image_generate, mixture_of_agents, vision_analyze in rapid succession)
  3. Wait — GC will eventually collect a stale client
  4. Observe "Unhandled exception in event loop" + "Press ENTER to continue..."

The fastest reproduction path is anything that causes a long-running HTTP connection to time out or be abandoned while the tool loop is busy — e.g., a curl upload to an unresponsive host via terminal_tool while httpx.AsyncClient instances exist from prior tool calls.
