fix(coding-agent): abort in-flight LLM call on AgentSession.dispose() #5029

Closed
TerminallyChilI wants to merge 1 commit into earendil-works:main from TerminallyChilI:fix/dispose-aborts-in-flight-call

@TerminallyChilI

AgentSession.dispose() removes the host's event listener via _disconnectFromAgent() but does not abort any in-flight LLM HTTP request. Callers that dispose a session mid-stream (every switchSession / newSession / fork / clone via teardownCurrent) leave the previous LLM call running in the background:

  • Orphaned TCP socket to the LLM provider — stays ESTABLISHED until the provider responds, often minutes for slow-stream APIs.
  • Provider billing increments for the orphan call whose output is discarded.
  • External observers see the process at 0% CPU with one ESTABLISHED HTTPS socket and no events flowing, indistinguishable from a wedged process.

Adds a synchronous this.agent.abort() at the top of dispose(), wrapped in try/catch so dispose remains exception-safe. AbortController.abort() is synchronous: the signal trips immediately even though the fetch rejection lands later. Keeps dispose() synchronous so existing callers don't need to be updated.
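
A minimal sketch of the change described above, not the actual patch. `SketchAgent` and `SketchSession` are hypothetical stand-ins for the real classes, and the `disposed` flag only stands in for `_disconnectFromAgent()`:

```typescript
// Hypothetical stand-in for the agent: owns the controller for the LLM call.
class SketchAgent {
  controller = new AbortController();

  abort(): void {
    this.controller.abort();
  }
}

// Hypothetical stand-in for AgentSession, reduced to the dispose() path.
class SketchSession {
  agent: SketchAgent;
  disposed = false;

  constructor(agent: SketchAgent) {
    this.agent = agent;
  }

  dispose(): void {
    // Abort first, so the in-flight request is cancelled before the
    // listener is removed.
    try {
      this.agent.abort();
    } catch {
      // Swallowed on purpose: dispose() must not throw.
    }
    this.disposed = true; // stands in for _disconnectFromAgent()
  }
}

const sketchAgent = new SketchAgent();
const sketchSignal = sketchAgent.controller.signal;
const sketchSession = new SketchSession(sketchAgent);

sketchSession.dispose();

// No await needed: the signal is already tripped when dispose() returns.
const abortedAfterDispose = sketchSignal.aborted;
```

Because the signal flips inside `abort()` itself, a caller can rely on `signal.aborted` as soon as `dispose()` returns, even though any pending `fetch` promise rejects on a later tick.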

Discovered by PIRA when a session change against an already-loaded session triggered the orphan path, surfacing as 15-20 minute apparent hangs after any host-driven session change while a stream was in flight. PIRA shipped a host-side workaround (skip the redundant switch_session RPC when the bridge already has the session loaded) but the underlying dispose-without-abort behaviour bites any host that legitimately tears down a session mid-stream.

Recent in-the-wild repro: captured today (2026-05-26) — bridge sat wedged for 25 min after switch_session mid-turn. Trace tagged switch_session-teardown + subscriberless + llm-http-hang by our analyzer. Confirms the failure mode is legitimate.

Tests:

  • agent-session-dispose-aborts.test.ts (4 cases): signal trips on dispose mid-stream, agent.abort called exactly once, exception-safe when agent.abort throws, idempotent on re-dispose.
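
The four cases could be exercised roughly as follows. This is a framework-free sketch, not the contents of agent-session-dispose-aborts.test.ts: `FakeAgent`, `disposeSketch`, and `check` are hypothetical helpers, and the real test presumably uses the project's test runner and the real `AgentSession`.

```typescript
// Hypothetical fake agent that counts abort() calls and can be told to throw.
class FakeAgent {
  abortCalls = 0;
  shouldThrow = false;
  controller = new AbortController();

  abort(): void {
    this.abortCalls += 1;
    if (this.shouldThrow) {
      throw new Error("abort failed");
    }
    this.controller.abort();
  }
}

// Assumed shape of the dispose() logic under test.
function disposeSketch(agent: FakeAgent): void {
  try {
    agent.abort();
  } catch {
    // Keep dispose exception-safe.
  }
}

function check(condition: boolean, message: string): void {
  if (!condition) {
    throw new Error(message);
  }
}

// Cases 1 and 2: the signal trips, and abort() is called exactly once.
const healthyAgent = new FakeAgent();
disposeSketch(healthyAgent);
check(healthyAgent.controller.signal.aborted, "signal should trip on dispose");
check(healthyAgent.abortCalls === 1, "abort should be called exactly once");

// Case 3: dispose does not propagate an exception thrown by abort().
const throwingAgent = new FakeAgent();
throwingAgent.shouldThrow = true;
disposeSketch(throwingAgent);

// Case 4: disposing again is harmless.
disposeSketch(healthyAgent);
check(healthyAgent.controller.signal.aborted, "signal should stay aborted");
```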

Open question for review: dispose() only aborts the main agent. The session has several other AbortControllers (compaction, autoCompaction, branchSummary, retry, bash) that aren't tripped here. Recommend a follow-up PR (or expanding this one) to abort those too in dispose(), same orphan class, same fix pattern.

AgentSession.dispose() removes the host's event listener via
_disconnectFromAgent() but does not abort any in-flight LLM HTTP
request. Callers that dispose a session mid-stream (every
switchSession / newSession / fork / clone via teardownCurrent)
leave the previous LLM call running in the background:

  * Orphaned TCP socket to the LLM provider — stays ESTABLISHED
    until the provider responds, often minutes for slow-stream APIs.
  * Provider billing increments for the orphan call whose output
    is discarded.
  * External observers see the process at 0% CPU with one
    ESTABLISHED HTTPS socket and no events flowing,
    indistinguishable from a wedged process.

Adds a synchronous this.agent.abort() at the top of dispose(),
wrapped in try/catch so dispose remains exception-safe.
AbortController.abort() is synchronous: the signal trips
immediately even though the fetch rejection lands later. Keeps
dispose() synchronous so existing callers don't need to be updated.

Discovered by PIRA (PI Remote Access) when tab-reopen triggered
switchSession against an already-loaded session, surfacing as a
'pi appears to hang for 15-20 minutes after browser tab reopen'
bug class. PIRA shipped a host-side workaround (skip the
redundant switch_session RPC when the bridge already has the
session loaded) but the underlying dispose-without-abort
behaviour bites any host that legitimately tears down a session
mid-stream.

Tests:
- agent-session-dispose-aborts.test.ts (4 cases): signal trips
  on dispose mid-stream, agent.abort called exactly once,
  exception-safe when agent.abort throws, idempotent on
  re-dispose.

Open question for review: dispose() only aborts the main agent.
The session also has 5 other AbortControllers (compaction,
autoCompaction, branchSummary, retry, bash) that aren't tripped
here. Recommend a follow-up PR (or expanding this one) to abort
those too in dispose() — same orphan class, same fix shape.
@badlogic

Collaborator

Closing because the PR author is not listed in .github/APPROVED_CONTRIBUTORS.

@badlogic badlogic closed this May 28, 2026
@badlogic badlogic reopened this May 28, 2026
@badlogic badlogic added the inprogress Issue is being worked on label May 28, 2026
@badlogic

Collaborator

Implemented the fix locally with the narrower lifecycle change discussed here: AgentSession.dispose() now synchronously aborts all session-owned in-flight work before invalidating/disconnecting the session.

Covered controllers:

  • main agent run
  • retry delay
  • manual/auto compaction
  • branch summary
  • bash execution

Kept dispose() synchronous and exception-safe rather than calling AgentSession.abort(), because abort() waits for idle and does not cover all of these controllers.
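
One way to picture aborting every session-owned controller synchronously is the sketch below. It is an illustration under assumptions, not the code that was implemented: `MultiControllerSession`, its `controllers` map, and the kind names are hypothetical, and the real session may hold each controller in its own field.

```typescript
// Hypothetical session that tracks one AbortController per kind of work.
class MultiControllerSession {
  controllers: Map<string, AbortController> = new Map();

  // Register a new piece of in-flight work and hand back its signal.
  start(kind: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(kind, controller);
    return controller.signal;
  }

  dispose(): void {
    // Abort each controller in its own try/catch so that one failure
    // cannot prevent the remaining controllers from being aborted.
    this.controllers.forEach((controller) => {
      try {
        controller.abort();
      } catch {
        // Swallowed on purpose: dispose() stays exception-safe.
      }
    });
    this.controllers.clear();
  }
}

const multiSession = new MultiControllerSession();
const kinds = ["agent", "retry", "compaction", "branchSummary", "bash"];
const multiSignals = kinds.map((kind) => multiSession.start(kind));

multiSession.dispose();

// Every signal is tripped by the time dispose() returns; nothing is awaited.
const allAborted = multiSignals.every((signal) => signal.aborted);
```

Nothing in `dispose()` waits for the aborted work to settle, which is the difference from an `abort()` that waits for idle.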

Validation: npm run check passes.

This comment is AI-generated by /wr

@badlogic badlogic closed this May 28, 2026