Skip to content

Inflight cards occasionally fail to auto-close — derive running from an authoritative inflight set #557

@esengine

Description

@esengine

Summary

Occasionally a card's spinner keeps spinning after the underlying
work has finished — the card stays at status: "running" instead of
flipping to done / failed. Observed in dogfooding, no specific
repro yet; filing this so the fix lands proactively rather than
waiting for a user-visible regression.

Affected cards: anything with a running status (tool calls,
subagents, streaming, reasoning) — see cards.ts:54/66/73/142.

Root cause hypothesis

Status today is set imperatively from paired events: an event
flips status to running, a later event flips it to done /
failed. Every code path that exits a tool dispatch must remember to
emit the closing event. We have a lot of exit paths:

  • normal return
  • thrown error caught upstream
  • parent abort propagating into a running child
  • storm-breaker truncation
  • subagent forced-summary path
  • network drop mid-stream

Miss one, and the spinner stays running forever. That's the shape of
this bug.

Proposed fix — derive, don't accumulate

Make running a derived view of the loop's authoritative
in-flight set, not an event-driven flag.

  1. Add a Set<string> in loop.ts keyed by call id (tool calls,
    subagent run ids). Insert at dispatch entry. Remove in a
    finally block — the language guarantees finally runs on every
    exit path, including thrown / aborted / returned. No way to leak.

  2. Expose the set (or a subscribe API) to the UI layer.

  3. UI cards drop their status: "running" writes for inflight-ness.
    They render running iff their id is in the set, done /
    failed otherwise (the existing terminal events still carry the
    success/error distinction; only the running flag is derived).

The whole class of "we forgot to emit the end event on path X"
disappears: there's no end event to forget. finally is the contract.

Why not the alternatives

  • Watchdog timeout ("if no progress in 30s, fade") — band-aid,
    hides real bugs, can misclassify long-running web fetches. Useful
    as a secondary safety net (display "silent — Ns ago"), not a
    primary fix.
  • Force-close at turn boundary — wrong for our model: subagents
    and background jobs legitimately span turns. Would mass-kill cards
    that are actually still running.

Scope

  • src/loop.ts — own the inflight set, add insert/remove around
    dispatch + subagent spawn. finally blocks are non-negotiable.
  • src/cli/ui/state/cards.ts — switch running to a derived field.
  • src/cli/ui/cards/*.tsx — update the cards that read status
    for the spinner indicator.
  • Tests — replay-style: feed a sequence of events that historically
    left a stuck spinner and assert the card terminates.

Out of scope

  • Watchdog fallback. Land separately if even the derived approach
    shows residual leaks (it shouldn't).
  • Reworking how terminal status (done vs failed) is decided —
    only the running-or-not flag is being moved to derivation.

Why now

Self-reported, no external user complaint yet. Filing as preventive —
this is the kind of state-machine drift that gets visibly worse as
exit paths multiply (we just added a new one in #549's scrollRows
fix, and every storm-breaker / abort tweak adds another). Easier to
flip to derived state once than to keep bolting closing events onto
every new exit path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingrenderingTerminal rendering / flicker / repaint issues

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions