Skip to content

fix(control-ui): preserve Stop after reconnect#76111

Merged
BunsDev merged 4 commits intomainfrom
meow/fix-control-stop-reconnect
May 2, 2026
Merged

fix(control-ui): preserve Stop after reconnect#76111
BunsDev merged 4 commits intomainfrom
meow/fix-control-stop-reconnect

Conversation

@BunsDev
Copy link
Copy Markdown
Member

@BunsDev BunsDev commented May 2, 2026

Summary

  • add an authenticated Gateway WebSocket protocol ping loop so long-running dashboard sessions do not rely only on application-level tick frames
  • expose hasActiveRun on sessions.list and sessions.changed from the Gateway's active chat abort controller registry
  • keep Control UI Stop visible after reconnect/reload when the server still has an active run for the session, and queue/send session-scoped chat.abort when the local runId was lost

Closes #70991.
Related: #56668, #74783, #70230.

Root cause

The backend already supports chat.abort with { sessionKey } and no runId, but the Control UI only rendered Stop from local chatRunId. A reconnect/reload clears that local value, so the user loses the Stop affordance even when the Gateway still has an abortable active run. The gateway WebSocket path also still lacked protocol-level ping frames, leaving long tool-call windows dependent on JSON tick delivery.

User-visible behavior

  • Dashboard WebSocket connections now send protocol ping frames every 25s after a client is authenticated.
  • Stop remains available when sessions.list / sessions.changed reports an active run for the current session.
  • If the browser is disconnected and only session-level active-run state is known, clicking Stop queues a session-scoped abort and sends chat.abort { sessionKey } after reconnect.

Security review

  • New permissions/capabilities: No.
  • Secrets/tokens handling changed: No.
  • New network surface: No new endpoint; only WebSocket control-frame pings on existing authenticated gateway sockets.
  • Command/tool execution surface changed: No.
  • Data access scope changed: No.

The ping loop skips unauthenticated pre-handshake sockets and is cleared during the existing connection teardown path.

Validation

  • pnpm test ui/src/ui/app-chat.test.ts ui/src/ui/app-gateway.node.test.ts src/gateway/server.sessions.list-changed.test.ts src/gateway/server/ws-connection.test.ts
  • pnpm test src/gateway/server.sessions.list-changed.test.ts
  • pnpm test src/gateway/server/ws-connection.test.ts
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/gateway/server-methods/sessions.ts src/gateway/session-utils.types.ts src/gateway/server/ws-connection.ts src/gateway/server/ws-connection.test.ts src/gateway/server.sessions.list-changed.test.ts ui/src/ui/app-chat.ts ui/src/ui/app-chat.test.ts ui/src/ui/app-gateway.ts ui/src/ui/app-gateway.node.test.ts ui/src/ui/app-render.ts ui/src/ui/controllers/sessions.ts ui/src/ui/types.ts
  • OPENCLAW_LOCAL_CHECK=1 OPENCLAW_LOCAL_CHECK_MODE=throttled pnpm check:changed

Notes

I attempted the repo-preferred Blacksmith Testbox changed gate, but the Testbox sync deleted tracked sparse/large files and removed node_modules, causing unrelated all-lane failures. I stopped that Testbox and used the local changed gate with dependencies installed.

@openclaw-barnacle openclaw-barnacle Bot added app: web-ui App: web-ui gateway Gateway runtime size: M maintainer Maintainer-authored PR labels May 2, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 2, 2026

Codex review: needs changes before merge.

Summary
The PR adds Gateway WebSocket pings, exposes session active-run state, and uses that state so Control UI can keep Stop and session-scoped aborts available after reconnect or reload.

Reproducibility: yes. The PR diff itself gives a high-confidence reproduction: invoke sessions.list after this patch and it reaches chatHandlers.chatAbortControllers, which is not the active-run registry and will fail before returning rows.

Next step before merge
A narrow automated PR-branch repair can replace the broken sessions.list active-run lookup and lock it with the intended regression test.

Security
Cleared: No concrete security or supply-chain concern was found; the diff changes gateway/UI runtime state, tests, and changelog only, with no new dependencies, workflow permissions, secrets handling, package metadata, or endpoint.

Review findings

  • [P2] Use the request context for active runs — src/gateway/server-methods/sessions.ts:681
Review details

Best possible solution:

Keep the PR direction, but repair the active-run lookup to use the request context registry and stored session keys, then land with focused gateway/UI tests and the existing changelog entry.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR diff itself gives a high-confidence reproduction: invoke sessions.list after this patch and it reaches chatHandlers.chatAbortControllers, which is not the active-run registry and will fail before returning rows.

Is this the best way to solve the issue?

No. The backend/UI model is the right narrow approach, but this implementation uses the wrong object for active abort state; a context-scoped lookup is the safer maintainable fix.

Full review comments:

  • [P2] Use the request context for active runs — src/gateway/server-methods/sessions.ts:681
    This new branch dereferences chatHandlers.chatAbortControllers, but chatHandlers is the RPC handler map and has no controller registry. As soon as sessions.list runs, this fails before returning any rows, so Control UI session refreshes break; iterate context.chatAbortControllers and use the entry's sessionKey instead.
    Confidence: 0.96

Overall correctness: patch is incorrect
Overall confidence: 0.94

Acceptance criteria:

  • pnpm test src/gateway/server.sessions.list-changed.test.ts
  • pnpm test ui/src/ui/app-chat.test.ts ui/src/ui/app-gateway.node.test.ts src/gateway/server/ws-connection.test.ts
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/gateway/server-methods/sessions.ts src/gateway/server.sessions.list-changed.test.ts src/gateway/server/ws-connection.ts src/gateway/server/ws-connection.test.ts ui/src/ui/app-chat.ts ui/src/ui/app-chat.test.ts ui/src/ui/app-gateway.ts ui/src/ui/app-gateway.node.test.ts ui/src/ui/app-render.ts ui/src/ui/controllers/sessions.ts ui/src/ui/types.ts

What I checked:

  • Blocking PR diff bug: The PR's new active-run loop reads chatHandlers.chatAbortControllers.size and passes a controller entry to resolveSessionKeyForRun, but the abort-controller registry is request-context state, not a property on the handler map. (src/gateway/server-methods/sessions.ts:681, bb97cde5f368)
  • Current context contract: Current main defines GatewayRequestContext.chatAbortControllers as the active-run registry, while GatewayRequestHandlers is only Record<string, GatewayRequestHandler>. (src/gateway/server-methods/shared-types.ts:71, 15cb06430e8e)
  • Existing session-scoped abort support: Current main already supports chat.abort with sessionKey and no runId, returning the aborted run ids from the active registry. (src/gateway/server-methods/chat.ts:1763, 15cb06430e8e)
  • Reconnect path loads chat session state: Current main's Chat tab refresh calls refreshChat, and refreshChat loads sessions.list, so the broken PR sessions.list branch is on the intended reconnect/reload recovery path. (ui/src/ui/app-settings.ts:400, 15cb06430e8e)
  • Protected handling signal: The provided GitHub context shows author association MEMBER and the protected maintainer label, so this PR should not be cleanup-closed by the sweep even aside from the review finding. (bb97cde5f368)

Likely related people:

  • steipete: Recent current-main commits heavily touched the sessions and gateway paths that own sessions.list, exact session lookup, and WebSocket connection handling. (role: recent maintainer; confidence: high; commits: 0ea28ddb165d, b151694e0048, 8d58ad4c15cd; files: src/gateway/server-methods/sessions.ts, src/gateway/server/ws-connection.ts, ui/src/ui/app-gateway.ts)
  • bitloi: Merged PR fix(gateway): allow chat.abort to stop agent RPC runs #71214 connected agent RPC runs to the shared chat abort-controller registry that this PR now tries to expose to session rows. (role: adjacent abort-controller contributor; confidence: medium; commits: 8cae2ed645ed; files: src/gateway/server-methods/agent.ts, src/gateway/server-methods/chat.ts)
  • BunsDev: Beyond authoring this PR, recent merged current-main history shows Control UI chat, slash-command, and refresh work in the affected UI surface. (role: recent Control UI maintainer; confidence: medium; commits: 64fcc8a1aadb, 2bade2703e35, 37aebf612b83; files: ui/src/ui/app-gateway.ts, ui/src/ui/app-chat.ts)

Remaining risk / open question:

  • The active-run session-row plumbing is unverified until sessions.list uses context.chatAbortControllers and the focused regression test passes on the corrected code.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 15cb06430e8e.

@BunsDev BunsDev self-assigned this May 2, 2026
@BunsDev BunsDev requested a review from Copilot May 2, 2026 14:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Control UI’s lost “Stop” affordance after reconnect/reload by recovering abortability from gateway session state, and improves long-running dashboard connection stability by adding WebSocket protocol-level keepalive pings.

Changes:

  • Add a server-side WebSocket ping() keepalive loop (25s) for authenticated gateway WS clients.
  • Expose hasActiveRun on sessions.list and sessions.changed, derived from the gateway’s active chat abort-controller registry.
  • Update Control UI abort logic to keep Stop available after reconnect and to queue/send session-scoped chat.abort { sessionKey } when runId is no longer known.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ui/src/ui/types.ts Adds hasActiveRun?: boolean to GatewaySessionRow for UI consumption.
ui/src/ui/controllers/sessions.ts Ensures hasActiveRun is requested/handled in session event rows.
ui/src/ui/app-render.ts Uses hasAbortableSessionRun(state) instead of Boolean(chatRunId) to keep Stop visible after reconnect.
ui/src/ui/app-gateway.ts Allows queued aborts without runId and sends session-scoped chat.abort after reconnect.
ui/src/ui/app-gateway.node.test.ts Adds coverage for queued session-scoped abort sending after reconnect.
ui/src/ui/app-chat.ts Introduces hasAbortableSessionRun() and queues aborts while disconnected even if runId was lost.
ui/src/ui/app-chat.test.ts Adds test for disconnected session-scoped abort queuing after recovering active-run state.
src/gateway/session-utils.types.ts Adds hasActiveRun?: boolean to gateway session row typing.
src/gateway/server/ws-connection.ts Adds 25s protocol ping loop and clears it during connection teardown.
src/gateway/server/ws-connection.test.ts Adds timer-based test verifying pings stop after socket close; restores real timers after tests.
src/gateway/server.sessions.list-changed.test.ts Adds test asserting sessions.list marks sessions with active abortable runs.
src/gateway/server-methods/sessions.ts Populates hasActiveRun in sessions.list responses and sessions.changed events.
CHANGELOG.md Documents the user-visible fix for #70991.

Comment thread src/gateway/server-methods/sessions.ts
Comment thread src/gateway/server/ws-connection.ts Outdated
@BunsDev
Copy link
Copy Markdown
Member Author

BunsDev commented May 2, 2026

@copilot apply changes based on the comments in this thread

Copy link
Copy Markdown
Contributor

Copilot AI commented May 2, 2026

@copilot apply changes based on the comments in this thread

Both review comments are now addressed:

  1. sessions.ts O(n×m) lookup (already fixed in a prior commit, bb97cde5): sessions.list now precomputes a Set<string> of active session keys once before mapping, so the per-session lookup is O(1) instead of iterating over all abort controllers each time.

  2. Ping timer deferred to post-auth (fixed in 6f76ed3d): The setInterval for protocol pings is no longer created immediately on connection. It is now started inside the setClient closure — only after the client is authenticated — removing the pre-auth no-op guard and the associated per-connection timer overhead during the handshake window. The clearInterval in close() is guarded for the case where the timer was never started.

Copilot finished work on behalf of BunsDev May 2, 2026 15:22
@BunsDev BunsDev force-pushed the meow/fix-control-stop-reconnect branch from 6f76ed3 to 1f4c872 Compare May 2, 2026 15:37
@BunsDev BunsDev merged commit 4532e5d into main May 2, 2026
102 checks passed
@BunsDev BunsDev deleted the meow/fix-control-stop-reconnect branch May 2, 2026 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: web-ui App: web-ui gateway Gateway runtime maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: connection drops between gateway and OpenCL Control + unable to stop running task with /stop and button disapear

3 participants