fix(chrome-relay): auto-reconnect, MV3 persistence, and keepalive#15817
fix(chrome-relay): auto-reconnect, MV3 persistence, and keepalive#15817derrickburns wants to merge 6 commits intoopenclaw:mainfrom
Conversation
The Chrome extension relay loses connection after navigation, sleep/wake, or service worker restarts and never recovers. This is because: 1. No reconnection logic exists — WebSocket drops are permanent 2. MV3 service worker restarts wipe all in-memory state 3. No keepalive prevents Chrome from killing the idle worker 4. chrome.debugger detaches on navigation with no re-attach This patch adds: - Auto-reconnect with exponential backoff (1s-30s cap, 10 attempts) - State persistence via chrome.storage.local (survives worker restarts) - chrome.alarms keepalive (24s interval, under MV3 30s limit) - Re-attach on debugger detach from navigation/reload - Per-tab operation locks (prevents double-attach race) - Tab lifecycle listeners (cleanup on close/navigate) - Pending request timeouts (30s, prevents memory leaks) - Child session cleanup on parent detach Also adds 'alarms' permission to manifest.json. Tested: snapshot, navigate+snapshot (previously broke here), and extension reload — all working on macOS with Ancestry.com. Relates to openclaw#1160
| chrome.alarms.onAlarm.addListener(async (alarm) => { | ||
| if (alarm.name === 'relay-keepalive') { | ||
| // Check WebSocket health and reconnect if needed | ||
| if (!relayWs || relayWs.readyState !== WebSocket.OPEN) { | ||
| if (!relayConnectPromise && !reconnectTimer) { | ||
| console.log('Keepalive: WebSocket unhealthy, triggering reconnect') | ||
| await ensureRelayConnection().catch(() => { | ||
| // If connection fails, scheduleReconnect will be called by onRelayClosed | ||
| }) | ||
| } |
There was a problem hiding this comment.
Keepalive never reconnects
In the chrome.alarms keepalive handler, ensureRelayConnection() failures are swallowed, but ensureRelayConnection() can throw (e.g., preflight fetch fails or WS connect rejects) without triggering onRelayClosed(), so scheduleReconnect() never runs. This means after an MV3 restart or when the relay is down, the keepalive tick can repeatedly do nothing and the extension may never auto-reconnect. Consider calling scheduleReconnect() from the keepalive catch/failure path (or from ensureRelayConnection failure paths) so reconnect is guaranteed to be scheduled.
Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 687:696
Comment:
**Keepalive never reconnects**
In the `chrome.alarms` keepalive handler, `ensureRelayConnection()` failures are swallowed, but `ensureRelayConnection()` can throw (e.g., preflight `fetch` fails or WS connect rejects) without triggering `onRelayClosed()`, so `scheduleReconnect()` never runs. This means after an MV3 restart or when the relay is down, the keepalive tick can repeatedly do nothing and the extension may never auto-reconnect. Consider calling `scheduleReconnect()` from the keepalive catch/failure path (or from `ensureRelayConnection` failure paths) so reconnect is guaranteed to be scheduled.
How can I resolve this? If you propose a fix, please make it concise.| async function restoreState() { | ||
| try { | ||
| const { extensionState } = await chrome.storage.local.get(['extensionState']) | ||
| if (extensionState) { | ||
| // Restore nextSession counter to avoid ID conflicts | ||
| if (typeof extensionState.nextSession === 'number') { | ||
| nextSession = extensionState.nextSession | ||
| } | ||
|
|
||
| // Validate and restore tabs - some may have closed during service worker downtime | ||
| if (Array.isArray(extensionState.attachedTabs)) { | ||
| for (const [tabId, tabState] of extensionState.attachedTabs) { | ||
| try { | ||
| const tab = await chrome.tabs.get(tabId) | ||
| if (tab) { | ||
| tabs.set(tabId, tabState) | ||
| if (tabState.sessionId) { | ||
| tabBySession.set(tabState.sessionId, tabId) | ||
| } | ||
| } | ||
| } catch { | ||
| // Tab no longer exists, skip it | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Restore child session mappings for still-valid tabs | ||
| if (Array.isArray(extensionState.childSessions)) { | ||
| for (const [sessionId, tabId] of extensionState.childSessions) { | ||
| if (tabs.has(tabId)) { | ||
| childSessionToTab.set(sessionId, tabId) | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } catch (err) { | ||
| console.warn('Failed to restore state:', err) | ||
| } | ||
| } |
There was a problem hiding this comment.
Restored state not attached
restoreState() repopulates tabs/tabBySession/childSessionToTab, but it never re-attaches chrome.debugger for those tabs or refreshes badge/title. After an MV3 service worker restart, this leaves the extension believing tabs are attached/connected while Chrome has no debugger session, which can break command routing and makes the UI state misleading until the user manually toggles. A fix is to either (a) re-attach debuggers (and send attached events) for restored tabs, or (b) treat restored tabs as disconnected and require a fresh attach.
Prompt To Fix With AI
This is a comment left during a code review.
Path: assets/chrome-extension/background.js
Line: 77:115
Comment:
**Restored state not attached**
`restoreState()` repopulates `tabs`/`tabBySession`/`childSessionToTab`, but it never re-attaches `chrome.debugger` for those tabs or refreshes badge/title. After an MV3 service worker restart, this leaves the extension believing tabs are attached/connected while Chrome has no debugger session, which can break command routing and makes the UI state misleading until the user manually toggles. A fix is to either (a) re-attach debuggers (and send attached events) for restored tabs, or (b) treat restored tabs as disconnected and require a fresh attach.
How can I resolve this? If you propose a fix, please make it concise.|
Hey — just a heads up, I posted a detailed root cause analysis and working fixes for these exact issues on #15099 before this PR was opened. Auto-reconnect with exponential backoff, debugger re-attach on navigation, MV3 state persistence — all covered there with code snippets. Would've been nice to get a mention or a "relates to #15099." Not a big deal, but credit where it's due. |
|
To be honest, I never looked at GitHub at all. Not even a little. I just stumbled over the problem and it blocked me. So I asked openclaw to fix itself using Claude and Codex. It did and after testing I told it to create an No disrespect was intended! Apologies! |
…tate re-attaches debuggers Fixes two issues found in code review: 1. Keepalive handler: ensureRelayConnection() can throw without triggering onRelayClosed (e.g. preflight fetch fails before WS creation), leaving no reconnect scheduled. Now explicitly calls scheduleReconnect() from the catch path. 2. restoreState(): After MV3 service worker restart, tab maps were repopulated but chrome.debugger was never re-attached, leaving the extension in a stale state. Now marks restored tabs as disconnected, then re-attaches debuggers after relay connects. Relates to openclaw#15099
…th 409 After a Chrome restart, the old extension WebSocket may not have fired its close event yet. The gateway was rejecting the new connection with 409 'Extension already connected', requiring a gateway restart to clear the stale state. Now: close the stale connection and accept the new one seamlessly.
…nect+reconnect When navigating between pages, Chrome detaches the debugger with reason 'target_closed'. Previously, this triggered a full detachTab() which sent Target.detachedFromTarget events to the relay, breaking active CDP sessions. The 500ms re-attach then created a new session. Now: on navigation detach, skip the relay disconnect notification, show a 'connecting' badge, and re-attach after 500ms. The gateway sees a seamless session replacement instead of a disruptive disconnect+reconnect cycle. Full cleanup only happens for non-navigation detaches (user action, crash, etc.).
If the user clicks the toolbar button to detach during the 500ms navigation re-attach grace period, the timeout would re-attach the tab anyway. Now checks tabs.has(tabId) before re-attaching — if the tab was manually detached, the timeout is a no-op.
… state corruption - extension-relay.ts: Guard close handler against stale WS nulling new connection. When a replaced WS fires its close event, it was clearing extensionWs, connectedTargets, and disconnecting all CDP clients even though a new connection was already active. - background.js: Reset reconnectAttempts on any successful connection, not just auto-reconnect. Prevents exhausted counter from blocking future auto-reconnects after manual recovery. - background.js: Add tabOperationLocks to reattachKnownTabs to prevent races with concurrent user toolbar clicks during reconnection.
Testing ResultsTorture Test SuiteTwo test suites were created and run against a managed (isolated) Chrome instance ( 1. Aggressive Stress TestRapid-fire operations with minimal delays:
Result: 8/8 passed 2. Human-Paced Endurance Test (30 minutes)Simulates realistic browsing with natural delays (2-6s reading pauses, 15-30s idle periods):
Result: 226/228 passed over 30 minutes (99.1%) The 2 failures were both the managed Chrome process crashing (not relay bugs) — the relay detected and recovered automatically each time. Zero relay/extension failures across the entire run. Code ReviewTwo independent code reviews were run (different model from the author): Security/Safety Review found:
Logic/Correctness Review found:
All critical findings from code review were addressed before testing. |
bfc1ccb to
f92900f
Compare
|
Running OpenClaw on WSL2 and hit this exact issue — after every gateway restart, the extension drops and requires a manual re-click to re-attach. We patched
Would love to see this merged officially so we don't have to maintain a local patch. The current UX (manual re-click after every restart) is a real pain point for anyone running a persistent setup. 🙏 |
|
Thanks for the detailed work here. Closing as superseded by newer Merging this branch now would effectively roll back newer changes and reintroduce divergence. |
Problem
The Chrome extension relay drops connection after page navigation, sleep/wake cycles, or MV3 service worker restarts — and never recovers. Users must manually re-click the toolbar icon after every navigation. Related: #1160
Root Cause
Five failure modes identified via code audit of background.js and the relay server:
Fix
Drop-in replacement for background.js + one manifest permission (alarms):
Testing
Tested on macOS (Chrome Profile 11) against Ancestry.com:
Changes
No changes to relay server protocol, options page, or CDP command handling.
Greptile Overview
Greptile Summary
This PR rewrites the Chrome extension service worker (
assets/chrome-extension/background.js) to make the relay connection resilient: it adds auto-reconnect with backoff, per-tab operation locks, request timeouts for pending relay RPCs, and state persistence viachrome.storage.local. It also introduces a keepalive alarm (chrome.alarms) to keep the MV3 service worker active, and tab lifecycle handling (onRemoved/onUpdated) plus navigation-triggered re-attach logic for debugger detaches.assets/chrome-extension/manifest.jsonis updated to request the newalarmspermission required for the keepalive.Confidence Score: 3/5
Last reviewed commit: 1891255
(2/5) Greptile learns from your feedback when you react with thumbs up/down!
Relates to #15099