
fix(chrome-relay): auto-reconnect, MV3 persistence, and keepalive #15817

Closed
derrickburns wants to merge 6 commits into openclaw:main from derrickburns:fix/chrome-relay-reconnect

@derrickburns derrickburns commented Feb 13, 2026

Problem

The Chrome extension relay drops connection after page navigation, sleep/wake cycles, or MV3 service worker restarts — and never recovers. Users must manually re-click the toolbar icon after every navigation. Related: #1160

Root Cause

Five failure modes identified via code audit of background.js and the relay server:

  1. No reconnection logic — WebSocket drops are permanent (extension clears state and stops)
  2. MV3 state amnesia — service worker restarts wipe all in-memory Maps
  3. No keepalive — Chrome kills idle service worker after ~30s
  4. Navigation detaches debugger — chrome.debugger auto-detaches on page navigation with no re-attach
  5. No pending request cleanup — dropped messages leak memory

Fix

Drop-in replacement for background.js + one manifest permission (alarms):

  • Auto-reconnect — Exponential backoff (1s→30s cap, 10 attempts) on WS drop
  • State persistence — chrome.storage.local saves attached tabs, sessions — survives worker restarts
  • Keepalive alarm — chrome.alarms every 24s (under MV3 30s limit) checks WS health
  • Navigation re-attach — On target_closed detach, waits 500ms then re-attaches if tab exists
  • Per-tab locks — Prevents double-attach race from rapid toolbar clicks
  • Tab lifecycle cleanup — onRemoved/onUpdated listeners clean state on close/navigate
  • Request timeouts — 30s timeout on pending requests prevents memory leaks
  • Child session cleanup — Proper detach events for child sessions when parent disconnects
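
The auto-reconnect item above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `backoffDelay`, `createReconnector`, and the three constants are hypothetical names, and only the numbers (1s base, 30s cap, 10 attempts) come from the description. Resetting the counter after a successful connect is an assumption of this sketch.

```javascript
// Hypothetical constants mirroring the numbers in the description above.
const BASE_DELAY_MS = 1000
const MAX_DELAY_MS = 30000
const MAX_ATTEMPTS = 10

// Delay before the given retry (attempt 0 is the first retry),
// or null once the attempt budget is exhausted.
function backoffDelay(attempt) {
  if (attempt >= MAX_ATTEMPTS) return null
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}

// Builds a scheduler around a caller-supplied async connect() function.
// setTimer is injectable so the scheduler can be exercised without waiting.
function createReconnector(connect, setTimer = setTimeout) {
  let attempts = 0
  let timer = null

  function schedule() {
    if (timer !== null) return false // a retry is already pending
    const delay = backoffDelay(attempts)
    if (delay === null) return false // budget exhausted; wait for a manual connect
    attempts += 1
    timer = setTimer(async () => {
      timer = null
      try {
        await connect()
        attempts = 0 // assumption: a successful connection resets the budget
      } catch {
        schedule() // try again with the next, longer delay
      }
    }, delay)
    return true
  }

  return { schedule, reset: () => { attempts = 0 } }
}
```

With these constants the delays double from 1s up to 16s and stay at the 30s cap from the sixth retry onward.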

Testing

Tested on macOS (Chrome Profile 11) against Ancestry.com:

  • ✅ Snapshot through relay
  • ✅ Navigate to different page + snapshot (previously broke here)
  • ✅ Extension reload + reconnect

Changes

  • assets/chrome-extension/background.js — +280 lines (reconnect, persistence, keepalive, lifecycle)
  • assets/chrome-extension/manifest.json — added alarms permission

No changes to relay server protocol, options page, or CDP command handling.

Greptile Overview

Greptile Summary

This PR rewrites the Chrome extension service worker (assets/chrome-extension/background.js) to make the relay connection resilient: it adds auto-reconnect with backoff, per-tab operation locks, request timeouts for pending relay RPCs, and state persistence via chrome.storage.local. It also introduces a keepalive alarm (chrome.alarms) to keep the MV3 service worker active, and tab lifecycle handling (onRemoved/onUpdated) plus navigation-triggered re-attach logic for debugger detaches.

assets/chrome-extension/manifest.json is updated to request the new alarms permission required for the keepalive.

Confidence Score: 3/5

  • This PR is close to mergeable but has reconnection/persistence logic gaps that can prevent recovery in common scenarios.
  • Core reconnect/keepalive/persistence changes look coherent, but the keepalive path can fail to schedule reconnect when connection attempts throw early, and restored state currently marks tabs as attached without re-attaching the debugger, which can leave the extension in an inconsistent state after MV3 restarts.
  • assets/chrome-extension/background.js (keepalive/reconnect failure paths, restoreState/reattach behavior)

Last reviewed commit: 1891255



Relates to #15099

The Chrome extension relay loses connection after navigation, sleep/wake,
or service worker restarts and never recovers. This is because:

1. No reconnection logic exists — WebSocket drops are permanent
2. MV3 service worker restarts wipe all in-memory state
3. No keepalive prevents Chrome from killing the idle worker
4. chrome.debugger detaches on navigation with no re-attach

This patch adds:
- Auto-reconnect with exponential backoff (1s-30s cap, 10 attempts)
- State persistence via chrome.storage.local (survives worker restarts)
- chrome.alarms keepalive (24s interval, under MV3 30s limit)
- Re-attach on debugger detach from navigation/reload
- Per-tab operation locks (prevents double-attach race)
- Tab lifecycle listeners (cleanup on close/navigate)
- Pending request timeouts (30s, prevents memory leaks)
- Child session cleanup on parent detach

Also adds 'alarms' permission to manifest.json.
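
The pending request timeout listed above could look something like this. It is a sketch under assumptions: `createPendingRequests` and its methods are hypothetical names, and only the 30s default comes from the commit message.

```javascript
// Hypothetical helper: tracks in-flight relay requests by id and rejects
// any request that is not answered within timeoutMs.
function createPendingRequests(timeoutMs = 30000) {
  const pending = new Map()

  // Register a request and get a promise for its eventual result.
  function add(id) {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        pending.delete(id) // drop the entry so it cannot leak
        reject(new Error(`Request ${id} timed out after ${timeoutMs}ms`))
      }, timeoutMs)
      pending.set(id, { resolve, reject, timer })
    })
  }

  // Resolve a request when its response arrives; returns false if unknown.
  function settle(id, result) {
    const entry = pending.get(id)
    if (!entry) return false
    clearTimeout(entry.timer)
    pending.delete(id)
    entry.resolve(result)
    return true
  }

  // Reject everything at once, for example when the socket closes.
  function rejectAll(reason) {
    for (const entry of pending.values()) {
      clearTimeout(entry.timer)
      entry.reject(new Error(reason))
    }
    pending.clear()
  }

  return { add, settle, rejectAll, size: () => pending.size }
}
```

A late response for a request that already timed out finds no entry, so `settle` returns `false` and the response is ignored.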

Tested: snapshot, navigate+snapshot (previously broke here), and
extension reload — all working on macOS with Ancestry.com.

Relates to openclaw#1160

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


Comment on lines +687 to +696
chrome.alarms.onAlarm.addListener(async (alarm) => {
  if (alarm.name === 'relay-keepalive') {
    // Check WebSocket health and reconnect if needed
    if (!relayWs || relayWs.readyState !== WebSocket.OPEN) {
      if (!relayConnectPromise && !reconnectTimer) {
        console.log('Keepalive: WebSocket unhealthy, triggering reconnect')
        await ensureRelayConnection().catch(() => {
          // If connection fails, scheduleReconnect will be called by onRelayClosed
        })
      }

Keepalive never reconnects

In the chrome.alarms keepalive handler, ensureRelayConnection() failures are swallowed, but ensureRelayConnection() can throw (e.g., preflight fetch fails or WS connect rejects) without triggering onRelayClosed(), so scheduleReconnect() never runs. This means after an MV3 restart or when the relay is down, the keepalive tick can repeatedly do nothing and the extension may never auto-reconnect. Consider calling scheduleReconnect() from the keepalive catch/failure path (or from ensureRelayConnection failure paths) so reconnect is guaranteed to be scheduled.
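
One way to apply the first suggestion is sketched below. It is illustrative only: `createKeepaliveHandler`, `isSocketOpen`, and `isConnecting` are hypothetical names, while `ensureRelayConnection` and `scheduleReconnect` stand in for the functions named in the comment and are passed in as parameters so the sketch stays self-contained.

```javascript
// Sketch only: the four callbacks stand in for the extension's real
// connection state and functions.
function createKeepaliveHandler({
  isSocketOpen,
  isConnecting,
  ensureRelayConnection,
  scheduleReconnect,
}) {
  return async function onAlarm(alarm) {
    if (alarm.name !== 'relay-keepalive') return
    if (isSocketOpen() || isConnecting()) return
    try {
      await ensureRelayConnection()
    } catch {
      // The attempt may have failed before a WebSocket existed, so no close
      // event will fire. Schedule the retry here instead of relying on it.
      scheduleReconnect()
    }
  }
}

// In the service worker this would be registered roughly as:
//   chrome.alarms.onAlarm.addListener(createKeepaliveHandler({ ... }))
```

The point of the design is that every failure path ends in a scheduled retry, whether or not a socket was ever created.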


Comment on lines +77 to +115
async function restoreState() {
  try {
    const { extensionState } = await chrome.storage.local.get(['extensionState'])
    if (extensionState) {
      // Restore nextSession counter to avoid ID conflicts
      if (typeof extensionState.nextSession === 'number') {
        nextSession = extensionState.nextSession
      }

      // Validate and restore tabs - some may have closed during service worker downtime
      if (Array.isArray(extensionState.attachedTabs)) {
        for (const [tabId, tabState] of extensionState.attachedTabs) {
          try {
            const tab = await chrome.tabs.get(tabId)
            if (tab) {
              tabs.set(tabId, tabState)
              if (tabState.sessionId) {
                tabBySession.set(tabState.sessionId, tabId)
              }
            }
          } catch {
            // Tab no longer exists, skip it
          }
        }
      }

      // Restore child session mappings for still-valid tabs
      if (Array.isArray(extensionState.childSessions)) {
        for (const [sessionId, tabId] of extensionState.childSessions) {
          if (tabs.has(tabId)) {
            childSessionToTab.set(sessionId, tabId)
          }
        }
      }
    }
  } catch (err) {
    console.warn('Failed to restore state:', err)
  }
}

Restored state not attached

restoreState() repopulates tabs/tabBySession/childSessionToTab, but it never re-attaches chrome.debugger for those tabs or refreshes badge/title. After an MV3 service worker restart, this leaves the extension believing tabs are attached/connected while Chrome has no debugger session, which can break command routing and makes the UI state misleading until the user manually toggles. A fix is to either (a) re-attach debuggers (and send attached events) for restored tabs, or (b) treat restored tabs as disconnected and require a fresh attach.
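
The two options can be combined: restore tabs as disconnected first, then re-attach them once the relay is back. The sketch below assumes a hypothetical `attach(tabId)` function that re-attaches the debugger and resolves to a new session id (in the extension it would presumably wrap `chrome.debugger.attach`). The `state` field and both function names are illustrative, not taken from the PR.

```javascript
// Load persisted [tabId, tabState] pairs without trusting their old
// attachment status.
function restoreAsDisconnected(savedTabs) {
  const tabs = new Map()
  for (const [tabId, tabState] of savedTabs) {
    // Do not reuse the persisted session id until the debugger is attached again.
    tabs.set(tabId, { ...tabState, state: 'disconnected', sessionId: null })
  }
  return tabs
}

// After the relay connection is up, try to re-attach every restored tab.
async function reattachRestoredTabs(tabs, attach) {
  for (const [tabId, tabState] of [...tabs]) {
    try {
      const sessionId = await attach(tabId)
      tabs.set(tabId, { ...tabState, state: 'connected', sessionId })
    } catch {
      tabs.delete(tabId) // tab closed or attach refused; forget it
    }
  }
  return tabs
}
```

Until `reattachRestoredTabs` succeeds for a tab, the badge and command routing can treat it as disconnected, which keeps the UI consistent with what Chrome actually has attached.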



codexGW commented Feb 14, 2026

Hey — just a heads up, I posted a detailed root cause analysis and working fixes for these exact issues on #15099 before this PR was opened. Auto-reconnect with exponential backoff, debugger re-attach on navigation, MV3 state persistence — all covered there with code snippets.

Would've been nice to get a mention or a "relates to #15099." Not a big deal, but credit where it's due.


derrickburns commented Feb 14, 2026

To be honest, I never looked at GitHub at all. Not even a little. I just stumbled over the problem and it blocked me. So I asked openclaw to fix itself using Claude and Codex. It did, and after testing I told it to create an Issue and a PR. Then when the issue was rejected as a duplicate, I told it to attach this to the original.

No disrespect was intended! Apologies!

…tate re-attaches debuggers

Fixes two issues found in code review:

1. Keepalive handler: ensureRelayConnection() can throw without
   triggering onRelayClosed (e.g. preflight fetch fails before WS
   creation), leaving no reconnect scheduled. Now explicitly calls
   scheduleReconnect() from the catch path.

2. restoreState(): After MV3 service worker restart, tab maps were
   repopulated but chrome.debugger was never re-attached, leaving
   the extension in a stale state. Now marks restored tabs as
   disconnected, then re-attaches debuggers after relay connects.

Relates to openclaw#15099
…th 409

After a Chrome restart, the old extension WebSocket may not have
fired its close event yet. The gateway was rejecting the new
connection with 409 'Extension already connected', requiring a
gateway restart to clear the stale state.

Now: close the stale connection and accept the new one seamlessly.
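
A rough sketch of that replacement behaviour follows. The names are hypothetical and this is not the gateway's code: `socket` is assumed to be any object with `close()` and an `on('close', ...)` registration, as in common Node WebSocket libraries. The identity check in the close handler is an assumption of this sketch, included so that the replaced socket's late close event cannot clear its successor.

```javascript
// Hypothetical holder for the single extension connection on the relay side.
function createExtensionSlot() {
  let current = null

  function accept(socket) {
    const stale = current
    current = socket
    socket.on('close', () => {
      // Only clear the slot if the socket that closed is still the active one.
      if (current === socket) current = null
    })
    if (stale) stale.close() // replace instead of rejecting the newcomer with 409
    return socket
  }

  return { accept, get: () => current }
}
```
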
…nect+reconnect

When navigating between pages, Chrome detaches the debugger with
reason 'target_closed'. Previously, this triggered a full detachTab()
which sent Target.detachedFromTarget events to the relay, breaking
active CDP sessions. The 500ms re-attach then created a new session.

Now: on navigation detach, skip the relay disconnect notification,
show a 'connecting' badge, and re-attach after 500ms. The gateway
sees a seamless session replacement instead of a disruptive
disconnect+reconnect cycle. Full cleanup only happens for
non-navigation detaches (user action, crash, etc.).

If the user clicks the toolbar button to detach during the 500ms
navigation re-attach grace period, the timeout would re-attach
the tab anyway. Now checks tabs.has(tabId) before re-attaching —
if the tab was manually detached, the timeout is a no-op.
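
The grace-period check described above might be sketched like this. `scheduleNavigationReattach` and `attach` are hypothetical names, `tabs` is assumed to be the Map of tracked tabs, and only the 500ms delay comes from the commit message.

```javascript
const REATTACH_DELAY_MS = 500

// Resolves to true if the tab was re-attached, false if it was skipped.
// delayMs is overridable so the sketch can be exercised quickly.
function scheduleNavigationReattach(tabs, tabId, attach, delayMs = REATTACH_DELAY_MS) {
  return new Promise((resolve) => {
    setTimeout(async () => {
      // The user may have detached manually during the grace period.
      if (!tabs.has(tabId)) {
        resolve(false)
        return
      }
      try {
        await attach(tabId)
        resolve(true)
      } catch {
        tabs.delete(tabId) // re-attach failed; stop tracking the tab
        resolve(false)
      }
    }, delayMs)
  })
}
```
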
… state corruption

- extension-relay.ts: Guard close handler against stale WS nulling new
  connection. When a replaced WS fires its close event, it was clearing
  extensionWs, connectedTargets, and disconnecting all CDP clients even
  though a new connection was already active.

- background.js: Reset reconnectAttempts on any successful connection,
  not just auto-reconnect. Prevents exhausted counter from blocking
  future auto-reconnects after manual recovery.

- background.js: Add tabOperationLocks to reattachKnownTabs to prevent
  races with concurrent user toolbar clicks during reconnection.
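
A per-tab operation lock of the kind mentioned above can be built by chaining promises per tab id. The sketch below is an assumption about the general technique, not the PR's `tabOperationLocks` implementation; `createTabLocks` and `withTabLock` are hypothetical names.

```javascript
// Operations on the same tab run one at a time, in arrival order.
// Operations on different tabs do not block each other.
function createTabLocks() {
  const tails = new Map() // tabId -> promise for the last queued operation

  function withTabLock(tabId, operation) {
    const previous = tails.get(tabId) ?? Promise.resolve()
    const run = previous.then(() => operation())
    // The stored tail never rejects, so one failure does not poison the queue.
    const tail = run.catch(() => {})
    tails.set(tabId, tail)
    tail.then(() => {
      // Clean up once the queue for this tab has drained.
      if (tails.get(tabId) === tail) tails.delete(tabId)
    })
    return run
  }

  return { withTabLock, size: () => tails.size }
}
```

Wrapping both the toolbar-click handler and the reconnect-time re-attach in `withTabLock` for the same tab id would make them run in sequence rather than interleave.
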
@derrickburns

Testing Results

Torture Test Suite

Two test suites were created and run against a managed (isolated) Chrome instance (openclaw profile) to avoid interfering with user sessions.

1. Aggressive Stress Test

Rapid-fire operations with minimal delays:

  • Sequential navigation (5 pages, 500ms intervals)
  • Navigate + immediate snapshot race (5x)
  • Rapid tab open (5 tabs, 300ms intervals)
  • Snapshot across all tabs
  • Rapid tab close
  • Error/edge-case URLs (404, 500, about:blank)
  • Machine-gun navigation (10 concurrent navigates, 200ms intervals)
  • Final health check

Result: 8/8 passed

2. Human-Paced Endurance Test (30 minutes)

Simulates realistic browsing with natural delays (2-6s reading pauses, 15-30s idle periods):

  • Multi-page research sessions (navigate → read → follow link)
  • New tab side-quests (open → browse → close)
  • Quick back-and-forth navigation
  • Error page recovery (404 → navigate away)
  • Idle periods (15-30s, simulating user away)
  • Health checks each iteration

Result: 226/228 passed over 30 minutes (99.1%)

The 2 failures were both the managed Chrome process crashing (not relay bugs) — the relay detected and recovered automatically each time. Zero relay/extension failures across the entire run.

Code Review

Two independent code reviews were run (different model from the author):

Security/Safety Review found:

  • 🔴 No auth on /extension WebSocket (any local process can connect) — noted as follow-up
  • 🔴 No CDP method allowlist — architectural, by design
  • ✅ Loopback-only binding, auth on /cdp, proper cleanup

Logic/Correctness Review found:

  • 🔴 Stale WS close handler nulling new connection → fixed in commit 6
  • 🔴 Navigation re-attach racing with manual detach → fixed in commit 5
  • 🔴 Missing tabOperationLocks in reattach paths → fixed in commit 6
  • 🟡 reconnectAttempts not resetting on manual connect → fixed in commit 6

All critical findings from code review were addressed before testing.


iMikio commented Feb 20, 2026

Running OpenClaw on WSL2 and hit this exact issue — after every gateway restart, the extension drops and requires a manual re-click to re-attach.

We patched background.js locally with auto-reconnect + exponential backoff (similar approach to what's described here), and it works great. The flow is:

  1. Relay disconnects → save attached tab IDs
  2. Retry with exponential backoff (2s, 4s, 6s... up to 10 attempts)
  3. On failure → notify via gateway's /hooks/agent API, with chrome.storage.local as fallback for when the gateway itself is down

Would love to see this merged officially so we don't have to maintain a local patch. The current UX (manual re-click after every restart) is a real pain point for anyone running a persistent setup. 🙏

@steipete

Thanks for the detailed work here.

Closing as superseded by newer main implementation in the same area. Current main already includes the reconnect/persistence/race-hardening set (for example reconnect race hardening, stale socket replacement guards, tab/session state handling, keepalive, and related relay tests), but on top of the newer relay/auth architecture.

Merging this branch now would effectively roll back newer changes and reintroduce divergence.

@steipete steipete closed this Feb 26, 2026