fix: resolve four silent failure modes in WebSocket communication #154

Merged
Zhang-Henry merged 1 commit into OpenLAIR:main from liuyixin-louis:fix/p0-stability-silent-failures on Apr 9, 2026

@liuyixin-louis (Collaborator)

Summary

Fixes four P0 stability issues identified during an architecture brainstorming analysis:

  • Provider crash error reporting: .catch() blocks across all 6 providers (7 sites) now send an error message to the browser over the WebSocket instead of only calling console.error(). The frontend already handles claude-error, cursor-error, codex-error, gemini-error, openrouter-error, and localgpu-error; only the server-side writer.send() calls were missing. Previously, users saw an infinite loading spinner when a provider crashed mid-stream.

  • Session-busy notifications: when a concurrent request hits a session that is already active, the server now sends a session-busy message instead of silently dropping the request. A new frontend handler displays an inline notification (7 guard sites + 1 new switch case).

  • WebSocket exponential backoff: the fixed 3-second reconnect interval is replaced by exponential backoff (3s → 6s → 12s → 24s, capped at 30s). The delay resets after a successful connection, which prevents hammering the server during extended outages.

  • Database indexes: adds a last_activity index and a composite (project_name, last_activity) index to the session_metadata table. The hot query in getSessionsByProjects() sorts by last_activity DESC but previously had no supporting index, which forced a full table scan.
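A minimal sketch of the catch-block pattern from the first bullet. The Writer interface, the { type, error } payload shape, and the reportProviderError and runProvider helpers are hypothetical; only the <provider>-error message type names come from the description above. The send is wrapped in its own try/catch so that a socket which has already closed cannot turn error reporting into a second unhandled failure.

```typescript
// Hypothetical minimal shape of the per-connection writer; the real object
// in server/index.js may differ.
interface Writer {
  send(message: object): void;
}

// Provider names taken from the error types the frontend already handles.
type Provider = "claude" | "cursor" | "codex" | "gemini" | "openrouter" | "localgpu";

// Hypothetical payload shape: { type: "<provider>-error", error: "<text>" }.
interface ProviderErrorMessage {
  type: string;
  error: string;
}

// Log the failure on the server AND forward it to the browser, so the UI can
// replace its loading spinner with an error.
function reportProviderError(writer: Writer, provider: Provider, err: unknown): void {
  const text = err instanceof Error ? err.message : String(err);
  console.error(`[${provider}] provider failed:`, text);
  const payload: ProviderErrorMessage = { type: `${provider}-error`, error: text };
  try {
    writer.send(payload);
  } catch (sendErr) {
    // The client may already be gone; never throw out of a catch handler.
    console.error(`[${provider}] could not deliver error to client:`, sendErr);
  }
}

// Hypothetical usage inside a provider's promise chain.
async function runProvider(
  writer: Writer,
  provider: Provider,
  queryProvider: () => Promise<void>,
): Promise<void> {
  await queryProvider().catch((err: unknown) => reportProviderError(writer, provider, err));
}
```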

Files Changed

| File | Change |
| --- | --- |
| server/index.js | +46 lines: error sends in 7 catch blocks + busy sends in 7 guard blocks |
| src/components/chat/hooks/useChatRealtimeHandlers.ts | +11 lines: session-busy handler |
| src/contexts/WebSocketContext.tsx | +9 lines: exponential backoff with retry counter |
| server/database/db.js | +8 lines: two CREATE INDEX IF NOT EXISTS statements in migrations |
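The busy sends in server/index.js and the session-busy handler in useChatRealtimeHandlers.ts could look roughly like the sketch below. It assumes a simple in-memory Set of active session IDs and a { type, sessionId, message } payload; tryAcquireSession, releaseSession, handleRealtimeMessage, and ChatNotice are illustrative names, not the PR's actual code.

```typescript
// Hypothetical message and writer shapes; field names are assumptions.
interface SessionBusyMessage {
  type: "session-busy";
  sessionId: string;
  message: string;
}

interface Writer {
  send(message: object): void;
}

// Server side: hypothetical registry of sessions with an in-flight request.
const activeSessions = new Set<string>();

// Returns true if the request may proceed. Otherwise it tells the client the
// session is busy instead of dropping the request silently.
function tryAcquireSession(writer: Writer, sessionId: string): boolean {
  if (activeSessions.has(sessionId)) {
    const busy: SessionBusyMessage = {
      type: "session-busy",
      sessionId,
      message: "This session is still processing a previous request.",
    };
    writer.send(busy);
    return false;
  }
  activeSessions.add(sessionId);
  return true;
}

function releaseSession(sessionId: string): void {
  activeSessions.delete(sessionId);
}

// Client side: hypothetical switch in the realtime handler that turns the
// message into an inline chat notice.
interface ChatNotice {
  kind: "notice";
  text: string;
}

function handleRealtimeMessage(msg: { type: string; message?: string }, notices: ChatNotice[]): void {
  switch (msg.type) {
    case "session-busy":
      notices.push({ kind: "notice", text: msg.message ?? "Session is busy." });
      break;
    default:
      // Other message types are handled elsewhere.
      break;
  }
}
```

In a real request path, releaseSession would belong in a finally block so that a crashed provider does not leave its session marked busy indefinitely.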

Test plan

  • Start a chat session and kill the server mid-stream → verify that an error appears in the chat instead of an infinite spinner
  • Send a message while Claude is still responding → verify that the "session is still processing" inline message appears
  • Stop the server and watch the Network tab in browser DevTools → verify that reconnect intervals increase (3s, 6s, 12s, ...)
  • Restart the server → verify that the reconnect delay resets to 3s on the next disconnect
  • Check that SQLite PRAGMA index_list(session_metadata) lists the new indexes after a restart
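The reconnect schedule checked in the test plan can be expressed as a pure delay function plus a small scheduler that counts attempts and resets when a connection opens. This is a sketch assuming 0-based attempt numbering; ReconnectScheduler is hypothetical, and the actual WebSocketContext.tsx likely keeps its retry counter in a React ref instead.

```typescript
// Delays from the PR description: 3s, 6s, 12s, 24s, then capped at 30s.
const BASE_DELAY_MS = 3000;
const MAX_DELAY_MS = 30000;

// Delay before reconnect attempt number `attempt` (0-based): base * 2^attempt, capped.
function reconnectDelay(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Hypothetical controller wired to a socket's open/close events.
class ReconnectScheduler {
  private attempt = 0;
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly connect: () => void;

  constructor(connect: () => void) {
    this.connect = connect;
  }

  // Call from the socket's close handler; returns the delay it scheduled.
  onClose(): number {
    this.cancel();
    const delay = reconnectDelay(this.attempt);
    this.attempt += 1;
    this.timer = setTimeout(() => {
      this.timer = null;
      this.connect();
    }, delay);
    return delay;
  }

  // Call from the socket's open handler: the next outage starts again at 3s.
  onOpen(): void {
    this.attempt = 0;
  }

  // Clear any pending reconnect, e.g. when the component unmounts.
  cancel(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
  }
}
```

Keeping reconnectDelay pure makes the 3s → 6s → 12s → 24s → 30s sequence easy to unit-test without fake timers; the scheduler only adds the counter and the reset.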

🤖 Generated with Claude Code

1. Provider crash error reporting: .catch() blocks now send error
   messages to the browser instead of only logging to the server console.
   The frontend already handles all error types; only the
   server-side writer.send() calls were missing. (7 catch blocks across 6 providers)

2. Session-busy notifications: concurrent requests to active sessions
   now return a 'session-busy' message instead of being silently dropped.
   Added a frontend handler to show an inline notification. (7 guards + 1 handler)

3. WebSocket exponential backoff: replaces the fixed 3s reconnect delay with
   3s → 6s → 12s → 24s → 30s (cap). Resets on a successful connection.

4. Database indexes: adds last_activity and composite (project_name,
   last_activity) indexes to session_metadata for sort query performance.
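The two index migrations from item 4 might be written as idempotent statements run at startup. The SqlExecutor interface and both index names are hypothetical (a better-sqlite3 Database, for instance, exposes a compatible exec method); the table and column lists match the description.

```typescript
// Minimal hypothetical interface for whatever runs SQL in server/database/db.js.
interface SqlExecutor {
  exec(sql: string): unknown;
}

// Index names are illustrative; the PR's actual names may differ.
const SESSION_METADATA_INDEXES: readonly string[] = [
  // Supports ORDER BY last_activity DESC across all projects.
  "CREATE INDEX IF NOT EXISTS idx_session_metadata_last_activity ON session_metadata (last_activity)",
  // Equality column first, sort column second.
  "CREATE INDEX IF NOT EXISTS idx_session_metadata_project_activity ON session_metadata (project_name, last_activity)",
];

// IF NOT EXISTS makes this safe to run on every startup.
function migrateSessionIndexes(db: SqlExecutor): void {
  for (const sql of SESSION_METADATA_INDEXES) {
    db.exec(sql);
  }
}
```

For a query of the form WHERE project_name = ? ORDER BY last_activity DESC, putting project_name first lets SQLite seek to one project's rows and read them already ordered by last_activity. SQLite can walk an ascending index in reverse, so a separate DESC index should not be needed.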

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zhang-Henry merged commit 873984b into OpenLAIR:main on Apr 9, 2026
1 check passed