Skip to content

feat(chat): session recovery with retry on errors#81

Merged
Zhang-Henry merged 3 commits intomainfrom
feat/session-recovery
Mar 26, 2026
Merged

feat(chat): session recovery with retry on errors#81
Zhang-Henry merged 3 commits intomainfrom
feat/session-recovery

Conversation

@MarkSiqiZhang
Copy link
Copy Markdown
Collaborator

Summary

  • Sessions interrupted by usage limits or API errors can now be recovered instead of being lost
  • Error messages show a Retry button that pre-fills the last user message for easy re-send
  • Usage limit errors display a clean "AI usage limit reached" message

Problem

When a session was interrupted (usage limit, API error, network issue), users couldn't continue the conversation:

  1. Error messages showed raw text with no way to retry
  2. No error classification — frontend couldn't distinguish retryable errors from permanent ones

Changes

Backend (all 3 providers)

  • server/claude-sdk.js — Record session to index on early errors; classify errors (usage_limit, overloaded, network, auth) and send
    errorType/isRetryable with claude-error
  • server/openai-codex.js — Add error classification to codex-error
  • server/gemini-cli.js — Add error classification to sendGeminiError

Frontend

  • ChatMessage type — Added errorType and isRetryable fields
  • useChatRealtimeHandlers — Pass error metadata to ChatMessage for all providers; clear session timers on error to prevent stale loading state
  • MessageComponent — Render Retry button on retryable errors with usage limit hint
  • ChatInterface — Stable handleRetry callback using ref pattern (preserves React.memo)
  • formatUsageLimitText — Simplified to "AI usage limit reached. Please try again later."
  • i18n — All 3 locales (en, zh-CN, ko) updated

Test plan

  • Trigger a usage limit error → Retry button appears with hint text
  • Click Retry → last user message fills the input box (does NOT auto-send)
  • Non-retryable errors (auth) → no Retry button shown

Copy link
Copy Markdown
Collaborator

@Zhang-Henry Zhang-Henry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Good feature — error classification across all 3 providers with retry UX. Consistent approach. A few issues to address.

Issues

1. [Refactor] Error classification logic duplicated 3 times

claude-sdk.js, gemini-cli.js, and openai-codex.js contain identical regex blocks:

let errorType = 'unknown';
if (/usage[_ ]limit|rate[_ ]limit/i.test(errorMsg)) errorType = 'usage_limit';
else if (/overloaded/i.test(errorMsg)) errorType = 'overloaded';
else if (/network|ECONNREFUSED|ETIMEDOUT/i.test(errorMsg)) errorType = 'network';
else if (/\bauth\b|unauthorized|forbidden/i.test(errorMsg)) errorType = 'auth';

Should be extracted into a shared utility: classifyError(message) → { errorType, isRetryable }. Copy-pasted regex is a maintenance hazard — when a new error pattern is added, all 3 files must be updated identically.

2. [Bug/UX regression] formatUsageLimitText loses reset time information

The original code (chatFormatting.ts:133-163) parsed Claude AI usage limit reached|<timestamp> and converted the Unix timestamp to a user-friendly local time: "Your limit will reset at 14:00 GMT+8 (Shanghai) - 23 Mar 2026".

The new code replaces this with a blanket 'AI usage limit reached. Please try again later.' — users completely lose the information about when they can retry. If this is an intentional design decision, it should be called out in the PR description with rationale. Otherwise, the original timestamp parsing should be preserved.

3. [Bug] sessionStorage cleanup scope is too broad

const timerKeys = Object.keys(sessionStorage).filter((k) => k.startsWith('session_timer_start_'));
timerKeys.forEach((k) => sessionStorage.removeItem(k));
sessionStorage.removeItem('pendingSessionId');

This clears all session timers, not just the one for the errored session. If the user has multiple tabs, this could interfere with other active sessions. The fix direction is correct (error path was indeed missing pendingSessionId cleanup — the success path at claude-complete does clean it up conditionally), but should be scoped to the specific session ID.

4. [Nit] handleRetry dependency array includes unused textareaRef

const handleRetry = useCallback(() => { ... }, [setInput, textareaRef]);

textareaRef is not referenced inside the callback body. Remove it from the deps array.

5. [Design] Retry only pre-fills input, doesn't auto-send

For usage_limit errors, pre-filling makes sense (user should wait). But for transient errors like network/overloaded, users typically expect "Retry" to actually retry the request. Consider auto-sending for transient errors, or at least differentiate the button behavior.

6. [Verify] Session creation → immediate removal timing

The new logic in claude-sdk.js (lines 657-671) sends session-created on early errors, then immediately calls removeSession(capturedSessionId). The frontend receives session-created followed quickly by claude-error. Verify the frontend handles this "create then destroy" sequence gracefully — sidebar shouldn't flash or leave ghost sessions.

@MarkSiqiZhang
Copy link
Copy Markdown
Collaborator Author

MarkSiqiZhang commented Mar 25, 2026

Thanks for the thorough review! All 6 issues have been addressed:

  1. [Refactor] Error classification duplicated 3 times — Extracted into shared/errorClassifier.js. Exports classifyError() (regex-based for process-level errors) and classifySDKError() (per-provider structured error code mappings for Claude AssistantMessageError, Codex ResponseError.code, and Gemini gRPC codes). All 3 provider files now import from the shared module.

  2. [Bug/UX regression] formatUsageLimitText loses reset time — Restored. Now parses the Unix timestamp and displays local reset time with proper timezone: "Your limit will reset at 06:46 PM GMT-4 (America/New_York) - Mar 24, 2026". Also fixed GMT offset for fractional-hour timezones (e.g. UTC+5:30).

  3. [Bug] sessionStorage cleanup scope too broad — Scoped. clearSessionTimerStart was already called by markSessionsAsCompleted so the redundant call was removed. pendingSessionId is now only cleared when it matches the errored session, matching the pattern in the abort handler.

  4. [Nit] Unused textareaRef in deps — Removed.

  5. [Design] Retry only pre-fills, doesn't auto-send — Retry now always auto-resends the previous message via submitProgrammaticInput.

  6. [Verify] Session create → immediate removal timing — Verified safe. recordIndexedSession persists to disk before removeSession clears memory. Frontend session-created handler adds the session to sidebar, and claude-error handler only marks it completed (clears loading state) — doesn't remove the sidebar entry. No ghost sessions or flashing.

Additional improvements in this update:

  • Detected SDK-level streaming errors (message.error on Claude AssistantMessage, error/turn.failed events on Codex) that previously bypassed the error UI entirely
  • Fixed dual error emission in Codex — error/turn.failed events now send only codex-error, not both codex-response + codex-error
  • Broadened regex patterns to match real provider error strings (Gemini "capacity exhausted (HTTP 429)", common Node.js network errors like ECONNRESET/ENOTFOUND)

@Zhang-Henry Zhang-Henry force-pushed the feat/session-recovery branch from a54d724 to 7f9936a Compare March 26, 2026 06:43
  - Record sessions eagerly on error so they appear in sidebar even if
    the first API call fails (claude-sdk.js)
  - Classify errors (usage_limit, overloaded, network, auth) across all
    three providers and send errorType/isRetryable to frontend
  - Add Retry button on retryable error messages that pre-fills the last
    user message into the input box
  - Simplify usage limit display to "AI usage limit reached" without
    timezone-dependent reset timestamps
  - Update all 3 i18n locales (en, zh-CN, ko)
Error handler now clears all pending session_timer_start_* keys from
sessionStorage, preventing useChatSessionState from re-triggering
isLoading after clearLoadingIndicators has already run.
…ery issues

  - Extract duplicated error classification regex into shared/errorClassifier.js
    with per-provider SDK error code mappings (Claude, Codex, Gemini)
  - Detect SDK-level streaming errors (e.g. rate_limit, rate_limit_exceeded)
    that previously bypassed the error UI in Claude and Codex
  - Restore usage limit reset time display instead of generic "try again later"
  - Scope sessionStorage cleanup to errored session only (multi-tab safe)
  - Fix handleRetry to auto-resend the previous message on retry
  - Remove unused textareaRef from handleRetry dependency array
  - Fix GMT offset for fractional-hour timezones (e.g. UTC+5:30)
  - Skip codex-response for error events to prevent duplicate error messages
@Zhang-Henry Zhang-Henry force-pushed the feat/session-recovery branch from 7f9936a to fbaa8de Compare March 26, 2026 06:50
@Zhang-Henry Zhang-Henry merged commit ff4ab40 into main Mar 26, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants