Skip to content

fix(channels): add circuit breaker to typing keepalive loop#27593

Closed
xilanhua12138 wants to merge 1 commit intoopenclaw:mainfrom
xilanhua12138:fix/typing-keepalive-circuit-breaker
Closed

fix(channels): add circuit breaker to typing keepalive loop#27593
xilanhua12138 wants to merge 1 commit intoopenclaw:mainfrom
xilanhua12138:fix/typing-keepalive-circuit-breaker

Conversation

@xilanhua12138
Copy link

Summary

When the typing indicator start() call fails (e.g., the target message has been deleted or the API returns an error), the keepalive loop previously continued firing requests every 3 seconds indefinitely. This caused cascading issues:

  • Repeated requests to a non-existent resource
  • API rate limiting (observed as Feishu 429 errors that blocked all subsequent API calls)
  • Unnecessary load on both the gateway and upstream APIs

This PR adds a circuit breaker to createTypingCallbacks:

  • Tracks consecutive start() failures
  • Stops the keepalive loop after 2 consecutive failures (configurable via maxConsecutiveFailures)
  • Resets the failure counter on any successful call or when a new reply starts

Changes

  • src/channels/typing.ts — Added consecutiveFailures counter and maxConsecutiveFailures option (default: 2). The keepalive loop is stopped when the threshold is reached.
  • src/channels/typing.test.ts — Added 3 test cases:
    • Circuit breaker trips after consecutive failures
    • Failure counter resets on intermittent success (flaky network)
    • Failure counter resets when a new reply starts

Test plan

  • All existing typing tests pass
  • New circuit breaker tests pass (8/8 total)
  • Verified with pnpm vitest run src/channels/typing.test.ts

When the typing indicator start() call fails (e.g., message deleted,
API errors), the keepalive loop previously continued firing indefinitely.
This could trigger API rate limiting (e.g., Feishu 429) by sending
requests every 3 seconds for a resource that no longer exists.

Add a consecutive failure counter that stops the keepalive loop after
2 consecutive failures (configurable via maxConsecutiveFailures). The
counter resets on any successful call or when a new reply starts.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21b913b366

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +45 to 48
consecutiveFailures = 0;
keepaliveLoop.stop();
await fireStart();
keepaliveLoop.start();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip keepalive restart after breaker trips on first failure

When maxConsecutiveFailures is configured to 1 (or any threshold reached by the initial fireStart()), fireStart() can trip the breaker but onReplyStart() still unconditionally calls keepaliveLoop.start(). That restarts periodic retries even though the breaker condition has already been met, so the new circuit-breaker contract is violated for this input and failing channels continue issuing requests until a later tick stops them again.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Added circuit breaker to createTypingCallbacks to prevent cascading failures when typing indicator API calls fail repeatedly. The implementation tracks consecutive start() failures and stops the keepalive loop after 2 consecutive failures (configurable via maxConsecutiveFailures). The failure counter intelligently resets on success (handling flaky networks) or when a new reply starts.

Key changes:

  • Added consecutiveFailures counter and maxConsecutiveFailures parameter (default: 2) to src/channels/typing.ts
  • Circuit breaker trips when consecutive failures reach threshold, preventing cascading rate-limit errors (like Feishu 429s)
  • Three comprehensive test cases verify correct behavior in different failure scenarios
  • Backward compatible: optional parameter with sensible default, no breaking changes to existing call sites

Confidence Score: 5/5

  • This PR is safe to merge with no concerns
  • Clean implementation with comprehensive test coverage that solves a real production issue (API rate limiting). The circuit breaker logic is correct, well-tested (3 new test cases covering failure scenarios, intermittent success, and counter resets), and backward compatible. No logical errors, security issues, or breaking changes found.
  • No files require special attention

Last reviewed commit: 21b913b

@steipete
Copy link
Contributor

Superseded by main commit 37a138c.

Closed in favor of the integrated cluster landing. The final merge includes:

  • shared typing keepalive consecutive-failure breaker,
  • fix for the threshold edge case (no restart when breaker trips on first start),
  • cross-channel/internal-webchat suppression at dispatch,
  • and enforced dispatcher lifecycle cleanup in extension callsites.

Resolved issues in the unified landing:

Thank you; the circuit-breaker direction was incorporated and tightened in the final merged patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants