feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats by guci314 · Pull Request #64031 · openclaw/openclaw

guci314 · 2026-04-10T02:35:48Z

Summary

Adds a new community plugin rate-limit-circuit-breaker that prevents infinite message loops in multi-agent group chats when LLM API rate limits are hit
Uses a standard circuit breaker pattern (closed → open → half-open) with exponential backoff
Hooks into message_sending to suppress repeated rate-limit error messages per room

Problem

When multiple agents share a group chat with requireMention: false, a single rate-limit surface_error cascades into an infinite loop:

Agent A calls LLM API → 429 rate limit → gateway surfaces error to chat
Agent B sees error message → tries to respond → also rate-limited → surfaces error
5 agents × continuous cycle = 6.3 error messages/minute, indefinitely

Observed in production: 146 surface_error events in 13 minutes across 5 agents sharing one Matrix room and one zhipu API key. The LLM was never invoked — the loop was entirely in the gateway's error surfacing path. No amount of LLM intelligence can break this loop because the LLM is never called.

Solution

The plugin intercepts outgoing messages via the message_sending hook:

CLOSED: Counts consecutive rate-limit errors per room. First N-1 errors pass through (agents can see the error).
OPEN: On the Nth error, suppresses all further error messages for a cooldown period. Normal messages still flow.
HALF_OPEN: After cooldown, allows one retry. Success → reset. Failure → re-open with 2× cooldown.

Configurable via plugins.entries in openclaw.json:

"rate-limit-circuit-breaker": {
  "enabled": true,
  "config": {
    "maxConsecutiveErrors": 3,
    "baseCooldownMs": 60000,
    "maxCooldownMs": 600000
  }
}

Verification

After deployment (combined with fallback model config):

surface_error events: 146 → 0
Death loops: 13+ minutes → none
Normal agent communication: unaffected

Test plan

22 unit tests covering all state transitions, exponential backoff, room isolation, cleanup
End-to-end verification in production Matrix group with 5 agents
Confirmed fallback model failover works as primary defense
Confirmed circuit breaker activates as secondary defense when all fallbacks exhausted

🤖 Generated with Claude Code

…n group chats When multiple agents share a group chat with requireMention:false, a rate-limit error surfaced by one agent triggers other agents to respond, causing them to also hit rate limits in an infinite cascade. This plugin detects consecutive rate-limit error messages per room and suppresses them using a circuit breaker pattern with exponential backoff. Verified in production: a 5-agent Matrix group hit a 13+ minute death loop (146 surface_error events) before this fix. After deployment, zero surface_error events with the same rate limit conditions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-04-10T02:39:38Z

Greptile Summary

This PR adds a new bundled plugin, rate-limit-circuit-breaker, that intercepts outgoing message_sending events to suppress repeated rate-limit error messages using a standard closed/open/half-open circuit breaker with exponential backoff. The state machine logic and tests are well-implemented.

Hook priority is inverted (P1): { priority: -100 } causes this hook to run last among message_sending handlers (the system sorts descending). The comment says "Run early" but the sign should be positive (e.g. +100) to actually pre-empt other hooks. Message delivery is still suppressed, but the stated isolation benefit is not achieved.
cleanup() leaks half_open entries (P2): rooms that go quiet in half_open state are never evicted by the cleanup job.

Confidence Score: 4/5

Safe to merge after fixing the inverted hook priority; the end-to-end suppression works but the priority sign is clearly wrong and contradicts its own comment.

One P1 finding (hook priority -100 vs. intended +100) means other message_sending hooks run before cancellation fires, defeating the stated isolation goal. The core functionality still works end-to-end, but the intent/code mismatch warrants a fix before landing.

extensions/rate-limit-circuit-breaker/index.ts (priority bug), extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts (cleanup half_open gap)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 64

Comment:
**Hook priority is backwards — this hook runs last, not first**

The hook system sorts by descending priority (`(b.priority ?? 0) - (a.priority ?? 0)` in `src/plugins/hooks.ts:192`), so higher numbers run first. With `priority: -100` this hook runs *after* all default-priority (0) hooks, meaning other `message_sending` hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.

Message delivery is still suppressed (the final `hookResult.cancelled` check in `deliver.ts` respects `cancel: true` regardless of order), but the side-effect isolation benefit is lost.

```suggestion
    { priority: 100 }, // Run early so we can cancel before other hooks process
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts
Line: 224-235

Comment:
**`half_open` entries are never cleaned up**

`cleanup()` removes `closed`+zero-error rooms and `open` rooms whose `openedAt` is older than `maxAgeMs`, but `half_open` rooms are silently skipped. A room that transitions to `half_open` (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The `openedAt` timestamp survives the `open → half_open` transition and can be used for staleness here.

```suggestion
  cleanup(maxAgeMs: number = 3_600_000): void {
    const now = Date.now();
    for (const [key, room] of this.rooms) {
      if (room.state === "closed" && room.consecutiveErrors === 0) {
        this.rooms.delete(key);
        continue;
      }
      if (
        (room.state === "open" || room.state === "half_open") &&
        now - room.openedAt > maxAgeMs
      ) {
        this.rooms.delete(key);
      }
    }
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 27

Comment:
**`api: any` violates the strict-typing project rule**

Per the project coding guidelines, `any` should be avoided and extension production code should import from `openclaw/plugin-sdk/*`. The typed entry point (`PluginEntryApi` or equivalent from `openclaw/plugin-sdk/plugin-entry`) would give proper inference on `api.on`, `api.pluginConfig`, `api.logger`, and `api.registerGatewayMethod`, catching mismatched hook signatures at compile time rather than at runtime.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "feat(plugin): add rate-limit-circuit-bre..." | Re-trigger Greptile}

greptile-apps · 2026-04-10T02:39:41Z

+      // Allow the message through (no modification)
+      return undefined;
+    },
+    { priority: -100 }, // Run early so we can cancel before other hooks process


Hook priority is backwards — this hook runs last, not first

The hook system sorts by descending priority ((b.priority ?? 0) - (a.priority ?? 0) in src/plugins/hooks.ts:192), so higher numbers run first. With priority: -100 this hook runs after all default-priority (0) hooks, meaning other message_sending hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.

Message delivery is still suppressed (the final hookResult.cancelled check in deliver.ts respects cancel: true regardless of order), but the side-effect isolation benefit is lost.

Suggested change

{ priority: -100 }, // Run early so we can cancel before other hooks process

{ priority: 100 }, // Run early so we can cancel before other hooks process

Prompt To Fix With AI

This is a comment left during a code review. Path: extensions/rate-limit-circuit-breaker/index.ts Line: 64 Comment: **Hook priority is backwards — this hook runs last, not first** The hook system sorts by descending priority (`(b.priority ?? 0) - (a.priority ?? 0)` in `src/plugins/hooks.ts:192`), so higher numbers run first. With `priority: -100` this hook runs *after* all default-priority (0) hooks, meaning other `message_sending` hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted. Message delivery is still suppressed (the final `hookResult.cancelled` check in `deliver.ts` respects `cancel: true` regardless of order), but the side-effect isolation benefit is lost. ```suggestion { priority: 100 }, // Run early so we can cancel before other hooks process ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-10T02:39:42Z

+  cleanup(maxAgeMs: number = 3_600_000): void {
+    const now = Date.now();
+    for (const [key, room] of this.rooms) {
+      if (room.state === "closed" && room.consecutiveErrors === 0) {
+        this.rooms.delete(key);
+        continue;
+      }
+      if (room.state === "open" && now - room.openedAt > maxAgeMs) {
+        this.rooms.delete(key);
+      }
+    }
+  }


half_open entries are never cleaned up

cleanup() removes closed+zero-error rooms and open rooms whose openedAt is older than maxAgeMs, but half_open rooms are silently skipped. A room that transitions to half_open (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The openedAt timestamp survives the open → half_open transition and can be used for staleness here.

Suggested change

cleanup(maxAgeMs: number = 3_600_000): void {

const now = Date.now();

for (const [key, room] of this.rooms) {

if (room.state === "closed" && room.consecutiveErrors === 0) {

this.rooms.delete(key);

continue;

}

if (room.state === "open" && now - room.openedAt > maxAgeMs) {

this.rooms.delete(key);

}

}

}

cleanup(maxAgeMs: number = 3_600_000): void {

const now = Date.now();

for (const [key, room] of this.rooms) {

if (room.state === "closed" && room.consecutiveErrors === 0) {

this.rooms.delete(key);

continue;

}

if (

(room.state === "open" || room.state === "half_open") &&

now - room.openedAt > maxAgeMs

) {

this.rooms.delete(key);

}

}

}

Prompt To Fix With AI

This is a comment left during a code review. Path: extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts Line: 224-235 Comment: **`half_open` entries are never cleaned up** `cleanup()` removes `closed`+zero-error rooms and `open` rooms whose `openedAt` is older than `maxAgeMs`, but `half_open` rooms are silently skipped. A room that transitions to `half_open` (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The `openedAt` timestamp survives the `open → half_open` transition and can be used for staleness here. ```suggestion cleanup(maxAgeMs: number = 3_600_000): void { const now = Date.now(); for (const [key, room] of this.rooms) { if (room.state === "closed" && room.consecutiveErrors === 0) { this.rooms.delete(key); continue; } if ( (room.state === "open" || room.state === "half_open") && now - room.openedAt > maxAgeMs ) { this.rooms.delete(key); } } } ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-10T02:39:43Z

+// Periodic cleanup interval handle
+let cleanupInterval: ReturnType<typeof setInterval> | null = null;
+
+export default function register(api: any) {


api: any violates the strict-typing project rule

Per the project coding guidelines, any should be avoided and extension production code should import from openclaw/plugin-sdk/*. The typed entry point (PluginEntryApi or equivalent from openclaw/plugin-sdk/plugin-entry) would give proper inference on api.on, api.pluginConfig, api.logger, and api.registerGatewayMethod, catching mismatched hook signatures at compile time rather than at runtime.

Prompt To Fix With AI

This is a comment left during a code review. Path: extensions/rate-limit-circuit-breaker/index.ts Line: 27 Comment: **`api: any` violates the strict-typing project rule** Per the project coding guidelines, `any` should be avoided and extension production code should import from `openclaw/plugin-sdk/*`. The typed entry point (`PluginEntryApi` or equivalent from `openclaw/plugin-sdk/plugin-entry`) would give proper inference on `api.on`, `api.pluginConfig`, `api.logger`, and `api.registerGatewayMethod`, catching mismatched hook signatures at compile time rather than at runtime. How can I resolve this? If you propose a fix, please make it concise.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0fcd5faac6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-10T02:41:34Z

+ */
+const RATE_LIMIT_PATTERNS: RegExp[] = [
+  /API rate limit reached/i,
+  /rate limit/i,


Narrow rate-limit detection to real surfaced error copy

The matcher /rate limit/i treats ordinary assistant text as an error signal, so normal replies like “here’s how to handle API rate limits” increment consecutiveErrors. After three such legitimate messages in the same room, the breaker opens and starts canceling non-error conversation. This creates false positives that can suppress real user-visible responses; the pattern set should be restricted to explicit surfaced error formats instead of broad phrase matches.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-10T02:41:34Z

+      // Allow the message through (no modification)
+      return undefined;
+    },
+    { priority: -100 }, // Run early so we can cancel before other hooks process


Raise hook priority so cancellation runs before other hooks

message_sending hooks are executed in descending priority order (higher first), so priority: -100 runs late, not early. That means other default-priority hooks can execute side effects before this plugin cancels the message, which defeats the stated goal of short-circuiting repeated error sends; for example, other plugins can still perform outbound checks/actions on messages that should have been dropped.

Useful? React with 👍 / 👎.

steipete · 2026-04-25T15:54:59Z

Closing this as better suited for ClawHub/community plugin work after Codex review.

Close as clawhub: this is a useful plugin idea, but current OpenClaw policy and docs say optional community plugins should ship externally via ClawHub/npm rather than land as new bundled plugins in core. Main already exposes the message_sending hook and typed plugin-entry API needed to implement it outside the repo.

What I checked:

Repo vision prefers external plugin distribution: VISION.md says the preferred plugin path is npm/local development in the plugin's own repository, that the bar for adding optional plugins to core is intentionally high, and that plugin discovery/review belongs in ClawHub. (VISION.md:68, 5b8bd6371c66)
Public plugin docs say not to add new plugins to this repo: docs/plugins/building-plugins.md states: contributors do not need to add plugins to the OpenClaw repository and should publish to ClawHub or npm instead; it also documents the external-plugin publish/install flow. (docs/plugins/building-plugins.md:16, 5b8bd6371c66)
Current main already exposes the needed hook seam: docs/plugins/hooks.md documents message_sending as the outbound hook for rewriting or cancelling delivery, with descending priority and terminal cancel: true behavior. (docs/plugins/hooks.md:52, 5b8bd6371c66)
Hook ordering and cancellation behavior are implemented on main: Hook registrations are sorted by descending priority in src/plugins/hooks.ts, and outbound delivery skips send when hookResult.cancelled is set in src/infra/outbound/deliver.ts. A regression test in src/infra/outbound/deliver.test.ts verifies higher-priority message_sending cancellation short-circuits lower-priority hooks. (src/plugins/hooks.ts:185, 5b8bd6371c66)
Typed external plugin entrypoint already exists on main: src/plugin-sdk/plugin-entry.ts exports definePluginEntry(...) with register: (api: OpenClawPluginApi) => void, which is the supported non-channel plugin seam for publishing this as an external plugin instead of bundling it in core. (src/plugin-sdk/plugin-entry.ts:168, 5b8bd6371c66)
Existing PR review found branch-local correctness issues: The provided GitHub review context already calls out an inverted hook priority, a half_open cleanup leak, and false-positive rate-limit matching risk on PR head 0fcd5faac6877aa8904b8bdf0902da84d7cc1f9f, reinforcing that this is not a maintainer-ready bundled-plugin merge as submitted. (0fcd5faac687)

So I’m closing this as a scope-fit item for the plugin/community path rather than keeping it open as an OpenClaw core request.

Review notes: reviewed against 5b8bd6371c66.

openclaw-barnacle Bot added the size: L label Apr 10, 2026

greptile-apps Bot reviewed Apr 10, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 10, 2026

View reviewed changes

guci314 mentioned this pull request Apr 11, 2026

bug(matrix): requireMention does not detect m.mentions from non-OpenClaw senders #64785

Closed

sk7n4k3d mentioned this pull request Apr 20, 2026

Gateway-level model identity not passed to delivery hooks (enables silent model-tag forgery) #69110

Open

steipete closed this Apr 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats#64031

feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats#64031
guci314 wants to merge 1 commit into
openclaw:mainfrom
guci314:feat/rate-limit-circuit-breaker

guci314 commented Apr 10, 2026

Uh oh!

greptile-apps Bot commented Apr 10, 2026

Uh oh!

greptile-apps Bot Apr 10, 2026

Uh oh!

greptile-apps Bot Apr 10, 2026

Uh oh!

greptile-apps Bot Apr 10, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 10, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 10, 2026

Uh oh!

steipete commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	{ priority: -100 }, // Run early so we can cancel before other hooks process
	{ priority: 100 }, // Run early so we can cancel before other hooks process

Uh oh!

Conversation

guci314 commented Apr 10, 2026

Summary

Problem

Solution

Verification

Test plan

Uh oh!

greptile-apps Bot commented Apr 10, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

steipete commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants