feat(gateway): add stuck session detection by CyberSinister · Pull Request #16125 · openclaw/openclaw

CyberSinister · 2026-02-14T09:20:39Z

Summary

Adds configurable stuck session detection to the gateway. Sessions that exceed a configurable time threshold are detected and handled according to the configured action.

Problem

Long-running agent sessions can get stuck due to network issues, API hangs, or infinite loops. Currently there's no mechanism to detect or recover from these situations automatically.

Solution

Add stuckDetection config to gateway options
Track startedAt timestamp on chat run entries
Sweep every 60 seconds checking for sessions exceeding threshold
Configurable threshold (default 5 minutes) and action (log, notify, or abort)
Add entries() and size() methods to ChatRunRegistry for introspection

Configuration

{
  "gateway": {
    "stuckDetection": {
      "enabled": true,
      "thresholdMinutes": 5,
      "action": "abort"
    }
  }
}

Changes

src/config/types.gateway.ts — Add GatewayStuckDetectionConfig type
src/config/zod-schema.ts — Add zod validation schema
src/gateway/server-chat.ts — Track startedAt, add registry methods
src/gateway/server-maintenance.ts — Stuck detection sweep logic
src/gateway/server.impl.ts — Wire up config
Updated test files for new interfaces

Testing

Existing tests updated to accommodate new interface changes. Stuck detection sweep runs within the existing maintenance timer infrastructure.

Greptile Summary

Adds configurable stuck session detection to the gateway, sweeping every 60 seconds and supporting log, notify, and abort actions. Also includes a correct bug fix in chat-abort.ts where removeChatRun was called with runId instead of sessionId.

Runtime crash: server.impl.ts:433 references logGateway which is not defined in scope — will throw a ReferenceError the first time the stuck detection sweep triggers. Should be log.
Skipped entries during abort: server-maintenance.ts:131-187 iterates over live Map and Array iterators while the abort action mutates them via removeChatRun, causing entries to be skipped. Snapshot the iterables before the loop.
Good bug fix in chat-abort.ts: corrects the first argument to removeChatRun from runId to active.sessionId.

Confidence Score: 2/5

This PR has a definite runtime crash (logGateway ReferenceError) and a mutation-during-iteration bug that must be fixed before merging.
Score of 2 reflects two confirmed bugs: (1) an undefined variable reference that will crash at runtime on any stuck detection log, and (2) a collection mutation during iteration that silently skips entries when aborting stuck runs. The overall design and remaining changes are sound.
src/gateway/server.impl.ts (undefined logGateway reference) and src/gateway/server-maintenance.ts (mutation during iteration in abort path)

_{Last reviewed commit: 7cf7f98}

- Add GatewayStuckDetectionConfig to config types - Extend ChatRunEntry with startedAt timestamp for tracking - Add entries() and size() methods to ChatRunRegistry - Add stuck detection sweep in maintenance timer (60s interval) - Configurable threshold (default 5 min) and action (log/notify/abort) - Wire up in server.impl.ts with stuckDetection config Closes: session reliability improvements

greptile-apps

_{10 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-14T09:24:45Z

src/gateway/server-maintenance.ts

+      for (const [sessionId, entries] of params.chatRunRegistry.entries()) {
+        for (const entry of entries) {
+          const elapsed = now - entry.startedAt;
+          if (elapsed > thresholdMs) {
+            const elapsedMin = Math.round(elapsed / 60_000);
+            const meta = {
+              sessionId,
+              sessionKey: entry.sessionKey,
+              clientRunId: entry.clientRunId,
+              elapsedMinutes: elapsedMin,
+            };
+
+            if (action === "log" || action === "notify") {
+              params.logStuck?.warn(
+                `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
+                meta,
+              );
+            }
+
+            if (action === "abort") {
+              params.logStuck?.warn(
+                `Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`,
+                meta,
+              );
+              abortChatRunById(
+                {
+                  chatAbortControllers: params.chatAbortControllers,
+                  chatRunBuffers: params.chatRunBuffers,
+                  chatDeltaSentAt: params.chatDeltaSentAt,
+                  chatAbortedRuns: params.chatRunState.abortedRuns,
+                  removeChatRun: params.removeChatRun,
+                  agentRunSeq: params.agentRunSeq,
+                  broadcast: params.broadcast,
+                  nodeSendToSession: params.nodeSendToSession,
+                },
+                { runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" },
+              );


stuck run cleanup will fail for voice-triggered runs

the loop iterates [sessionId, entries] from the registry, but when calling abortChatRunById only entry.clientRunId is passed. abortChatRunById then calls removeChatRun(runId, runId, sessionKey) which looks up the registry by runId as the first parameter.

for voice-triggered runs (see server-node-events.ts:60), the registry key is sessionId but clientRunId is different. when abort tries to remove using clientRunId as the lookup key, it won't find the entry.

the stuck detection will log/abort repeatedly every 60s for the same stuck run because it's never removed from the registry.

fix: call params.removeChatRun(sessionId, entry.clientRunId, entry.sessionKey) directly before abortChatRunById to ensure cleanup uses the correct registry key.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/gateway/server-maintenance.ts Line: 131:167 Comment: stuck run cleanup will fail for voice-triggered runs the loop iterates `[sessionId, entries]` from the registry, but when calling `abortChatRunById` only `entry.clientRunId` is passed. `abortChatRunById` then calls `removeChatRun(runId, runId, sessionKey)` which looks up the registry by `runId` as the first parameter. for voice-triggered runs (see `server-node-events.ts:60`), the registry key is `sessionId` but `clientRunId` is different. when abort tries to remove using `clientRunId` as the lookup key, it won't find the entry. the stuck detection will log/abort repeatedly every 60s for the same stuck run because it's never removed from the registry. fix: call `params.removeChatRun(sessionId, entry.clientRunId, entry.sessionKey)` directly before `abortChatRunById` to ensure cleanup uses the correct registry key. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-14T09:24:47Z

src/gateway/server-maintenance.ts

+            if (action === "log" || action === "notify") {
+              params.logStuck?.warn(
+                `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
+                meta,
+              );
+            }


"notify" action only logs warnings, same as "log" action - consider implementing actual notification logic (e.g., broadcasting to monitoring systems) to differentiate the two actions

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: src/gateway/server-maintenance.ts Line: 143:148 Comment: `"notify"` action only logs warnings, same as `"log"` action - consider implementing actual notification logic (e.g., broadcasting to monitoring systems) to differentiate the two actions <sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub> How can I resolve this? If you propose a fix, please make it concise.

- Fix removeChatRun call: use active.sessionId instead of runId as first param (fixes cleanup for voice-triggered and non-standard runs) - Differentiate 'notify' from 'log' action: broadcast stuck-run event to connected clients and send to session via node, enabling monitoring dashboards and alerting integrations

CyberSinister · 2026-02-14T13:03:18Z

Addressed both Greptile review comments in latest push:

Voice-triggered run cleanup — Now calling removeChatRun(sessionId, entry.clientRunId, entry.sessionKey) before abortChatRunById so the registry entry is removed using the correct key (sessionId, not clientRunId). This prevents stuck detection from repeatedly logging/aborting the same run.
Notify vs log differentiation — The notify action now broadcasts a stuck-run event to connected WebSocket clients and subscribed nodes, in addition to logging. This enables monitoring dashboards and node-based alerting to react to stuck runs.

greptile-apps

_{11 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-17T00:40:37Z

src/gateway/server.impl.ts

    getHealthVersion,
    refreshGatewayHealthSnapshot,
    logHealth,
+    logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) },


logGateway is not defined — ReferenceError at runtime

logGateway does not exist in this scope. The gateway logger is declared as const log = createSubsystemLogger("gateway") at line 85. Other callsites in this file pass logGateway: log (e.g., lines 485, 501), which confirms log is the intended variable. This will throw a ReferenceError the first time the stuck detection sweep triggers a log.

Suggested change

logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) },

logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) },

Prompt To Fix With AI

This is a comment left during a code review. Path: src/gateway/server.impl.ts Line: 433:433 Comment: **`logGateway` is not defined — ReferenceError at runtime** `logGateway` does not exist in this scope. The gateway logger is declared as `const log = createSubsystemLogger("gateway")` at line 85. Other callsites in this file pass `logGateway: log` (e.g., lines 485, 501), which confirms `log` is the intended variable. This will throw a `ReferenceError` the first time the stuck detection sweep triggers a log. ```suggestion logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) }, ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed in commit 9e82c33 — changed logGateway to log which is the correct subsystem logger variable in scope (declared at line 85).

greptile-apps · 2026-02-17T00:40:38Z

src/gateway/server-maintenance.ts

+      for (const [sessionId, entries] of params.chatRunRegistry.entries()) {
+        for (const entry of entries) {
+          const elapsed = now - entry.startedAt;
+          if (elapsed > thresholdMs) {
+            const elapsedMin = Math.round(elapsed / 60_000);
+            const meta = {
+              sessionId,
+              sessionKey: entry.sessionKey,
+              clientRunId: entry.clientRunId,
+              elapsedMinutes: elapsedMin,
+            };
+
+            if (action === "log" || action === "notify") {
+              params.logStuck?.warn(
+                `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
+                meta,
+              );
+            }
+
+            if (action === "notify") {
+              params.broadcast("stuck-run", {
+                sessionId,
+                sessionKey: entry.sessionKey,
+                clientRunId: entry.clientRunId,
+                elapsedMinutes: elapsedMin,
+                thresholdMinutes: stuckConfig?.thresholdMinutes ?? DEFAULT_STUCK_THRESHOLD_MINUTES,
+              });
+              if (entry.sessionKey) {
+                params.nodeSendToSession(entry.sessionKey, "stuck-run", {
+                  clientRunId: entry.clientRunId,
+                  elapsedMinutes: elapsedMin,
+                });
+              }
+            }
+
+            if (action === "abort") {
+              params.logStuck?.warn(
+                `Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`,
+                meta,
+              );
+              abortChatRunById(
+                {
+                  chatAbortControllers: params.chatAbortControllers,
+                  chatRunBuffers: params.chatRunBuffers,
+                  chatDeltaSentAt: params.chatDeltaSentAt,
+                  chatAbortedRuns: params.chatRunState.abortedRuns,
+                  removeChatRun: params.removeChatRun,
+                  agentRunSeq: params.agentRunSeq,
+                  broadcast: params.broadcast,
+                  nodeSendToSession: params.nodeSendToSession,
+                },
+                { runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" },
+              );
+            }
+          }
+        }
+      }


Mutating the registry while iterating causes skipped entries

When action === "abort", the inner call to abortChatRunById → removeChatRun mutates the underlying Map and arrays while the outer/inner for...of loops are still iterating over them. Specifically:

registry.remove() calls queue.splice(idx, 1) on the entries array — the same array the inner loop is iterating — which shifts subsequent elements and causes entries to be skipped.

If the array becomes empty, chatRunSessions.delete(sessionId) removes the map key, which can cause the Map iterator to skip unvisited keys.

Snapshot the data before iterating to avoid this:

Suggested change

for (const [sessionId, entries] of params.chatRunRegistry.entries()) {

for (const entry of entries) {

const elapsed = now - entry.startedAt;

if (elapsed > thresholdMs) {

const elapsedMin = Math.round(elapsed / 60_000);

const meta = {

sessionId,

sessionKey: entry.sessionKey,

clientRunId: entry.clientRunId,

elapsedMinutes: elapsedMin,

};

if (action === "log" || action === "notify") {

params.logStuck?.warn(

`Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,

meta,

);

}

if (action === "notify") {

params.broadcast("stuck-run", {

sessionId,

sessionKey: entry.sessionKey,

clientRunId: entry.clientRunId,

elapsedMinutes: elapsedMin,

thresholdMinutes: stuckConfig?.thresholdMinutes ?? DEFAULT_STUCK_THRESHOLD_MINUTES,

});

if (entry.sessionKey) {

params.nodeSendToSession(entry.sessionKey, "stuck-run", {

clientRunId: entry.clientRunId,

elapsedMinutes: elapsedMin,

});

}

}

if (action === "abort") {

params.logStuck?.warn(

`Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`,

meta,

);

abortChatRunById(

{

chatAbortControllers: params.chatAbortControllers,

chatRunBuffers: params.chatRunBuffers,

chatDeltaSentAt: params.chatDeltaSentAt,

chatAbortedRuns: params.chatRunState.abortedRuns,

removeChatRun: params.removeChatRun,

agentRunSeq: params.agentRunSeq,

broadcast: params.broadcast,

nodeSendToSession: params.nodeSendToSession,

},

{ runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" },

);

}

}

}

}

for (const [sessionId, entries] of [...params.chatRunRegistry.entries()]) {

for (const entry of [...entries]) {

Prompt To Fix With AI

This is a comment left during a code review. Path: src/gateway/server-maintenance.ts Line: 131:187 Comment: **Mutating the registry while iterating causes skipped entries** When `action === "abort"`, the inner call to `abortChatRunById` → `removeChatRun` mutates the underlying `Map` and arrays while the outer/inner `for...of` loops are still iterating over them. Specifically: 1. `registry.remove()` calls `queue.splice(idx, 1)` on the entries array — the same array the inner loop is iterating — which shifts subsequent elements and causes entries to be skipped. 2. If the array becomes empty, `chatRunSessions.delete(sessionId)` removes the map key, which can cause the `Map` iterator to skip unvisited keys. Snapshot the data before iterating to avoid this: ```suggestion for (const [sessionId, entries] of [...params.chatRunRegistry.entries()]) { for (const entry of [...entries]) { ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed in commit 9e82c33 — now using Array.from() snapshots for both the outer Map iteration and inner array iteration before any mutation occurs. This prevents entries from being skipped when removeChatRun modifies the registry during abort.

… detection - server.impl.ts: logGateway is not in scope, use local 'log' variable - server-maintenance.ts: snapshot registry entries before iterating to prevent skipped entries when abort mutates the underlying Map/arrays Addresses Greptile review round 3 feedback on openclaw#16125

CyberSinister · 2026-02-27T07:00:44Z

Friendly bump — this has been open for ~2 weeks now. All Greptile feedback has been addressed in the latest push. Would love a maintainer review when you get a chance. Happy to make any further changes needed. 🙏

Samantha added 2 commits February 14, 2026 14:20

feat(config): add stuckDetection to gateway zod schema

4ff2a49

openclaw-barnacle bot added gateway Gateway runtime size: S labels Feb 14, 2026

greptile-apps bot reviewed Feb 14, 2026

View reviewed changes

thewilloftheshadow force-pushed the main branch from bfc1ccb to f92900f Compare February 15, 2026 18:46

steipete closed this Feb 16, 2026

steipete reopened this Feb 17, 2026

greptile-apps bot reviewed Feb 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(gateway): add stuck session detection#16125

feat(gateway): add stuck session detection#16125
CyberSinister wants to merge 3 commits intoopenclaw:mainfrom
CyberSinister:feat/stuck-detection

CyberSinister commented Feb 14, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 14, 2026

Uh oh!

greptile-apps bot Feb 14, 2026

Uh oh!

CyberSinister commented Feb 14, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 17, 2026

Uh oh!

CyberSinister Feb 17, 2026

Uh oh!

greptile-apps bot Feb 17, 2026

Uh oh!

CyberSinister Feb 17, 2026

Uh oh!

CyberSinister commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) },
	logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) },

Uh oh!

Conversation

CyberSinister commented Feb 14, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Configuration

Changes

Testing

Greptile Summary

Confidence Score: 2/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

CyberSinister commented Feb 14, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

CyberSinister Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

CyberSinister Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

CyberSinister commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CyberSinister commented Feb 14, 2026 •

edited by greptile-apps bot

Loading