Skip to content

feat(gateway): add stuck session detection#16125

Open
CyberSinister wants to merge 3 commits intoopenclaw:mainfrom
CyberSinister:feat/stuck-detection
Open

feat(gateway): add stuck session detection#16125
CyberSinister wants to merge 3 commits intoopenclaw:mainfrom
CyberSinister:feat/stuck-detection

Conversation

@CyberSinister
Copy link

@CyberSinister CyberSinister commented Feb 14, 2026

Summary

Adds configurable stuck session detection to the gateway. Sessions that exceed a configurable time threshold are detected and handled according to the configured action.

Problem

Long-running agent sessions can get stuck due to network issues, API hangs, or infinite loops. Currently there's no mechanism to detect or recover from these situations automatically.

Solution

  • Add stuckDetection config to gateway options
  • Track startedAt timestamp on chat run entries
  • Sweep every 60 seconds checking for sessions exceeding threshold
  • Configurable threshold (default 5 minutes) and action (log, notify, or abort)
  • Add entries() and size() methods to ChatRunRegistry for introspection

Configuration

{
  "gateway": {
    "stuckDetection": {
      "enabled": true,
      "thresholdMinutes": 5,
      "action": "abort"
    }
  }
}

Changes

  • src/config/types.gateway.ts — Add GatewayStuckDetectionConfig type
  • src/config/zod-schema.ts — Add zod validation schema
  • src/gateway/server-chat.ts — Track startedAt, add registry methods
  • src/gateway/server-maintenance.ts — Stuck detection sweep logic
  • src/gateway/server.impl.ts — Wire up config
  • Updated test files for new interfaces

Testing

Existing tests updated to accommodate new interface changes. Stuck detection sweep runs within the existing maintenance timer infrastructure.

Greptile Summary

Adds configurable stuck session detection to the gateway, sweeping every 60 seconds and supporting log, notify, and abort actions. Also includes a correct bug fix in chat-abort.ts where removeChatRun was called with runId instead of sessionId.

  • Runtime crash: server.impl.ts:433 references logGateway which is not defined in scope — will throw a ReferenceError the first time the stuck detection sweep triggers. Should be log.
  • Skipped entries during abort: server-maintenance.ts:131-187 iterates over live Map and Array iterators while the abort action mutates them via removeChatRun, causing entries to be skipped. Snapshot the iterables before the loop.
  • Good bug fix in chat-abort.ts: corrects the first argument to removeChatRun from runId to active.sessionId.

Confidence Score: 2/5

  • This PR has a definite runtime crash (logGateway ReferenceError) and a mutation-during-iteration bug that must be fixed before merging.
  • Score of 2 reflects two confirmed bugs: (1) an undefined variable reference that will crash at runtime on any stuck detection log, and (2) a collection mutation during iteration that silently skips entries when aborting stuck runs. The overall design and remaining changes are sound.
  • src/gateway/server.impl.ts (undefined logGateway reference) and src/gateway/server-maintenance.ts (mutation during iteration in abort path)

Last reviewed commit: 7cf7f98

Samantha added 2 commits February 14, 2026 14:20
- Add GatewayStuckDetectionConfig to config types
- Extend ChatRunEntry with startedAt timestamp for tracking
- Add entries() and size() methods to ChatRunRegistry
- Add stuck detection sweep in maintenance timer (60s interval)
- Configurable threshold (default 5 min) and action (log/notify/abort)
- Wire up in server.impl.ts with stuckDetection config

Closes: session reliability improvements
@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: S labels Feb 14, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +131 to +167
for (const [sessionId, entries] of params.chatRunRegistry.entries()) {
for (const entry of entries) {
const elapsed = now - entry.startedAt;
if (elapsed > thresholdMs) {
const elapsedMin = Math.round(elapsed / 60_000);
const meta = {
sessionId,
sessionKey: entry.sessionKey,
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
};

if (action === "log" || action === "notify") {
params.logStuck?.warn(
`Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
}

if (action === "abort") {
params.logStuck?.warn(
`Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
abortChatRunById(
{
chatAbortControllers: params.chatAbortControllers,
chatRunBuffers: params.chatRunBuffers,
chatDeltaSentAt: params.chatDeltaSentAt,
chatAbortedRuns: params.chatRunState.abortedRuns,
removeChatRun: params.removeChatRun,
agentRunSeq: params.agentRunSeq,
broadcast: params.broadcast,
nodeSendToSession: params.nodeSendToSession,
},
{ runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" },
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stuck run cleanup will fail for voice-triggered runs

the loop iterates [sessionId, entries] from the registry, but when calling abortChatRunById only entry.clientRunId is passed. abortChatRunById then calls removeChatRun(runId, runId, sessionKey) which looks up the registry by runId as the first parameter.

for voice-triggered runs (see server-node-events.ts:60), the registry key is sessionId but clientRunId is different. when abort tries to remove using clientRunId as the lookup key, it won't find the entry.

the stuck detection will log/abort repeatedly every 60s for the same stuck run because it's never removed from the registry.

fix: call params.removeChatRun(sessionId, entry.clientRunId, entry.sessionKey) directly before abortChatRunById to ensure cleanup uses the correct registry key.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 131:167

Comment:
stuck run cleanup will fail for voice-triggered runs

the loop iterates `[sessionId, entries]` from the registry, but when calling `abortChatRunById` only `entry.clientRunId` is passed. `abortChatRunById` then calls `removeChatRun(runId, runId, sessionKey)` which looks up the registry by `runId` as the first parameter.

for voice-triggered runs (see `server-node-events.ts:60`), the registry key is `sessionId` but `clientRunId` is different. when abort tries to remove using `clientRunId` as the lookup key, it won't find the entry.

the stuck detection will log/abort repeatedly every 60s for the same stuck run because it's never removed from the registry.

fix: call `params.removeChatRun(sessionId, entry.clientRunId, entry.sessionKey)` directly before `abortChatRunById` to ensure cleanup uses the correct registry key.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +143 to +148
if (action === "log" || action === "notify") {
params.logStuck?.warn(
`Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"notify" action only logs warnings, same as "log" action - consider implementing actual notification logic (e.g., broadcasting to monitoring systems) to differentiate the two actions

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 143:148

Comment:
`"notify"` action only logs warnings, same as `"log"` action - consider implementing actual notification logic (e.g., broadcasting to monitoring systems) to differentiate the two actions

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

- Fix removeChatRun call: use active.sessionId instead of runId as
  first param (fixes cleanup for voice-triggered and non-standard runs)
- Differentiate 'notify' from 'log' action: broadcast stuck-run event
  to connected clients and send to session via node, enabling
  monitoring dashboards and alerting integrations
@CyberSinister
Copy link
Author

Addressed both Greptile review comments in latest push:

  1. Voice-triggered run cleanup — Now calling removeChatRun(sessionId, entry.clientRunId, entry.sessionKey) before abortChatRunById so the registry entry is removed using the correct key (sessionId, not clientRunId). This prevents stuck detection from repeatedly logging/aborting the same run.

  2. Notify vs log differentiation — The notify action now broadcasts a stuck-run event to connected WebSocket clients and subscribed nodes, in addition to logging. This enables monitoring dashboards and node-based alerting to react to stuck runs.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

getHealthVersion,
refreshGatewayHealthSnapshot,
logHealth,
logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logGateway is not defined — ReferenceError at runtime

logGateway does not exist in this scope. The gateway logger is declared as const log = createSubsystemLogger("gateway") at line 85. Other callsites in this file pass logGateway: log (e.g., lines 485, 501), which confirms log is the intended variable. This will throw a ReferenceError the first time the stuck detection sweep triggers a log.

Suggested change
logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) },
logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) },
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server.impl.ts
Line: 433:433

Comment:
**`logGateway` is not defined — ReferenceError at runtime**

`logGateway` does not exist in this scope. The gateway logger is declared as `const log = createSubsystemLogger("gateway")` at line 85. Other callsites in this file pass `logGateway: log` (e.g., lines 485, 501), which confirms `log` is the intended variable. This will throw a `ReferenceError` the first time the stuck detection sweep triggers a log.

```suggestion
    logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) },
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 9e82c33 — changed logGateway to log which is the correct subsystem logger variable in scope (declared at line 85).

Comment on lines +131 to +187
for (const [sessionId, entries] of params.chatRunRegistry.entries()) {
for (const entry of entries) {
const elapsed = now - entry.startedAt;
if (elapsed > thresholdMs) {
const elapsedMin = Math.round(elapsed / 60_000);
const meta = {
sessionId,
sessionKey: entry.sessionKey,
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
};

if (action === "log" || action === "notify") {
params.logStuck?.warn(
`Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
}

if (action === "notify") {
params.broadcast("stuck-run", {
sessionId,
sessionKey: entry.sessionKey,
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
thresholdMinutes: stuckConfig?.thresholdMinutes ?? DEFAULT_STUCK_THRESHOLD_MINUTES,
});
if (entry.sessionKey) {
params.nodeSendToSession(entry.sessionKey, "stuck-run", {
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
});
}
}

if (action === "abort") {
params.logStuck?.warn(
`Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
abortChatRunById(
{
chatAbortControllers: params.chatAbortControllers,
chatRunBuffers: params.chatRunBuffers,
chatDeltaSentAt: params.chatDeltaSentAt,
chatAbortedRuns: params.chatRunState.abortedRuns,
removeChatRun: params.removeChatRun,
agentRunSeq: params.agentRunSeq,
broadcast: params.broadcast,
nodeSendToSession: params.nodeSendToSession,
},
{ runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" },
);
}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the registry while iterating causes skipped entries

When action === "abort", the inner call to abortChatRunByIdremoveChatRun mutates the underlying Map and arrays while the outer/inner for...of loops are still iterating over them. Specifically:

  1. registry.remove() calls queue.splice(idx, 1) on the entries array — the same array the inner loop is iterating — which shifts subsequent elements and causes entries to be skipped.
  2. If the array becomes empty, chatRunSessions.delete(sessionId) removes the map key, which can cause the Map iterator to skip unvisited keys.

Snapshot the data before iterating to avoid this:

Suggested change
for (const [sessionId, entries] of params.chatRunRegistry.entries()) {
for (const entry of entries) {
const elapsed = now - entry.startedAt;
if (elapsed > thresholdMs) {
const elapsedMin = Math.round(elapsed / 60_000);
const meta = {
sessionId,
sessionKey: entry.sessionKey,
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
};
if (action === "log" || action === "notify") {
params.logStuck?.warn(
`Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
}
if (action === "notify") {
params.broadcast("stuck-run", {
sessionId,
sessionKey: entry.sessionKey,
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
thresholdMinutes: stuckConfig?.thresholdMinutes ?? DEFAULT_STUCK_THRESHOLD_MINUTES,
});
if (entry.sessionKey) {
params.nodeSendToSession(entry.sessionKey, "stuck-run", {
clientRunId: entry.clientRunId,
elapsedMinutes: elapsedMin,
});
}
}
if (action === "abort") {
params.logStuck?.warn(
`Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`,
meta,
);
abortChatRunById(
{
chatAbortControllers: params.chatAbortControllers,
chatRunBuffers: params.chatRunBuffers,
chatDeltaSentAt: params.chatDeltaSentAt,
chatAbortedRuns: params.chatRunState.abortedRuns,
removeChatRun: params.removeChatRun,
agentRunSeq: params.agentRunSeq,
broadcast: params.broadcast,
nodeSendToSession: params.nodeSendToSession,
},
{ runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" },
);
}
}
}
}
for (const [sessionId, entries] of [...params.chatRunRegistry.entries()]) {
for (const entry of [...entries]) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 131:187

Comment:
**Mutating the registry while iterating causes skipped entries**

When `action === "abort"`, the inner call to `abortChatRunById``removeChatRun` mutates the underlying `Map` and arrays while the outer/inner `for...of` loops are still iterating over them. Specifically:

1. `registry.remove()` calls `queue.splice(idx, 1)` on the entries array — the same array the inner loop is iterating — which shifts subsequent elements and causes entries to be skipped.
2. If the array becomes empty, `chatRunSessions.delete(sessionId)` removes the map key, which can cause the `Map` iterator to skip unvisited keys.

Snapshot the data before iterating to avoid this:

```suggestion
      for (const [sessionId, entries] of [...params.chatRunRegistry.entries()]) {
        for (const entry of [...entries]) {
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 9e82c33 — now using Array.from() snapshots for both the outer Map iteration and inner array iteration before any mutation occurs. This prevents entries from being skipped when removeChatRun modifies the registry during abort.

CyberSinister pushed a commit to CyberSinister/openclaw that referenced this pull request Feb 17, 2026
… detection

- server.impl.ts: logGateway is not in scope, use local 'log' variable
- server-maintenance.ts: snapshot registry entries before iterating to
  prevent skipped entries when abort mutates the underlying Map/arrays

Addresses Greptile review round 3 feedback on openclaw#16125
@CyberSinister
Copy link
Author

Friendly bump — this has been open for ~2 weeks now. All Greptile feedback has been addressed in the latest push. Would love a maintainer review when you get a chance. Happy to make any further changes needed. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants