feat(gateway): add stuck session detection#16125
feat(gateway): add stuck session detection#16125CyberSinister wants to merge 3 commits intoopenclaw:mainfrom
Conversation
- Add GatewayStuckDetectionConfig to config types - Extend ChatRunEntry with startedAt timestamp for tracking - Add entries() and size() methods to ChatRunRegistry - Add stuck detection sweep in maintenance timer (60s interval) - Configurable threshold (default 5 min) and action (log/notify/abort) - Wire up in server.impl.ts with stuckDetection config Closes: session reliability improvements
| for (const [sessionId, entries] of params.chatRunRegistry.entries()) { | ||
| for (const entry of entries) { | ||
| const elapsed = now - entry.startedAt; | ||
| if (elapsed > thresholdMs) { | ||
| const elapsedMin = Math.round(elapsed / 60_000); | ||
| const meta = { | ||
| sessionId, | ||
| sessionKey: entry.sessionKey, | ||
| clientRunId: entry.clientRunId, | ||
| elapsedMinutes: elapsedMin, | ||
| }; | ||
|
|
||
| if (action === "log" || action === "notify") { | ||
| params.logStuck?.warn( | ||
| `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`, | ||
| meta, | ||
| ); | ||
| } | ||
|
|
||
| if (action === "abort") { | ||
| params.logStuck?.warn( | ||
| `Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`, | ||
| meta, | ||
| ); | ||
| abortChatRunById( | ||
| { | ||
| chatAbortControllers: params.chatAbortControllers, | ||
| chatRunBuffers: params.chatRunBuffers, | ||
| chatDeltaSentAt: params.chatDeltaSentAt, | ||
| chatAbortedRuns: params.chatRunState.abortedRuns, | ||
| removeChatRun: params.removeChatRun, | ||
| agentRunSeq: params.agentRunSeq, | ||
| broadcast: params.broadcast, | ||
| nodeSendToSession: params.nodeSendToSession, | ||
| }, | ||
| { runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" }, | ||
| ); |
There was a problem hiding this comment.
stuck run cleanup will fail for voice-triggered runs
the loop iterates [sessionId, entries] from the registry, but when calling abortChatRunById only entry.clientRunId is passed. abortChatRunById then calls removeChatRun(runId, runId, sessionKey) which looks up the registry by runId as the first parameter.
for voice-triggered runs (see server-node-events.ts:60), the registry key is sessionId but clientRunId is different. when abort tries to remove using clientRunId as the lookup key, it won't find the entry.
the stuck detection will log/abort repeatedly every 60s for the same stuck run because it's never removed from the registry.
fix: call params.removeChatRun(sessionId, entry.clientRunId, entry.sessionKey) directly before abortChatRunById to ensure cleanup uses the correct registry key.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 131:167
Comment:
stuck run cleanup will fail for voice-triggered runs
the loop iterates `[sessionId, entries]` from the registry, but when calling `abortChatRunById` only `entry.clientRunId` is passed. `abortChatRunById` then calls `removeChatRun(runId, runId, sessionKey)` which looks up the registry by `runId` as the first parameter.
for voice-triggered runs (see `server-node-events.ts:60`), the registry key is `sessionId` but `clientRunId` is different. when abort tries to remove using `clientRunId` as the lookup key, it won't find the entry.
the stuck detection will log/abort repeatedly every 60s for the same stuck run because it's never removed from the registry.
fix: call `params.removeChatRun(sessionId, entry.clientRunId, entry.sessionKey)` directly before `abortChatRunById` to ensure cleanup uses the correct registry key.
How can I resolve this? If you propose a fix, please make it concise.| if (action === "log" || action === "notify") { | ||
| params.logStuck?.warn( | ||
| `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`, | ||
| meta, | ||
| ); | ||
| } |
There was a problem hiding this comment.
"notify" action only logs warnings, same as "log" action - consider implementing actual notification logic (e.g., broadcasting to monitoring systems) to differentiate the two actions
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 143:148
Comment:
`"notify"` action only logs warnings, same as `"log"` action - consider implementing actual notification logic (e.g., broadcasting to monitoring systems) to differentiate the two actions
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Fix removeChatRun call: use active.sessionId instead of runId as first param (fixes cleanup for voice-triggered and non-standard runs) - Differentiate 'notify' from 'log' action: broadcast stuck-run event to connected clients and send to session via node, enabling monitoring dashboards and alerting integrations
|
Addressed both Greptile review comments in latest push:
|
bfc1ccb to
f92900f
Compare
| getHealthVersion, | ||
| refreshGatewayHealthSnapshot, | ||
| logHealth, | ||
| logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) }, |
There was a problem hiding this comment.
logGateway is not defined — ReferenceError at runtime
logGateway does not exist in this scope. The gateway logger is declared as const log = createSubsystemLogger("gateway") at line 85. Other callsites in this file pass logGateway: log (e.g., lines 485, 501), which confirms log is the intended variable. This will throw a ReferenceError the first time the stuck detection sweep triggers a log.
| logStuck: { warn: (msg, meta) => logGateway.warn({ ...meta }, msg) }, | |
| logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) }, |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server.impl.ts
Line: 433:433
Comment:
**`logGateway` is not defined — ReferenceError at runtime**
`logGateway` does not exist in this scope. The gateway logger is declared as `const log = createSubsystemLogger("gateway")` at line 85. Other callsites in this file pass `logGateway: log` (e.g., lines 485, 501), which confirms `log` is the intended variable. This will throw a `ReferenceError` the first time the stuck detection sweep triggers a log.
```suggestion
logStuck: { warn: (msg, meta) => log.warn({ ...meta }, msg) },
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in commit 9e82c33 — changed logGateway to log which is the correct subsystem logger variable in scope (declared at line 85).
| for (const [sessionId, entries] of params.chatRunRegistry.entries()) { | ||
| for (const entry of entries) { | ||
| const elapsed = now - entry.startedAt; | ||
| if (elapsed > thresholdMs) { | ||
| const elapsedMin = Math.round(elapsed / 60_000); | ||
| const meta = { | ||
| sessionId, | ||
| sessionKey: entry.sessionKey, | ||
| clientRunId: entry.clientRunId, | ||
| elapsedMinutes: elapsedMin, | ||
| }; | ||
|
|
||
| if (action === "log" || action === "notify") { | ||
| params.logStuck?.warn( | ||
| `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`, | ||
| meta, | ||
| ); | ||
| } | ||
|
|
||
| if (action === "notify") { | ||
| params.broadcast("stuck-run", { | ||
| sessionId, | ||
| sessionKey: entry.sessionKey, | ||
| clientRunId: entry.clientRunId, | ||
| elapsedMinutes: elapsedMin, | ||
| thresholdMinutes: stuckConfig?.thresholdMinutes ?? DEFAULT_STUCK_THRESHOLD_MINUTES, | ||
| }); | ||
| if (entry.sessionKey) { | ||
| params.nodeSendToSession(entry.sessionKey, "stuck-run", { | ||
| clientRunId: entry.clientRunId, | ||
| elapsedMinutes: elapsedMin, | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| if (action === "abort") { | ||
| params.logStuck?.warn( | ||
| `Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`, | ||
| meta, | ||
| ); | ||
| abortChatRunById( | ||
| { | ||
| chatAbortControllers: params.chatAbortControllers, | ||
| chatRunBuffers: params.chatRunBuffers, | ||
| chatDeltaSentAt: params.chatDeltaSentAt, | ||
| chatAbortedRuns: params.chatRunState.abortedRuns, | ||
| removeChatRun: params.removeChatRun, | ||
| agentRunSeq: params.agentRunSeq, | ||
| broadcast: params.broadcast, | ||
| nodeSendToSession: params.nodeSendToSession, | ||
| }, | ||
| { runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" }, | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Mutating the registry while iterating causes skipped entries
When action === "abort", the inner call to abortChatRunById → removeChatRun mutates the underlying Map and arrays while the outer/inner for...of loops are still iterating over them. Specifically:
registry.remove()callsqueue.splice(idx, 1)on the entries array — the same array the inner loop is iterating — which shifts subsequent elements and causes entries to be skipped.- If the array becomes empty,
chatRunSessions.delete(sessionId)removes the map key, which can cause theMapiterator to skip unvisited keys.
Snapshot the data before iterating to avoid this:
| for (const [sessionId, entries] of params.chatRunRegistry.entries()) { | |
| for (const entry of entries) { | |
| const elapsed = now - entry.startedAt; | |
| if (elapsed > thresholdMs) { | |
| const elapsedMin = Math.round(elapsed / 60_000); | |
| const meta = { | |
| sessionId, | |
| sessionKey: entry.sessionKey, | |
| clientRunId: entry.clientRunId, | |
| elapsedMinutes: elapsedMin, | |
| }; | |
| if (action === "log" || action === "notify") { | |
| params.logStuck?.warn( | |
| `Stuck run detected: ${entry.clientRunId} (${elapsedMin}m)`, | |
| meta, | |
| ); | |
| } | |
| if (action === "notify") { | |
| params.broadcast("stuck-run", { | |
| sessionId, | |
| sessionKey: entry.sessionKey, | |
| clientRunId: entry.clientRunId, | |
| elapsedMinutes: elapsedMin, | |
| thresholdMinutes: stuckConfig?.thresholdMinutes ?? DEFAULT_STUCK_THRESHOLD_MINUTES, | |
| }); | |
| if (entry.sessionKey) { | |
| params.nodeSendToSession(entry.sessionKey, "stuck-run", { | |
| clientRunId: entry.clientRunId, | |
| elapsedMinutes: elapsedMin, | |
| }); | |
| } | |
| } | |
| if (action === "abort") { | |
| params.logStuck?.warn( | |
| `Aborting stuck run: ${entry.clientRunId} (${elapsedMin}m)`, | |
| meta, | |
| ); | |
| abortChatRunById( | |
| { | |
| chatAbortControllers: params.chatAbortControllers, | |
| chatRunBuffers: params.chatRunBuffers, | |
| chatDeltaSentAt: params.chatDeltaSentAt, | |
| chatAbortedRuns: params.chatRunState.abortedRuns, | |
| removeChatRun: params.removeChatRun, | |
| agentRunSeq: params.agentRunSeq, | |
| broadcast: params.broadcast, | |
| nodeSendToSession: params.nodeSendToSession, | |
| }, | |
| { runId: entry.clientRunId, sessionKey: entry.sessionKey, stopReason: "stuck" }, | |
| ); | |
| } | |
| } | |
| } | |
| } | |
| for (const [sessionId, entries] of [...params.chatRunRegistry.entries()]) { | |
| for (const entry of [...entries]) { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-maintenance.ts
Line: 131:187
Comment:
**Mutating the registry while iterating causes skipped entries**
When `action === "abort"`, the inner call to `abortChatRunById` → `removeChatRun` mutates the underlying `Map` and arrays while the outer/inner `for...of` loops are still iterating over them. Specifically:
1. `registry.remove()` calls `queue.splice(idx, 1)` on the entries array — the same array the inner loop is iterating — which shifts subsequent elements and causes entries to be skipped.
2. If the array becomes empty, `chatRunSessions.delete(sessionId)` removes the map key, which can cause the `Map` iterator to skip unvisited keys.
Snapshot the data before iterating to avoid this:
```suggestion
for (const [sessionId, entries] of [...params.chatRunRegistry.entries()]) {
for (const entry of [...entries]) {
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in commit 9e82c33 — now using Array.from() snapshots for both the outer Map iteration and inner array iteration before any mutation occurs. This prevents entries from being skipped when removeChatRun modifies the registry during abort.
… detection - server.impl.ts: logGateway is not in scope, use local 'log' variable - server-maintenance.ts: snapshot registry entries before iterating to prevent skipped entries when abort mutates the underlying Map/arrays Addresses Greptile review round 3 feedback on openclaw#16125
|
Friendly bump — this has been open for ~2 weeks now. All Greptile feedback has been addressed in the latest push. Would love a maintainer review when you get a chance. Happy to make any further changes needed. 🙏 |
Summary
Adds configurable stuck session detection to the gateway. Sessions that exceed a configurable time threshold are detected and handled according to the configured action.
Problem
Long-running agent sessions can get stuck due to network issues, API hangs, or infinite loops. Currently there's no mechanism to detect or recover from these situations automatically.
Solution
stuckDetectionconfig to gateway optionsstartedAttimestamp on chat run entrieslog,notify, orabort)entries()andsize()methods to ChatRunRegistry for introspectionConfiguration
{ "gateway": { "stuckDetection": { "enabled": true, "thresholdMinutes": 5, "action": "abort" } } }Changes
src/config/types.gateway.ts— AddGatewayStuckDetectionConfigtypesrc/config/zod-schema.ts— Add zod validation schemasrc/gateway/server-chat.ts— TrackstartedAt, add registry methodssrc/gateway/server-maintenance.ts— Stuck detection sweep logicsrc/gateway/server.impl.ts— Wire up configTesting
Existing tests updated to accommodate new interface changes. Stuck detection sweep runs within the existing maintenance timer infrastructure.
Greptile Summary
Adds configurable stuck session detection to the gateway, sweeping every 60 seconds and supporting
log,notify, andabortactions. Also includes a correct bug fix inchat-abort.tswhereremoveChatRunwas called withrunIdinstead ofsessionId.server.impl.ts:433referenceslogGatewaywhich is not defined in scope — will throw aReferenceErrorthe first time the stuck detection sweep triggers. Should belog.server-maintenance.ts:131-187iterates over liveMapandArrayiterators while theabortaction mutates them viaremoveChatRun, causing entries to be skipped. Snapshot the iterables before the loop.chat-abort.ts: corrects the first argument toremoveChatRunfromrunIdtoactive.sessionId.Confidence Score: 2/5
logGatewayReferenceError) and a mutation-during-iteration bug that must be fixed before merging.src/gateway/server.impl.ts(undefinedlogGatewayreference) andsrc/gateway/server-maintenance.ts(mutation during iteration in abort path)Last reviewed commit: 7cf7f98