fix: add task-level timeout to lane queue to prevent permanent session blocking#48690
fix: add task-level timeout to lane queue to prevent permanent session blocking#48690kyletabor wants to merge 2 commits into
Conversation
…n blocking When an enqueued task's promise never settles (e.g. hung upstream API call), the lane is permanently jammed because `pump()` is never called again. Session lanes use maxConcurrent=1, so one stuck task blocks all future messages for that session with no automatic recovery — only a full gateway restart (SIGUSR1) clears the stale state. Wrap each dequeued task in `Promise.race` against a configurable timeout (default 5 minutes). When the timeout wins, reject the task with `CommandLaneTaskTimeoutError`, clean up `activeTaskIds`, log a diagnostic warning, and call `pump()` to unblock the lane. Callers can set per-task timeouts via `taskTimeoutMs` on `enqueueCommandInLane` opts. Pass `0` or `Infinity` to opt out. Closes openclaw#48488 Related: openclaw#42883, openclaw#42960, openclaw#42997, openclaw#29601 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR adds a per-task execution timeout to the lane queue's Issues found:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 171-175
Comment:
**Timeout warn fires for stale post-reset timeouts**
`diag.warn` is emitted unconditionally for every `CommandLaneTaskTimeoutError`, even when `completedCurrentGeneration` is `false` (meaning `resetAllLanes()` was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.
The warning should be gated on `completedCurrentGeneration`, or at minimum the log message should indicate that the timeout is stale:
```suggestion
if (err instanceof CommandLaneTaskTimeoutError) {
if (completedCurrentGeneration) {
diag.warn(
`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
);
} else {
diag.debug(
`lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,
);
}
} else if (!isProbeLane) {
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 172-174
Comment:
**`active=` count is post-removal in timeout warn**
`completeTask(state, taskId, taskGeneration)` is called two lines above this warn, which deletes `taskId` from `state.activeTaskIds`. By the time the log fires, `state.activeTaskIds.size` reflects the count *after* the timed-out task was removed. For a lane with `maxConcurrent=1`, this always prints `active=0`, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: f767196 |
| if (err instanceof CommandLaneTaskTimeoutError) { | ||
| diag.warn( | ||
| `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`, | ||
| ); | ||
| } else if (!isProbeLane) { |
There was a problem hiding this comment.
Timeout warn fires for stale post-reset timeouts
diag.warn is emitted unconditionally for every CommandLaneTaskTimeoutError, even when completedCurrentGeneration is false (meaning resetAllLanes() was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.
The warning should be gated on completedCurrentGeneration, or at minimum the log message should indicate that the timeout is stale:
| if (err instanceof CommandLaneTaskTimeoutError) { | |
| diag.warn( | |
| `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`, | |
| ); | |
| } else if (!isProbeLane) { | |
| if (err instanceof CommandLaneTaskTimeoutError) { | |
| if (completedCurrentGeneration) { | |
| diag.warn( | |
| `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`, | |
| ); | |
| } else { | |
| diag.debug( | |
| `lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`, | |
| ); | |
| } | |
| } else if (!isProbeLane) { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 171-175
Comment:
**Timeout warn fires for stale post-reset timeouts**
`diag.warn` is emitted unconditionally for every `CommandLaneTaskTimeoutError`, even when `completedCurrentGeneration` is `false` (meaning `resetAllLanes()` was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.
The warning should be gated on `completedCurrentGeneration`, or at minimum the log message should indicate that the timeout is stale:
```suggestion
if (err instanceof CommandLaneTaskTimeoutError) {
if (completedCurrentGeneration) {
diag.warn(
`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
);
} else {
diag.debug(
`lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,
);
}
} else if (!isProbeLane) {
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in bd6c346. Timeout warn is now gated on completedCurrentGeneration — stale post-reset timeouts downgrade to diag.debug with a (stale, post-reset) marker. Added test assertion for this behavior.
| diag.warn( | ||
| `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`, | ||
| ); |
There was a problem hiding this comment.
active= count is post-removal in timeout warn
completeTask(state, taskId, taskGeneration) is called two lines above this warn, which deletes taskId from state.activeTaskIds. By the time the log fires, state.activeTaskIds.size reflects the count after the timed-out task was removed. For a lane with maxConcurrent=1, this always prints active=0, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 172-174
Comment:
**`active=` count is post-removal in timeout warn**
`completeTask(state, taskId, taskGeneration)` is called two lines above this warn, which deletes `taskId` from `state.activeTaskIds`. By the time the log fires, `state.activeTaskIds.size` reflects the count *after* the timed-out task was removed. For a lane with `maxConcurrent=1`, this always prints `active=0`, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in bd6c346. Now captures activeTaskIds.size before completeTask() removal, so the timeout warning reports the pre-removal active count.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f767196ff0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| state.activeTaskIds.add(taskId); | ||
| void (async () => { | ||
| const startTime = Date.now(); | ||
| const effectiveTimeout = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS; |
There was a problem hiding this comment.
Honor existing task runtime budgets before lane timeout
Using entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS introduces a hard 5-minute cutoff for every queued task that does not explicitly override taskTimeoutMs, but key callers (for example runEmbeddedPiAgent) enqueue lane work without that option (src/agents/pi-embedded-runner/run.ts:270-285) and existing runtime policy allows much longer runs (default agent timeout is 600s in src/agents/timeout.ts:3-25, cron jobs can be even longer). This means legitimate runs that take 5–10+ minutes now fail with CommandLaneTaskTimeoutError even though the underlying task keeps executing, which can surface false failures and allow later lane tasks to start while earlier work is still mutating shared session/job state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. Fixed in bd6c346 — increased DEFAULT_TASK_TIMEOUT_MS from 5 minutes to 15 minutes. The lane timeout is intentionally a last-resort safety net above the agent-level timeout (default 600s / 10 min in src/agents/timeout.ts), so it must be higher to avoid interfering with legitimate long-running tasks. Callers with even longer expected runtimes can pass taskTimeoutMs: 0 to opt out entirely.
…efault timeout - Gate timeout diag.warn on completedCurrentGeneration; stale post-reset timeouts downgrade to diag.debug to avoid misleading on-call noise - Capture activeTaskIds.size before completeTask() removal so the timeout warning reports the pre-removal active count - Increase DEFAULT_TASK_TIMEOUT_MS from 5 to 15 minutes — the lane timeout is a last-resort safety net above the agent-level timeout (default 600s / 10 min), so it must be higher to avoid killing legitimate long-running tasks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd6c34610c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const effectiveTimeout = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS; | ||
| let timeoutHandle: ReturnType<typeof setTimeout> | undefined; | ||
| try { | ||
| const result = await entry.task(); | ||
| let result: unknown; | ||
| if (effectiveTimeout > 0 && effectiveTimeout < Infinity) { |
There was a problem hiding this comment.
Preserve caller timeout budgets before using lane fallback
The new fallback entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS enforces a hard 15-minute limit for every enqueue that does not pass taskTimeoutMs, which now overrides longer/no-timeout policies already defined by callers. Fresh evidence: runEmbeddedPiAgent enqueues via enqueueCommandInLane(..., task, opts) without setting taskTimeoutMs (src/agents/pi-embedded-runner/run.ts:270-285) even though it receives configurable timeoutMs (including --timeout 0 via resolveAgentTimeoutMs), and manual cron runs also enqueue without taskTimeoutMs (src/cron/service/ops.ts:558-578) despite cron's own 60-minute/disabled timeout policy (src/cron/service/timeout-policy.ts:14-25). In those cases, legitimate runs are rejected as CommandLaneTaskTimeoutError at 15 minutes while the underlying work can keep running, which can report false failures and allow later lane tasks to overlap with still-mutating prior work.
Useful? React with 👍 / 👎.
|
Codex review: needs real behavior proof before merge. Latest ClawSweeper review: 2026-05-22 09:07 UTC / May 22, 2026, 5:07 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. for the old failure mode by source inspection: a never-settling task in a PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Keep the shipped v2026.5.18 caller-owned, progress-aware queue timeout implementation, then close or supersede this stale PR and the linked issue together unless a concrete current-main reproduction appears. Do we have a high-confidence way to reproduce the issue? Yes for the old failure mode by source inspection: a never-settling task in a Is this the best way to solve the issue? No, this stale PR branch is no longer the best path because it enforces a hard queue default. Current main's caller-owned timeout plus progress-renewal design is the safer maintainable solution. Label justifications:
Full review comments:
Overall correctness: patch is incorrect What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against b3ec11b05223; fix evidence: release v2026.5.18, commit 21c5f8dc6df1. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
Real-world validation from production crash (May 25, 2026)This fix would have prevented a production outage on my Mac Mini gateway. Here's what happened: Trigger: A Chain of failure:
Evidence from live logs: Note on the generation-mismatch scenario: |
|
This pull request has been automatically marked as stale due to inactivity. |
Summary
Promise.racetimeout wrapper aroundawait entry.task()inpump()to prevent hung promises from permanently jamming session lanesCommandLaneTaskTimeoutErrorerror class for timed-out taskstaskTimeoutMsoption onenqueueCommandInLane(default: 5 minutes, pass0orInfinityto disable)lane task timed out: lane=... timeoutMs=...)timer.unref()prevents timeout timers from keeping the process aliveProblem
When an enqueued task's promise never settles (hung upstream API call, dropped WebSocket, unhandled exception),
completeTask()never runs,pump()is never called again, and the session lane is permanently blocked. Session lanes usemaxConcurrent=1, so one stuck task blocks all future messages for that session. The only recovery is a full gateway restart via SIGUSR1 (resetAllLanes()).This affects all messaging channels (WhatsApp, Telegram, Discord, webchat) and cron jobs. See #48488 for full root cause analysis with live diagnostic evidence.
Changes
src/process/command-queue.ts:CommandLaneTaskTimeoutErrorerror classDEFAULT_TASK_TIMEOUT_MSconstant (5 minutes)QueueEntrywith optionaltaskTimeoutMsfieldpump(): race each task against a timeout promise; on timeout, reject the entry, clearactiveTaskIds, log warning, and callpump()to unblockenqueueCommandInLaneopts withtaskTimeoutMssrc/process/command-queue.test.ts:taskTimeoutMs: 0), diagnostic logging, and safe interaction withresetAllLanesgeneration bumpsTest plan
pnpm buildpassespnpm test -- src/process/command-queue.test.ts— all 23 tests pass (17 existing + 6 new)pnpm check— lint/format cleanCloses #48488
Related: #42883, #42960, #42997, #29601
🤖 Generated with Claude Code