Skip to content

fix: add task-level timeout to lane queue to prevent permanent session blocking#48690

Open
kyletabor wants to merge 2 commits into
openclaw:mainfrom
kyletabor:fix/lane-task-timeout
Open

fix: add task-level timeout to lane queue to prevent permanent session blocking#48690
kyletabor wants to merge 2 commits into
openclaw:mainfrom
kyletabor:fix/lane-task-timeout

Conversation

@kyletabor

Copy link
Copy Markdown

Summary

  • Add Promise.race timeout wrapper around await entry.task() in pump() to prevent hung promises from permanently jamming session lanes
  • New CommandLaneTaskTimeoutError error class for timed-out tasks
  • New taskTimeoutMs option on enqueueCommandInLane (default: 5 minutes, pass 0 or Infinity to disable)
  • Diagnostic warning logged when a task times out (lane task timed out: lane=... timeoutMs=...)
  • timer.unref() prevents timeout timers from keeping the process alive

Problem

When an enqueued task's promise never settles (hung upstream API call, dropped WebSocket, unhandled exception), completeTask() never runs, pump() is never called again, and the session lane is permanently blocked. Session lanes use maxConcurrent=1, so one stuck task blocks all future messages for that session. The only recovery is a full gateway restart via SIGUSR1 (resetAllLanes()).

This affects all messaging channels (WhatsApp, Telegram, Discord, webchat) and cron jobs. See #48488 for full root cause analysis with live diagnostic evidence.

Changes

src/process/command-queue.ts:

  • New CommandLaneTaskTimeoutError error class
  • DEFAULT_TASK_TIMEOUT_MS constant (5 minutes)
  • Extended QueueEntry with optional taskTimeoutMs field
  • pump(): race each task against a timeout promise; on timeout, reject the entry, clear activeTaskIds, log warning, and call pump() to unblock
  • Extended enqueueCommandInLane opts with taskTimeoutMs

src/process/command-queue.test.ts:

  • 6 new tests covering: stuck task timeout + lane unblock, fast task completion, custom per-task timeouts, timeout disable (taskTimeoutMs: 0), diagnostic logging, and safe interaction with resetAllLanes generation bumps

Test plan

  • pnpm build passes
  • pnpm test -- src/process/command-queue.test.ts — all 23 tests pass (17 existing + 6 new)
  • pnpm check — lint/format clean
  • Deploy to staging gateway and verify webchat sessions auto-recover from stuck lanes

Closes #48488
Related: #42883, #42960, #42997, #29601

🤖 Generated with Claude Code

…n blocking

When an enqueued task's promise never settles (e.g. hung upstream API
call), the lane is permanently jammed because `pump()` is never called
again. Session lanes use maxConcurrent=1, so one stuck task blocks all
future messages for that session with no automatic recovery — only a
full gateway restart (SIGUSR1) clears the stale state.

Wrap each dequeued task in `Promise.race` against a configurable timeout
(default 5 minutes). When the timeout wins, reject the task with
`CommandLaneTaskTimeoutError`, clean up `activeTaskIds`, log a
diagnostic warning, and call `pump()` to unblock the lane.

Callers can set per-task timeouts via `taskTimeoutMs` on
`enqueueCommandInLane` opts. Pass `0` or `Infinity` to opt out.

Closes openclaw#48488
Related: openclaw#42883, openclaw#42960, openclaw#42997, openclaw#29601

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Mar 17, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a per-task execution timeout to the lane queue's pump() loop using Promise.race, preventing a single hung promise from permanently blocking a session lane. The implementation is well-structured: CommandLaneTaskTimeoutError is a dedicated error class that callers can catch specifically, timer.unref() avoids holding the process open for idle timeout timers, the completeTask/generation-check mechanism correctly prevents double-pumping after resetAllLanes, and both success and failure paths call clearTimeout before proceeding. Six new tests cover the primary scenarios.

Issues found:

  • The diag.warn is emitted for all CommandLaneTaskTimeoutError instances, including stale ones that fire after resetAllLanes() has already recovered the lane. After a SIGUSR1 restart, every pre-reset stuck task will emit a lane task timed out warning even though the lane is healthy — this could mislead on-call engineers. Gating the warn on completedCurrentGeneration (or downgrading it to debug/adding a stale=true field) would prevent false-alarm noise.

  • The active= count in the timeout warn message is logged after completeTask() already removed the task from activeTaskIds, so it always reflects the post-removal count. For maxConcurrent=1 lanes (which is the affected session-lane configuration), this always prints active=0, making it harder to confirm from logs that the task was actually running.

  • taskTimeoutMs: Infinity (the other documented disable path alongside 0) has no dedicated test, though the condition in the implementation is identical to the 0 path and the risk is minimal.

Confidence Score: 4/5

  • Safe to merge — the core timeout mechanism is correct and well-tested; only minor diagnostic-quality issues remain.
  • The implementation correctly handles the critical edge cases: generation mismatches after resetAllLanes, clearTimeout on both success and failure paths, and always-settling callers' promises. The two flagged issues are diagnostic noise (stale-generation warn) and a minor log-field accuracy concern — neither affects runtime correctness. The 6 new tests provide good coverage of the timeout scenarios. Score is 4 rather than 5 due to the stale-timeout warn that could cause misleading alerts in production after a SIGUSR1 restart.
  • The diag.warn stale-generation path in src/process/command-queue.ts (catch block around line 171) warrants a second look before production deployment to avoid false-alarm on-call noise after gateway restarts.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 171-175

Comment:
**Timeout warn fires for stale post-reset timeouts**

`diag.warn` is emitted unconditionally for every `CommandLaneTaskTimeoutError`, even when `completedCurrentGeneration` is `false` (meaning `resetAllLanes()` was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.

The warning should be gated on `completedCurrentGeneration`, or at minimum the log message should indicate that the timeout is stale:

```suggestion
            if (err instanceof CommandLaneTaskTimeoutError) {
              if (completedCurrentGeneration) {
                diag.warn(
                  `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
                );
              } else {
                diag.debug(
                  `lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,
                );
              }
            } else if (!isProbeLane) {
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 172-174

Comment:
**`active=` count is post-removal in timeout warn**

`completeTask(state, taskId, taskGeneration)` is called two lines above this warn, which deletes `taskId` from `state.activeTaskIds`. By the time the log fires, `state.activeTaskIds.size` reflects the count *after* the timed-out task was removed. For a lane with `maxConcurrent=1`, this always prints `active=0`, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: f767196

Comment on lines +171 to +175
if (err instanceof CommandLaneTaskTimeoutError) {
diag.warn(
`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
);
} else if (!isProbeLane) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Timeout warn fires for stale post-reset timeouts

diag.warn is emitted unconditionally for every CommandLaneTaskTimeoutError, even when completedCurrentGeneration is false (meaning resetAllLanes() was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.

The warning should be gated on completedCurrentGeneration, or at minimum the log message should indicate that the timeout is stale:

Suggested change
if (err instanceof CommandLaneTaskTimeoutError) {
diag.warn(
`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
);
} else if (!isProbeLane) {
if (err instanceof CommandLaneTaskTimeoutError) {
if (completedCurrentGeneration) {
diag.warn(
`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
);
} else {
diag.debug(
`lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,
);
}
} else if (!isProbeLane) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 171-175

Comment:
**Timeout warn fires for stale post-reset timeouts**

`diag.warn` is emitted unconditionally for every `CommandLaneTaskTimeoutError`, even when `completedCurrentGeneration` is `false` (meaning `resetAllLanes()` was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.

The warning should be gated on `completedCurrentGeneration`, or at minimum the log message should indicate that the timeout is stale:

```suggestion
            if (err instanceof CommandLaneTaskTimeoutError) {
              if (completedCurrentGeneration) {
                diag.warn(
                  `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
                );
              } else {
                diag.debug(
                  `lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,
                );
              }
            } else if (!isProbeLane) {
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bd6c346. Timeout warn is now gated on completedCurrentGeneration — stale post-reset timeouts downgrade to diag.debug with a (stale, post-reset) marker. Added test assertion for this behavior.

Comment thread src/process/command-queue.ts Outdated
Comment on lines +172 to +174
diag.warn(
`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 active= count is post-removal in timeout warn

completeTask(state, taskId, taskGeneration) is called two lines above this warn, which deletes taskId from state.activeTaskIds. By the time the log fires, state.activeTaskIds.size reflects the count after the timed-out task was removed. For a lane with maxConcurrent=1, this always prints active=0, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 172-174

Comment:
**`active=` count is post-removal in timeout warn**

`completeTask(state, taskId, taskGeneration)` is called two lines above this warn, which deletes `taskId` from `state.activeTaskIds`. By the time the log fires, `state.activeTaskIds.size` reflects the count *after* the timed-out task was removed. For a lane with `maxConcurrent=1`, this always prints `active=0`, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bd6c346. Now captures activeTaskIds.size before completeTask() removal, so the timeout warning reports the pre-removal active count.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f767196ff0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

state.activeTaskIds.add(taskId);
void (async () => {
const startTime = Date.now();
const effectiveTimeout = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Honor existing task runtime budgets before lane timeout

Using entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS introduces a hard 5-minute cutoff for every queued task that does not explicitly override taskTimeoutMs, but key callers (for example runEmbeddedPiAgent) enqueue lane work without that option (src/agents/pi-embedded-runner/run.ts:270-285) and existing runtime policy allows much longer runs (default agent timeout is 600s in src/agents/timeout.ts:3-25, cron jobs can be even longer). This means legitimate runs that take 5–10+ minutes now fail with CommandLaneTaskTimeoutError even though the underlying task keeps executing, which can surface false failures and allow later lane tasks to start while earlier work is still mutating shared session/job state.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in bd6c346 — increased DEFAULT_TASK_TIMEOUT_MS from 5 minutes to 15 minutes. The lane timeout is intentionally a last-resort safety net above the agent-level timeout (default 600s / 10 min in src/agents/timeout.ts), so it must be higher to avoid interfering with legitimate long-running tasks. Callers with even longer expected runtimes can pass taskTimeoutMs: 0 to opt out entirely.

…efault timeout

- Gate timeout diag.warn on completedCurrentGeneration; stale post-reset
  timeouts downgrade to diag.debug to avoid misleading on-call noise
- Capture activeTaskIds.size before completeTask() removal so the
  timeout warning reports the pre-removal active count
- Increase DEFAULT_TASK_TIMEOUT_MS from 5 to 15 minutes — the lane
  timeout is a last-resort safety net above the agent-level timeout
  (default 600s / 10 min), so it must be higher to avoid killing
  legitimate long-running tasks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd6c34610c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +142 to +146
const effectiveTimeout = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS;
let timeoutHandle: ReturnType<typeof setTimeout> | undefined;
try {
const result = await entry.task();
let result: unknown;
if (effectiveTimeout > 0 && effectiveTimeout < Infinity) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve caller timeout budgets before using lane fallback

The new fallback entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS enforces a hard 15-minute limit for every enqueue that does not pass taskTimeoutMs, which now overrides longer/no-timeout policies already defined by callers. Fresh evidence: runEmbeddedPiAgent enqueues via enqueueCommandInLane(..., task, opts) without setting taskTimeoutMs (src/agents/pi-embedded-runner/run.ts:270-285) even though it receives configurable timeoutMs (including --timeout 0 via resolveAgentTimeoutMs), and manual cron runs also enqueue without taskTimeoutMs (src/cron/service/ops.ts:558-578) despite cron's own 60-minute/disabled timeout policy (src/cron/service/timeout-policy.ts:14-25). In those cases, legitimate runs are rejected as CommandLaneTaskTimeoutError at 15 minutes while the underlying work can keep running, which can report false failures and allow later lane tasks to overlap with still-mutating prior work.

Useful? React with 👍 / 👎.

@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-22 09:07 UTC / May 22, 2026, 5:07 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The branch adds CommandLaneTaskTimeoutError, a taskTimeoutMs enqueue option, a 15-minute default task timeout around command-lane execution, and timeout-focused command queue tests.

Reproducibility: yes. for the old failure mode by source inspection: a never-settling task in a maxConcurrent=1 lane holds the active slot and blocks queued work until timeout handling releases it. I did not run tests in this read-only review, but current main has focused coverage for the timeout release and progress-renewal paths.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: Not merge-ready because real behavior proof is missing and the stale branch conflicts with the shipped caller-owned timeout design.

Rank-up moves:

  • Resolve this PR and the linked issue together using v2026.5.18/current-main proof.
  • If continuing the branch, rebase on main and preserve caller-owned timeout plus progress-renewal semantics.
  • Add redacted real gateway, webchat, or staging recovery proof before any renewed merge attempt.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body lists build, unit, and check commands but no after-fix real gateway/webchat recovery proof; redacted logs, terminal output, or a recording are still needed before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • Merging this stale branch over current main could replace caller-owned, progress-aware timeout budgets with a hard queue-wide default.
  • A timed-out lane releases the queue slot while the original task may keep running, so an incorrect default can let later work overlap with still-mutating session or cron state.
  • The PR body lists build, unit, and check commands but no redacted real gateway, webchat, or staging recovery proof for the hung-lane scenario.
  • The linked same-author issue at Lane queue has no task-level timeout — hung promises permanently block session lanes #48488 remains open, so maintainer cleanup should resolve the pair together.

Maintainer options:

  1. Resolve The Paired Cleanup With Shipped Proof (recommended)
    Close or supersede this stale branch together with the linked issue using the v2026.5.18/current-main timeout evidence.
  2. Require A Fresh Caller-Owned Rebase
    If maintainers want to continue the contribution, require a rebase that preserves taskTimeoutMs ownership and taskTimeoutProgressAtMs renewal semantics.
  3. Explicitly Accept The Hard Default
    Only merge a hard default timeout if maintainers intentionally accept the upgrade behavior and have fresh-install plus upgrade proof for long or disabled task budgets.

Next step before merge
Maintainer cleanup should close or supersede this PR together with the linked issue; there is no narrow automated repair because current main already shipped the central behavior and contributor proof is missing.

Security
Cleared: The diff touches in-process queue timing and tests only; I found no concrete security or supply-chain regression.

Review findings

  • [P1] Preserve caller-owned timeout budgets — src/process/command-queue.ts:48
Review details

Best possible solution:

Keep the shipped v2026.5.18 caller-owned, progress-aware queue timeout implementation, then close or supersede this stale PR and the linked issue together unless a concrete current-main reproduction appears.

Do we have a high-confidence way to reproduce the issue?

Yes for the old failure mode by source inspection: a never-settling task in a maxConcurrent=1 lane holds the active slot and blocks queued work until timeout handling releases it. I did not run tests in this read-only review, but current main has focused coverage for the timeout release and progress-renewal paths.

Is this the best way to solve the issue?

No, this stale PR branch is no longer the best path because it enforces a hard queue default. Current main's caller-owned timeout plus progress-renewal design is the safer maintainable solution.

Label justifications:

  • P2: This is a normal-priority queue reliability cleanup with limited remaining action because the central behavior is already shipped.
  • merge-risk: 🚨 compatibility: The PR branch's hard default timeout can change existing long-running or disabled-timeout workflows during upgrade.
  • merge-risk: 🚨 session-state: A queue timeout releases later work while the original task may still mutate session or cron state in the background.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🧂 unranked krab, and Not merge-ready because real behavior proof is missing and the stale branch conflicts with the shipped caller-owned timeout design.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists build, unit, and check commands but no after-fix real gateway/webchat recovery proof; redacted logs, terminal output, or a recording are still needed before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

  • [P1] Preserve caller-owned timeout budgets — src/process/command-queue.ts:48
    Using entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS enforces a hard timeout for every caller, but current main deliberately treats missing/zero/non-finite values as no queue timeout and lets owners pass taskTimeoutMs plus progress renewal. Merging this stale diff would reject legitimate long runs and can let later queue work overlap with a still-running timed-out task.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.88

What I checked:

  • Current main has opt-in, caller-owned queue task timeouts: runQueueEntryTask only races the task when entry.taskTimeoutMs normalizes to a finite positive value; missing, non-finite, zero, and negative values preserve the legacy no-queue-timeout path. (src/process/command-queue.ts:272, b3ec11b05223)
  • Current main keeps embedded-run timeout budgets with progress renewal: Embedded runs derive a lane timeout from the run timeout plus grace and pass taskTimeoutProgressAtMs, so active progress renews the queue watchdog instead of using a hard global default. (src/agents/pi-embedded-runner/run.ts:218, b3ec11b05223)
  • Current tests cover timeout release and progress renewal: The command queue test suite covers a stuck task timing out and draining queued work, plus renewal from progress timestamps and fallback when the progress callback throws. (src/process/command-queue.test.ts:470, b3ec11b05223)
  • PR branch still uses a hard default queue timeout: The PR diff adds DEFAULT_TASK_TIMEOUT_MS and computes entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS, which would enforce a queue timeout even when callers have not opted in. (src/process/command-queue.ts:48, bd6c34610c68)
  • Implementation provenance: Commit 470098b introduced the current CommandLaneTaskTimeoutError, taskTimeoutMs, queue timeout tests, and embedded-run lane timeout wiring. (src/process/command-queue.ts, 470098bd26f3)
  • Progress-aware follow-up provenance: Commit 21c5f8d added taskTimeoutProgressAtMs and threaded run progress into embedded-run lane timeouts. (src/process/command-queue.ts, 21c5f8dc6df1)

Likely related people:

  • steipete: Authored the current shipped queue timeout implementation and the progress-aware follow-up used by embedded runs. (role: current implementation owner; confidence: high; commits: 470098bd26f3, 21c5f8dc6df1, c500e8704f4e; files: src/process/command-queue.ts, src/process/command-queue.test.ts, src/agents/pi-embedded-runner/run.ts)
  • vincentkoc: Recent history credits Vincent in the shipped changelog entry and shows related work on shared command queue/runtime state and diagnostics. (role: runtime queue contributor; confidence: medium; commits: 4ca84acf240d, d262b1c68810, ec1f72b6c58f; files: src/process/command-queue.ts, src/process/command-queue.test.ts, src/shared/global-singleton.ts)
  • galiniliev: Recent queue-adjacent commits changed lane wait diagnostics and session turn priority on the same command queue path. (role: recent adjacent contributor; confidence: medium; commits: 18812bfc039a, 7c151b212bff; files: src/process/command-queue.ts, src/process/command-queue.test.ts, src/agents/pi-embedded-runner/run.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against b3ec11b05223; fix evidence: release v2026.5.18, commit 21c5f8dc6df1.

@clawsweeper clawsweeper Bot added the P2 Normal backlog priority with limited blast radius. label May 17, 2026
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 17, 2026
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 18, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@moeedahmed

Copy link
Copy Markdown
Contributor

Real-world validation from production crash (May 25, 2026)

This fix would have prevented a production outage on my Mac Mini gateway. Here's what happened:

Trigger: A skill-workshop-review cron task fired shortly after a gateway restart. Its bash tool call via Codex app-server sent the request but the response was lost (Codex was still in auth re-warm from the restart). The task's promise never settled.

Chain of failure:

  1. The skill-workshop-review task was enqueued in its session lane
  2. The bash tool call never completed → task promise never resolved/rejected
  3. completeTask() never ran → activeTaskIds entry stayed forever
  4. Lane has maxConcurrent: 1 → all subsequent work was blocked
  5. The diagnostic recovery at diagnostic-stuck-session-recovery.runtime.ts:302-318 skipped recovery because activeCount > 0 on the lane
  6. The stuck item aged to 71 minutes before I force-restarted

Evidence from live logs:

skill-workshop-review-c815151c... idle/tool_call,q=1,age=4272s last=embedded_run:ended
activeTool=bash activeToolCallId=call_U5R5s82eTHs98YeBMw484mK1 activeToolAge=2254s

Note on the generation-mismatch scenario:
There's a related case where completeTask() returns false silently due to generation mismatch (line 266-267 in command-queue.ts), leaving the task ID in activeTaskIds but the task actually finished. This is distinct from the hung-promise case this PR handles — it's a cleanup bug in completeTask(). A stale-lane sweeper that checks idle time on the lane state would catch both cases. But the per-task timeout in this PR covers the hung-promise case which was the direct cause of this outage. This PR alone would have prevented the crash.

@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M stale Marked as stale due to inactivity status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lane queue has no task-level timeout — hung promises permanently block session lanes

2 participants