fix: add task-level timeout to lane queue to prevent permanent session blocking by kyletabor · Pull Request #48690 · openclaw/openclaw

kyletabor · 2026-03-17T03:10:03Z

Summary

Add Promise.race timeout wrapper around await entry.task() in pump() to prevent hung promises from permanently jamming session lanes
New CommandLaneTaskTimeoutError error class for timed-out tasks
New taskTimeoutMs option on enqueueCommandInLane (default: 5 minutes, pass 0 or Infinity to disable)
Diagnostic warning logged when a task times out (lane task timed out: lane=... timeoutMs=...)
timer.unref() prevents timeout timers from keeping the process alive

Problem

When an enqueued task's promise never settles (hung upstream API call, dropped WebSocket, unhandled exception), completeTask() never runs, pump() is never called again, and the session lane is permanently blocked. Session lanes use maxConcurrent=1, so one stuck task blocks all future messages for that session. The only recovery is a full gateway restart via SIGUSR1 (resetAllLanes()).

This affects all messaging channels (WhatsApp, Telegram, Discord, webchat) and cron jobs. See #48488 for full root cause analysis with live diagnostic evidence.

Changes

src/process/command-queue.ts:

New CommandLaneTaskTimeoutError error class
DEFAULT_TASK_TIMEOUT_MS constant (5 minutes)
Extended QueueEntry with optional taskTimeoutMs field
pump(): race each task against a timeout promise; on timeout, reject the entry, clear activeTaskIds, log warning, and call pump() to unblock
Extended enqueueCommandInLane opts with taskTimeoutMs

src/process/command-queue.test.ts:

6 new tests covering: stuck task timeout + lane unblock, fast task completion, custom per-task timeouts, timeout disable (taskTimeoutMs: 0), diagnostic logging, and safe interaction with resetAllLanes generation bumps

Test plan

pnpm build passes
pnpm test -- src/process/command-queue.test.ts — all 23 tests pass (17 existing + 6 new)
pnpm check — lint/format clean
Deploy to staging gateway and verify webchat sessions auto-recover from stuck lanes

Closes #48488
Related: #42883, #42960, #42997, #29601

🤖 Generated with Claude Code

…n blocking When an enqueued task's promise never settles (e.g. hung upstream API call), the lane is permanently jammed because `pump()` is never called again. Session lanes use maxConcurrent=1, so one stuck task blocks all future messages for that session with no automatic recovery — only a full gateway restart (SIGUSR1) clears the stale state. Wrap each dequeued task in `Promise.race` against a configurable timeout (default 5 minutes). When the timeout wins, reject the task with `CommandLaneTaskTimeoutError`, clean up `activeTaskIds`, log a diagnostic warning, and call `pump()` to unblock the lane. Callers can set per-task timeouts via `taskTimeoutMs` on `enqueueCommandInLane` opts. Pass `0` or `Infinity` to opt out. Closes openclaw#48488 Related: openclaw#42883, openclaw#42960, openclaw#42997, openclaw#29601 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-17T03:14:18Z

Greptile Summary

This PR adds a per-task execution timeout to the lane queue's pump() loop using Promise.race, preventing a single hung promise from permanently blocking a session lane. The implementation is well-structured: CommandLaneTaskTimeoutError is a dedicated error class that callers can catch specifically, timer.unref() avoids holding the process open for idle timeout timers, the completeTask/generation-check mechanism correctly prevents double-pumping after resetAllLanes, and both success and failure paths call clearTimeout before proceeding. Six new tests cover the primary scenarios.

Issues found:

The diag.warn is emitted for all CommandLaneTaskTimeoutError instances, including stale ones that fire after resetAllLanes() has already recovered the lane. After a SIGUSR1 restart, every pre-reset stuck task will emit a lane task timed out warning even though the lane is healthy — this could mislead on-call engineers. Gating the warn on completedCurrentGeneration (or downgrading it to debug/adding a stale=true field) would prevent false-alarm noise.
The active= count in the timeout warn message is logged after completeTask() already removed the task from activeTaskIds, so it always reflects the post-removal count. For maxConcurrent=1 lanes (which is the affected session-lane configuration), this always prints active=0, making it harder to confirm from logs that the task was actually running.
taskTimeoutMs: Infinity (the other documented disable path alongside 0) has no dedicated test, though the condition in the implementation is identical to the 0 path and the risk is minimal.

Confidence Score: 4/5

Safe to merge — the core timeout mechanism is correct and well-tested; only minor diagnostic-quality issues remain.
The implementation correctly handles the critical edge cases: generation mismatches after resetAllLanes, clearTimeout on both success and failure paths, and always-settling callers' promises. The two flagged issues are diagnostic noise (stale-generation warn) and a minor log-field accuracy concern — neither affects runtime correctness. The 6 new tests provide good coverage of the timeout scenarios. Score is 4 rather than 5 due to the stale-timeout warn that could cause misleading alerts in production after a SIGUSR1 restart.
The diag.warn stale-generation path in src/process/command-queue.ts (catch block around line 171) warrants a second look before production deployment to avoid false-alarm on-call noise after gateway restarts.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 171-175

Comment:
**Timeout warn fires for stale post-reset timeouts**

`diag.warn` is emitted unconditionally for every `CommandLaneTaskTimeoutError`, even when `completedCurrentGeneration` is `false` (meaning `resetAllLanes()` was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.

The warning should be gated on `completedCurrentGeneration`, or at minimum the log message should indicate that the timeout is stale:

```suggestion
            if (err instanceof CommandLaneTaskTimeoutError) {
              if (completedCurrentGeneration) {
                diag.warn(
                  `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
                );
              } else {
                diag.debug(
                  `lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,
                );
              }
            } else if (!isProbeLane) {
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/process/command-queue.ts
Line: 172-174

Comment:
**`active=` count is post-removal in timeout warn**

`completeTask(state, taskId, taskGeneration)` is called two lines above this warn, which deletes `taskId` from `state.activeTaskIds`. By the time the log fires, `state.activeTaskIds.size` reflects the count *after* the timed-out task was removed. For a lane with `maxConcurrent=1`, this always prints `active=0`, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: f767196}

greptile-apps · 2026-03-17T03:14:22Z

+            if (err instanceof CommandLaneTaskTimeoutError) {
+              diag.warn(
+                `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
+              );
+            } else if (!isProbeLane) {


Timeout warn fires for stale post-reset timeouts

diag.warn is emitted unconditionally for every CommandLaneTaskTimeoutError, even when completedCurrentGeneration is false (meaning resetAllLanes() was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered.

The warning should be gated on completedCurrentGeneration, or at minimum the log message should indicate that the timeout is stale:

Suggested change

if (err instanceof CommandLaneTaskTimeoutError) {

diag.warn(

`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,

);

} else if (!isProbeLane) {

if (err instanceof CommandLaneTaskTimeoutError) {

if (completedCurrentGeneration) {

diag.warn(

`lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,

);

} else {

diag.debug(

`lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`,

);

}

} else if (!isProbeLane) {

Prompt To Fix With AI

This is a comment left during a code review. Path: src/process/command-queue.ts Line: 171-175 Comment: **Timeout warn fires for stale post-reset timeouts** `diag.warn` is emitted unconditionally for every `CommandLaneTaskTimeoutError`, even when `completedCurrentGeneration` is `false` (meaning `resetAllLanes()` was already called and the lane has been fully recovered). In production this produces warning noise after a SIGUSR1 restart for every task that had a pending timeout at the time of reset, which could mislead on-call engineers into thinking the lane is still in trouble when it already recovered. The warning should be gated on `completedCurrentGeneration`, or at minimum the log message should indicate that the timeout is stale: ```suggestion if (err instanceof CommandLaneTaskTimeoutError) { if (completedCurrentGeneration) { diag.warn( `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`, ); } else { diag.debug( `lane task timed out (stale, post-reset): lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime}`, ); } } else if (!isProbeLane) { ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed in bd6c346. Timeout warn is now gated on completedCurrentGeneration — stale post-reset timeouts downgrade to diag.debug with a (stale, post-reset) marker. Added test assertion for this behavior.

greptile-apps · 2026-03-17T03:14:23Z

+              diag.warn(
+                `lane task timed out: lane=${lane} timeoutMs=${effectiveTimeout} durationMs=${Date.now() - startTime} active=${state.activeTaskIds.size} queued=${state.queue.length}`,
+              );


active= count is post-removal in timeout warn

completeTask(state, taskId, taskGeneration) is called two lines above this warn, which deletes taskId from state.activeTaskIds. By the time the log fires, state.activeTaskIds.size reflects the count after the timed-out task was removed. For a lane with maxConcurrent=1, this always prints active=0, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/process/command-queue.ts Line: 172-174 Comment: **`active=` count is post-removal in timeout warn** `completeTask(state, taskId, taskGeneration)` is called two lines above this warn, which deletes `taskId` from `state.activeTaskIds`. By the time the log fires, `state.activeTaskIds.size` reflects the count *after* the timed-out task was removed. For a lane with `maxConcurrent=1`, this always prints `active=0`, making it hard to confirm the task was actually running. The successful-task debug log at line 160 has the same pattern, but since the timeout case is the anomaly you're diagnosing, capturing the pre-removal count (or noting that the reported value is post-removal) would make the message more useful. How can I resolve this? If you propose a fix, please make it concise.

Fixed in bd6c346. Now captures activeTaskIds.size before completeTask() removal, so the timeout warning reports the pre-removal active count.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f767196ff0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-17T03:16:41Z

        state.activeTaskIds.add(taskId);
        void (async () => {
          const startTime = Date.now();
+          const effectiveTimeout = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS;


Honor existing task runtime budgets before lane timeout

Using entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS introduces a hard 5-minute cutoff for every queued task that does not explicitly override taskTimeoutMs, but key callers (for example runEmbeddedPiAgent) enqueue lane work without that option (src/agents/pi-embedded-runner/run.ts:270-285) and existing runtime policy allows much longer runs (default agent timeout is 600s in src/agents/timeout.ts:3-25, cron jobs can be even longer). This means legitimate runs that take 5–10+ minutes now fail with CommandLaneTaskTimeoutError even though the underlying task keeps executing, which can surface false failures and allow later lane tasks to start while earlier work is still mutating shared session/job state.

Useful? React with 👍 / 👎.

Good catch. Fixed in bd6c346 — increased DEFAULT_TASK_TIMEOUT_MS from 5 minutes to 15 minutes. The lane timeout is intentionally a last-resort safety net above the agent-level timeout (default 600s / 10 min in src/agents/timeout.ts), so it must be higher to avoid interfering with legitimate long-running tasks. Callers with even longer expected runtimes can pass taskTimeoutMs: 0 to opt out entirely.

…efault timeout - Gate timeout diag.warn on completedCurrentGeneration; stale post-reset timeouts downgrade to diag.debug to avoid misleading on-call noise - Capture activeTaskIds.size before completeTask() removal so the timeout warning reports the pre-removal active count - Increase DEFAULT_TASK_TIMEOUT_MS from 5 to 15 minutes — the lane timeout is a last-resort safety net above the agent-level timeout (default 600s / 10 min), so it must be higher to avoid killing legitimate long-running tasks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd6c34610c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-17T13:28:04Z

+          const effectiveTimeout = entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS;
+          let timeoutHandle: ReturnType<typeof setTimeout> | undefined;
          try {
-            const result = await entry.task();
+            let result: unknown;
+            if (effectiveTimeout > 0 && effectiveTimeout < Infinity) {


Preserve caller timeout budgets before using lane fallback

The new fallback entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS enforces a hard 15-minute limit for every enqueue that does not pass taskTimeoutMs, which now overrides longer/no-timeout policies already defined by callers. Fresh evidence: runEmbeddedPiAgent enqueues via enqueueCommandInLane(..., task, opts) without setting taskTimeoutMs (src/agents/pi-embedded-runner/run.ts:270-285) even though it receives configurable timeoutMs (including --timeout 0 via resolveAgentTimeoutMs), and manual cron runs also enqueue without taskTimeoutMs (src/cron/service/ops.ts:558-578) despite cron's own 60-minute/disabled timeout policy (src/cron/service/timeout-policy.ts:14-25). In those cases, legitimate runs are rejected as CommandLaneTaskTimeoutError at 15 minutes while the underlying work can keep running, which can report false failures and allow later lane tasks to overlap with still-mutating prior work.

Useful? React with 👍 / 👎.

clawsweeper · 2026-04-28T11:57:43Z

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-22 09:07 UTC / May 22, 2026, 5:07 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The branch adds CommandLaneTaskTimeoutError, a taskTimeoutMs enqueue option, a 15-minute default task timeout around command-lane execution, and timeout-focused command queue tests.

Reproducibility: yes. for the old failure mode by source inspection: a never-settling task in a maxConcurrent=1 lane holds the active slot and blocks queued work until timeout handling releases it. I did not run tests in this read-only review, but current main has focused coverage for the timeout release and progress-renewal paths.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: Not merge-ready because real behavior proof is missing and the stale branch conflicts with the shipped caller-owned timeout design.

Rank-up moves:

Resolve this PR and the linked issue together using v2026.5.18/current-main proof.
If continuing the branch, rebase on main and preserve caller-owned timeout plus progress-renewal semantics.
Add redacted real gateway, webchat, or staging recovery proof before any renewed merge attempt.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body lists build, unit, and check commands but no after-fix real gateway/webchat recovery proof; redacted logs, terminal output, or a recording are still needed before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

Merging this stale branch over current main could replace caller-owned, progress-aware timeout budgets with a hard queue-wide default.
A timed-out lane releases the queue slot while the original task may keep running, so an incorrect default can let later work overlap with still-mutating session or cron state.
The PR body lists build, unit, and check commands but no redacted real gateway, webchat, or staging recovery proof for the hung-lane scenario.
The linked same-author issue at Lane queue has no task-level timeout — hung promises permanently block session lanes #48488 remains open, so maintainer cleanup should resolve the pair together.

Maintainer options:

Resolve The Paired Cleanup With Shipped Proof (recommended)
Close or supersede this stale branch together with the linked issue using the v2026.5.18/current-main timeout evidence.
Require A Fresh Caller-Owned Rebase
If maintainers want to continue the contribution, require a rebase that preserves taskTimeoutMs ownership and taskTimeoutProgressAtMs renewal semantics.
Explicitly Accept The Hard Default
Only merge a hard default timeout if maintainers intentionally accept the upgrade behavior and have fresh-install plus upgrade proof for long or disabled task budgets.

Next step before merge
Maintainer cleanup should close or supersede this PR together with the linked issue; there is no narrow automated repair because current main already shipped the central behavior and contributor proof is missing.

Security
Cleared: The diff touches in-process queue timing and tests only; I found no concrete security or supply-chain regression.

Review findings

[P1] Preserve caller-owned timeout budgets — src/process/command-queue.ts:48

Review details

Best possible solution:

Keep the shipped v2026.5.18 caller-owned, progress-aware queue timeout implementation, then close or supersede this stale PR and the linked issue together unless a concrete current-main reproduction appears.

Do we have a high-confidence way to reproduce the issue?

Yes for the old failure mode by source inspection: a never-settling task in a maxConcurrent=1 lane holds the active slot and blocks queued work until timeout handling releases it. I did not run tests in this read-only review, but current main has focused coverage for the timeout release and progress-renewal paths.

Is this the best way to solve the issue?

No, this stale PR branch is no longer the best path because it enforces a hard queue default. Current main's caller-owned timeout plus progress-renewal design is the safer maintainable solution.

Label justifications:

P2: This is a normal-priority queue reliability cleanup with limited remaining action because the central behavior is already shipped.
merge-risk: 🚨 compatibility: The PR branch's hard default timeout can change existing long-running or disabled-timeout workflows during upgrade.
merge-risk: 🚨 session-state: A queue timeout releases later work while the original task may still mutate session or cron state in the background.
rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🧂 unranked krab, and Not merge-ready because real behavior proof is missing and the stale branch conflicts with the shipped caller-owned timeout design.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists build, unit, and check commands but no after-fix real gateway/webchat recovery proof; redacted logs, terminal output, or a recording are still needed before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

[P1] Preserve caller-owned timeout budgets — src/process/command-queue.ts:48
Using entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS enforces a hard timeout for every caller, but current main deliberately treats missing/zero/non-finite values as no queue timeout and lets owners pass taskTimeoutMs plus progress renewal. Merging this stale diff would reject legitimate long runs and can let later queue work overlap with a still-running timed-out task.
Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.88

What I checked:

Current main has opt-in, caller-owned queue task timeouts: runQueueEntryTask only races the task when entry.taskTimeoutMs normalizes to a finite positive value; missing, non-finite, zero, and negative values preserve the legacy no-queue-timeout path. (src/process/command-queue.ts:272, b3ec11b05223)
Current main keeps embedded-run timeout budgets with progress renewal: Embedded runs derive a lane timeout from the run timeout plus grace and pass taskTimeoutProgressAtMs, so active progress renews the queue watchdog instead of using a hard global default. (src/agents/pi-embedded-runner/run.ts:218, b3ec11b05223)
Current tests cover timeout release and progress renewal: The command queue test suite covers a stuck task timing out and draining queued work, plus renewal from progress timestamps and fallback when the progress callback throws. (src/process/command-queue.test.ts:470, b3ec11b05223)
PR branch still uses a hard default queue timeout: The PR diff adds DEFAULT_TASK_TIMEOUT_MS and computes entry.taskTimeoutMs ?? DEFAULT_TASK_TIMEOUT_MS, which would enforce a queue timeout even when callers have not opted in. (src/process/command-queue.ts:48, bd6c34610c68)
Implementation provenance: Commit 470098b introduced the current CommandLaneTaskTimeoutError, taskTimeoutMs, queue timeout tests, and embedded-run lane timeout wiring. (src/process/command-queue.ts, 470098bd26f3)
Progress-aware follow-up provenance: Commit 21c5f8d added taskTimeoutProgressAtMs and threaded run progress into embedded-run lane timeouts. (src/process/command-queue.ts, 21c5f8dc6df1)

Likely related people:

steipete: Authored the current shipped queue timeout implementation and the progress-aware follow-up used by embedded runs. (role: current implementation owner; confidence: high; commits: 470098bd26f3, 21c5f8dc6df1, c500e8704f4e; files: src/process/command-queue.ts, src/process/command-queue.test.ts, src/agents/pi-embedded-runner/run.ts)
vincentkoc: Recent history credits Vincent in the shipped changelog entry and shows related work on shared command queue/runtime state and diagnostics. (role: runtime queue contributor; confidence: medium; commits: 4ca84acf240d, d262b1c68810, ec1f72b6c58f; files: src/process/command-queue.ts, src/process/command-queue.test.ts, src/shared/global-singleton.ts)
galiniliev: Recent queue-adjacent commits changed lane wait diagnostics and session turn priority on the same command queue path. (role: recent adjacent contributor; confidence: medium; commits: 18812bfc039a, 7c151b212bff; files: src/process/command-queue.ts, src/process/command-queue.test.ts, src/agents/pi-embedded-runner/run.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against b3ec11b05223; fix evidence: release v2026.5.18, commit 21c5f8dc6df1.

clawsweeper · 2026-05-19T23:21:42Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

moeedahmed · 2026-05-25T23:30:28Z

Real-world validation from production crash (May 25, 2026)

This fix would have prevented a production outage on my Mac Mini gateway. Here's what happened:

Trigger: A skill-workshop-review cron task fired shortly after a gateway restart. Its bash tool call via Codex app-server sent the request but the response was lost (Codex was still in auth re-warm from the restart). The task's promise never settled.

Chain of failure:

The skill-workshop-review task was enqueued in its session lane
The bash tool call never completed → task promise never resolved/rejected
completeTask() never ran → activeTaskIds entry stayed forever
Lane has maxConcurrent: 1 → all subsequent work was blocked
The diagnostic recovery at diagnostic-stuck-session-recovery.runtime.ts:302-318 skipped recovery because activeCount > 0 on the lane
The stuck item aged to 71 minutes before I force-restarted

Evidence from live logs:

skill-workshop-review-c815151c... idle/tool_call,q=1,age=4272s last=embedded_run:ended
activeTool=bash activeToolCallId=call_U5R5s82eTHs98YeBMw484mK1 activeToolAge=2254s

Note on the generation-mismatch scenario:
There's a related case where completeTask() returns false silently due to generation mismatch (line 266-267 in command-queue.ts), leaving the task ID in activeTaskIds but the task actually finished. This is distinct from the hung-promise case this PR handles — it's a cleanup bug in completeTask(). A stale-lane sweeper that checks idle time on the lane state would catch both cases. But the per-task timeout in this PR covers the hung-promise case which was the direct cause of this outage. This PR alone would have prevented the crash.

openclaw-barnacle · 2026-06-09T04:48:41Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle Bot added the size: M label Mar 17, 2026

kyletabor mentioned this pull request Mar 17, 2026

Lane queue has no task-level timeout — hung promises permanently block session lanes #48488

Open

greptile-apps Bot reviewed Mar 17, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Mar 17, 2026

View reviewed changes

clawsweeper Bot mentioned this pull request Apr 26, 2026

[Feature]: Decoupling Lane Queue and Asynchronous Transformation #51178

Closed

openclaw-clownfish Bot mentioned this pull request Apr 29, 2026

feat: add maxExecutionMs per-task timeout to command lane #68381

Closed

clawsweeper Bot mentioned this pull request May 13, 2026

feat: add circuit breaker for command lane saturation #68785

Open

clawsweeper Bot added the P2 Normal backlog priority with limited blast radius. label May 17, 2026

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 17, 2026

openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Jun 9, 2026

Uh oh!

Conversation

kyletabor commented Mar 17, 2026

Summary

Problem

Changes

Test plan

Uh oh!

greptile-apps Bot commented Mar 17, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

kyletabor Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

kyletabor Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

kyletabor Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

clawsweeper Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

moeedahmed commented May 25, 2026

Real-world validation from production crash (May 25, 2026)

Uh oh!

openclaw-barnacle Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clawsweeper Bot commented Apr 28, 2026 •

edited

Loading

clawsweeper Bot commented May 19, 2026 •

edited

Loading