fix(gateway): preserve stale channel restart diagnostics#90937
Conversation
|
Codex review: passed. Reviewed June 8, 2026, 1:19 AM ET / 05:19 UTC. Summary PR surface: Source +56, Tests +78. Total +134 across 2 files. Reproducibility: yes. Source inspection and the PR's before-fix regression show the sequence: non-manual stop timeout, restart request while the stale task remains, then a late Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the gateway-owned lifecycle fix with the regression tests, keeping the stale-task filtering channel-agnostic and accepting or separately proving the remaining live Telegram/LaunchAgent availability risk. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection and the PR's before-fix regression show the sequence: non-manual stop timeout, restart request while the stale task remains, then a late Is this the best way to solve the issue? Yes. The PR fixes the owner boundary in the gateway manager rather than adding Telegram-specific logic or health-monitor workarounds, and the added stop-hook regression protects the main compatibility concern I saw. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against b8adc11977ab. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +56, Tests +78. Total +134 across 2 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
Updated the PR body with a Real behavior proof section and copied after-fix source-runtime output for the patched gateway lifecycle path. /review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper automerge |
|
🦞✅ Source: What merged:
Automerge notes:
The automerge loop is complete. Automerge progress:
|
|
🦞✅ Source: Why human review is needed: What the maintainer can do as a next step: I added |
2e267c6 to
05974a8
Compare
|
Refreshed this PR onto latest main, replaced the PR body with the updated evidence, and verified the current check set is green. The earlier agents-core-tools camera failure now passes on the refreshed branch. @clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. |
|
🦞✅ Source: Why human review is needed: What the maintainer can do as a next step: I added |
|
Addressed the latest ClawSweeper feedback by updating the PR body with a redacted real gateway runtime diagnostic-channel proof on current head 53b37e5. The run uses the actual patched createChannelManager path with real timers and a real locally registered channel plugin, reproduces the stale aborted task sequence after the 5000ms stop timeout, and shows the late connected/null-error patch no longer clears the timeout diagnostic. @clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. |
) Summary: - This PR sanitizes status patches from aborted channel tasks in the gateway manager and adds regression tests for stale restart diagnostics. - PR surface: Source +56, Tests +78. Total +134 across 2 files. - Reproducibility: yes. Source inspection and the PR's before-fix regression show the sequence: non-manual sto ... while the stale task remains, then a late `connected=true` / `lastError=null` status patch on current main. Automerge notes: - PR branch already contained follow-up commit before automerge: fix(gateway): preserve stale restart diagnostics - PR branch already contained follow-up commit before automerge: fix(gateway): preserve stale channel restart diagnostics Validation: - ClawSweeper review passed for head 53b37e5. - Required merge gates passed before the squash merge. Prepared head SHA: 53b37e5 Review: openclaw#90937 (comment) Co-authored-by: snowzlm <snowzlm@noreply.codeberg.org> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Approved-by: takhoffman Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
Summary
Fixes #90901.
This keeps gateway-owned lifecycle state authoritative after a channel task has been aborted. A stale Telegram/polling-style task can still emit a late status patch after a recovery stop timeout. Before this change, that late patch could re-mark the account as
connected=trueand clearlastErrorwhile the manager still hadrunning=false,restartPending=true, andreconnectAttempts=0.The fix:
channel stop timed out after 5000msinstead of allowinglastError=null;Current branch state
The PR branch was refreshed on top of the latest
openclaw/openclaw@mainafter the earlier CI failure inchecks-node-agentic-agents-core-tools. ClawSweeper then ran its automerge repair lane and kept the fix on the contributor branch.b8adc11977ab9dc1eb558dc070bfe63df75911c5(feat(cron): support command jobs)53b37e50739078238b0b582c4db5da61ddd66cb8src/gateway/server-channels.tssrc/gateway/server-channels.test.tsThe earlier failing CI shard was reproduced against the refreshed branch and now passes:
That confirms the earlier camera inline-image failures were from the old PR branch being behind the current mainline test/runtime baseline, not from this gateway lifecycle change.
Review feedback addressed
ClawSweeper's latest review did not identify a code defect. The remaining P1 blocker was proof quality:
This update adds a redacted real gateway runtime diagnostic-channel run below. It uses the actual patched
createChannelManagerruntime path, real timers, a real channel plugin registered through the plugin registry, and no Vitest/fake-timer harness. It exercises the stale aborted task sequence after the 5000ms stop-timeout path.Reproduction / before evidence
Added regression coverage for the reported shape:
connected=true/lastError=nullstatus patch.Before the fix, the new regression fails because the stale task revives
connected=truein the stopped runtime:6739bbf06eda1740878cfcc2097f39609d18de69src/gateway/server-channels.test.ts > keeps recovery timeout diagnostics when a stale task reports connected after abortexpected true to be falseAfter evidence
With the fix, the same regression passes and the runtime keeps the actionable timeout diagnostic instead of returning to
connected=true/lastError=null.Focused verification on the refreshed latest-main branch:
Cloud proof after refreshing the branch:
53b37e50739078238b0b582c4db5da61ddd66cb8Cloud CI for the current head is green after ClawSweeper refreshed the contributor branch.
gh pr checks 90937 --repo openclaw/openclawexits 0, includingchecks-node-agentic-agents-core-tools, gateway shards, CodeQL security-high shards, OpenGrep, build artifacts, and Real behavior proof.Real behavior proof
53b37e50739078238b0b582c4db5da61ddd66cb8.createChannelManager, performs a non-manual stop that reaches the real 5000ms recovery timeout, requests restart while the stale task is still present, and then emits the late staleconnected=true/lastError=nulltask status patch.{ "proof": "pr-90937-real-gateway-runtime-diagnostic-channel", "head": "53b37e50739078238b0b582c4db5da61ddd66cb8", "durationMs": 5255, "startCount": 1, "capturedAbort": true, "afterStart": { "running": true, "connected": true, "lastError": null }, "beforeLateStatusPatch": { "running": false, "connected": false, "restartPending": true, "reconnectAttempts": 0, "lastError": "channel stop timed out after 5000ms" }, "afterLateStaleStatusPatch": { "running": false, "connected": false, "restartPending": true, "reconnectAttempts": 0, "lastError": "channel stop timed out after 5000ms" } }Runtime stderr also emitted the expected gateway timeout diagnostic:
Additional focused regression evidence on the same code path:
Additional cloud proof on the refreshed PR line:
connected=true,lastError=null) after the stale aborted task emits its late status patch; it remainsrunning=false,connected=false,restartPending=true,reconnectAttempts=0, withlastError="channel stop timed out after 5000ms".Risk / behavior notes
lastError=nullwith a stale connected marker.