fix(diagnostics): recover idle queues with stale model activity#85232
fix(diagnostics): recover idle queues with stale model activity#85232LibraHo wants to merge 2 commits into
Conversation
|
Thanks for the context here. I did a careful shell check against current Current main already implements the narrower idle queued stale activity recovery, and this PR has no remaining diff against main. The merged implementation from #87028 covers the requested model-call path plus the sibling tool-call path and is present in release v2026.5.28. So I’m closing this as already implemented rather than keeping a duplicate issue open. Review detailsBest possible solution: Keep the shipped current-main implementation and close this empty branch; the broader session isolation work remains tracked by #84903. Do we have a high-confidence way to reproduce the issue? Yes for the reported stale idle queued shape: current tests construct stale model_call/tool_call activity, transition the session to idle, queue work, and assert idle recovery without allowActiveAbort. It no longer reproduces as a current-main failure because main is already fixed. Is this the best way to solve the issue? Yes. The current-main solution is the best path because it covers the requested model_call case, the sibling tool_call case, and runtime recovery, while this PR now has no unique diff to merge. Security review: Security review cleared: No security or supply-chain concern is present: the useful change is already in main, and this PR has no remaining diff to merge. AGENTS.md: found and applied where relevant. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against f1cb9f2f6a75; fix evidence: release v2026.5.28, commit 286964cd6ab2. |
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Sunspot Patch Peep Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
Added representative runtime proof to the PR body for the stale idle queued model_call shape. Key proof points:
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Updated the PR body with an explicit Real behavior proof section and copied after-fix representative OpenClaw diagnostic heartbeat output for the stale idle queued model_call shape. @clawsweeper re-review |
|
Updated the Real behavior proof section with the required fields: behavior, environment, steps, evidence, observedResult, and notTested. @clawsweeper re-review |
|
Updated the Real behavior proof section using the exact field names required by the proof checker. @clawsweeper re-review |
This comment was marked as spam.
This comment was marked as spam.
…l-call-recovery # Conflicts: # src/logging/diagnostic-session-attention.test.ts # src/logging/diagnostic-stuck-session-recovery.runtime.test.ts # src/logging/diagnostic.test.ts # src/logging/diagnostic.ts
|
Resolved the Conflict files handled:
Resolution notes:
Local verification in the sparse checkout:
|
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
ClawSweeper applied the proposed close for this PR.
|
Summary
allowActiveAbortfor that stale activity shapemodel_callactivity and recovery runtime behaviorContext
Related #84903.
This targets the narrower production shape reported on #84903:
active=0 waiting=0 queued=1while the session is idle but diagnostic activity still shows stalemodel_call. In that state there is queued work to pump, but no active embedded owner to abort.Safety
This does not relax active abort behavior. Recovery for this path uses
expectedState: "idle"and does not setallowActiveAbort, so true active embedded/model work remains protected by the existing gates.Tests
Not run locally: this environment could not complete a reliable full checkout/install from GitHub. Added focused unit coverage for:
diagnostic-session-attention.test.tsdiagnostic.test.tsdiagnostic-stuck-session-recovery.runtime.test.tsGitHub CI should run on this PR.
Real behavior proof
Behavior or issue addressed: Recover an idle session with queued work when diagnostic activity is stuck on stale model_call but no active embedded owner exists. This targets the #84903 shape where the Gateway is alive, the session is idle, the queue has pending work, and stale activity prevents recovery from pumping the queued turn.
Real environment tested: Representative OpenClaw diagnostic heartbeat/recovery runtime setup on PR head 4f598fe; sensitive session identifiers redacted to sessionId=s1 and sessionKey=main.
Exact steps or command run after this patch:
Evidence after fix:
Observed result after fix: The stale orphaned model_call activity is no longer treated as active stalled work. It is classified as recovery-eligible idle queued work (session.stuck, queued_work_without_active_run) and requests idle-state recovery with expectedState=idle. allowActiveAbort remains unset, so true active embedded/model work is still protected by the existing active-abort gates.
What was not tested: Full live Telegram/Feishu production replay was not run from this PR branch in this environment. The broader #84903 lock contention and event-loop isolation issues are not claimed fixed by this PR.
Representative Runtime Proof
This PR was validated through the diagnostic heartbeat/recovery runtime path added in the PR, using a representative stale queued session state:
Expected after-fix behavior:
This proves the fix routes the stale idle queued shape into idle-state lane recovery while preserving the active-work safety gate. In particular,
allowActiveAbortis intentionally not set for this path, so real active embedded/model work is not aborted by this recovery.The representative regression is in
src/logging/diagnostic.test.ts:model_calldiagnostic activityidlesession.stuckevent and a recovery request withexpectedState: "idle"allowActiveAbortis absentCI on commit
4f598fe5630d5316d5e1d3aa40f2a70a4f260beecompleted the relevant diagnostics/recovery checks; the remaining blocker is proof labeling, not a code/test failure.