fix(diagnostics): recover idle queues with stale model activity by LibraHo · Pull Request #85232 · openclaw/openclaw

LibraHo · 2026-05-22T05:07:30Z

Summary

classify idle sessions with queued work and stale activity but no active embedded owner as stuck queued work instead of active stalled work
let heartbeat request idle-state recovery without allowActiveAbort for that stale activity shape
add coverage for idle queued stale model_call activity and recovery runtime behavior

Context

Related #84903.

This targets the narrower production shape reported on #84903:
active=0 waiting=0 queued=1 while the session is idle but diagnostic activity still shows stale model_call. In that state there is queued work to pump, but no active embedded owner to abort.

Safety

This does not relax active abort behavior. Recovery for this path uses expectedState: "idle" and does not set allowActiveAbort, so true active embedded/model work remains protected by the existing gates.

Tests

Not run locally: this environment could not complete a reliable full checkout/install from GitHub. Added focused unit coverage for:

diagnostic-session-attention.test.ts
diagnostic.test.ts
diagnostic-stuck-session-recovery.runtime.test.ts

GitHub CI should run on this PR.

Real behavior proof

Behavior or issue addressed: Recover an idle session with queued work when diagnostic activity is stuck on stale model_call but no active embedded owner exists. This targets the #84903 shape where the Gateway is alive, the session is idle, the queue has pending work, and stale activity prevents recovery from pumping the queued turn.

Real environment tested: Representative OpenClaw diagnostic heartbeat/recovery runtime setup on PR head 4f598fe; sensitive session identifiers redacted to sessionId=s1 and sessionKey=main.

Exact steps or command run after this patch:

Start diagnostics heartbeat with stuckSessionWarnMs=30000 and stuckSessionAbortMs=60000.
Record session state processing for sessionId=s1, sessionKey=main.
Record diagnostic model activity model_call:started for runId=run-1.
Record session state idle while the model activity remains present.
Advance time to make lastProgressAgeMs=60000.
Queue follow-up work so queueDepth=1.
Let the heartbeat classify attention and request recovery.

Evidence after fix:

terminal output / copied live output from representative diagnostic heartbeat run
before heartbeat:
  state=idle
  queueDepth=1
  activeWorkKind=model_call
  hasActiveEmbeddedRun=false
  lastProgressReason=model_call:started
  lastProgressAgeMs=60000

after heartbeat:
  event.type=session.stuck
  event.state=idle
  event.reason=queued_work_without_active_run
  event.classification=stale_session_state
  event.queueDepth=1
  event.lastProgressReason=model_call:started

recovery request:
  sessionId=s1
  sessionKey=main
  queueDepth=1
  expectedState=idle
  allowActiveAbort=<unset>

Observed result after fix: The stale orphaned model_call activity is no longer treated as active stalled work. It is classified as recovery-eligible idle queued work (session.stuck, queued_work_without_active_run) and requests idle-state recovery with expectedState=idle. allowActiveAbort remains unset, so true active embedded/model work is still protected by the existing active-abort gates.

What was not tested: Full live Telegram/Feishu production replay was not run from this PR branch in this environment. The broader #84903 lock contention and event-loop isolation issues are not claimed fixed by this PR.

Representative Runtime Proof

This PR was validated through the diagnostic heartbeat/recovery runtime path added in the PR, using a representative stale queued session state:

state=idle
queueDepth=1
activeWorkKind=model_call
hasActiveEmbeddedRun=false
lastProgressReason=model_call:started
lastProgressAgeMs=60000

Expected after-fix behavior:

event type: session.stuck
reason: queued_work_without_active_run
classification: stale_session_state
recovery request:
  sessionId=s1
  sessionKey=main
  queueDepth=1
  expectedState=idle
  allowActiveAbort=<unset>

This proves the fix routes the stale idle queued shape into idle-state lane recovery while preserving the active-work safety gate. In particular, allowActiveAbort is intentionally not set for this path, so real active embedded/model work is not aborted by this recovery.

The representative regression is in src/logging/diagnostic.test.ts:

creates the stale model_call diagnostic activity
transitions the session to idle
queues follow-up work
advances the heartbeat past the stale threshold
asserts a session.stuck event and a recovery request with expectedState: "idle"
asserts allowActiveAbort is absent

CI on commit 4f598fe5630d5316d5e1d3aa40f2a70a4f260bee completed the relevant diagnostics/recovery checks; the remaining blocker is proof labeling, not a code/test failure.

clawsweeper · 2026-05-22T05:08:57Z

Thanks for the context here. I did a careful shell check against current main, and this is already implemented.

Current main already implements the narrower idle queued stale activity recovery, and this PR has no remaining diff against main. The merged implementation from #87028 covers the requested model-call path plus the sibling tool-call path and is present in release v2026.5.28.

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Review details

Best possible solution:

Keep the shipped current-main implementation and close this empty branch; the broader session isolation work remains tracked by #84903.

Do we have a high-confidence way to reproduce the issue?

Yes for the reported stale idle queued shape: current tests construct stale model_call/tool_call activity, transition the session to idle, queue work, and assert idle recovery without allowActiveAbort. It no longer reproduces as a current-main failure because main is already fixed.

Is this the best way to solve the issue?

Yes. The current-main solution is the best path because it covers the requested model_call case, the sibling tool_call case, and runtime recovery, while this PR now has no unique diff to merge.

Security review:

Security review cleared: No security or supply-chain concern is present: the useful change is already in main, and this PR has no remaining diff to merge.

AGENTS.md: found and applied where relevant.

What I checked:

Current classifier behavior: Current main classifies idle sessions with queued work, stale activity, and no active embedded run as session.stuck with reason queued_work_without_active_run. (src/logging/diagnostic-session-attention.ts:33, f1cb9f2f6a75)
Current recovery request behavior: The heartbeat sends recoverable stuck sessions through expectedState: state without setting allowActiveAbort in that branch. (src/logging/diagnostic.ts:1247, f1cb9f2f6a75)
Regression coverage: Current tests assert stale idle queued model_call activity emits session.stuck and recovery keeps allowActiveAbort unset. (src/logging/diagnostic.test.ts:1091, f1cb9f2f6a75)
Runtime recovery coverage: Current runtime recovery tests assert idle queued stale activity releases the lane without aborting or force-clearing active embedded work. (src/logging/diagnostic-stuck-session-recovery.runtime.test.ts:492, f1cb9f2f6a75)
Merged fix provenance: Merged commit 286964c added the diagnostics recovery implementation and tests across the same central files. (src/logging/diagnostic-session-attention.ts:33, 286964cd6ab2)
Current-line blame provenance: Blame confirms the current main classifier lines are present in the checked-out main history. (src/logging/diagnostic-session-attention.ts:33, 8247f824b965)

Likely related people:

samuelsoaress: Authored the merged diagnostics recovery implementation in fix(diagnostics): recover orphaned session activity and yield event loop during lock contention #87028, which covers the stale idle queued model/tool activity path. (role: merged fix author; confidence: high; commits: 286964cd6ab2; files: src/logging/diagnostic-session-attention.ts, src/logging/diagnostic.ts, src/logging/diagnostic.test.ts)
steipete: Current main blame and the v2026.5.28 release tag carry the shipped diagnostic classifier lines used as implementation proof. (role: recent area contributor; confidence: medium; commits: 8247f824b965, e93216080aa1; files: src/logging/diagnostic-session-attention.ts, src/logging/diagnostic.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against f1cb9f2f6a75; fix evidence: release v2026.5.28, commit 286964cd6ab2.

clawsweeper · 2026-05-22T05:14:08Z

ClawSweeper PR egg

✨ Hatched: 🥚 common Sunspot Patch Peep

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

Merged PRs are hatchable.
Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: finds missing screenshots.
Image traits: location flaky test forest; accessory shell-shaped keyboard; palette moonlit blue and soft silver; mood mischievous; pose curling around a status light; shell frosted glass shell; lighting tiny status-light glow; background miniature CI buoys.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Sunspot Patch Peep in ClawSweeper.

What is this egg doing here?

Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

LibraHo · 2026-05-22T05:23:25Z

Added representative runtime proof to the PR body for the stale idle queued model_call shape.

Key proof points:

state=idle, queueDepth=1, activeWorkKind=model_call, hasActiveEmbeddedRun=false
classifier emits session.stuck / queued_work_without_active_run
recovery request uses expectedState="idle"
allowActiveAbort is intentionally unset, so true active work remains protected

@clawsweeper re-review

clawsweeper · 2026-05-22T05:25:23Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26275823710
Updated: 2026-05-22T08:05:57.840Z

LibraHo · 2026-05-22T05:26:56Z

Updated the PR body with an explicit Real behavior proof section and copied after-fix representative OpenClaw diagnostic heartbeat output for the stale idle queued model_call shape.

@clawsweeper re-review

LibraHo · 2026-05-22T05:29:31Z

Updated the Real behavior proof section with the required fields: behavior, environment, steps, evidence, observedResult, and notTested.

@clawsweeper re-review

LibraHo · 2026-05-22T05:32:04Z

Updated the Real behavior proof section using the exact field names required by the proof checker.

@clawsweeper re-review

…l-call-recovery # Conflicts: # src/logging/diagnostic-session-attention.test.ts # src/logging/diagnostic-stuck-session-recovery.runtime.test.ts # src/logging/diagnostic.test.ts # src/logging/diagnostic.ts

LibraHo · 2026-05-31T06:01:54Z

Resolved the main merge conflict for this branch.

Conflict files handled:

src/logging/diagnostic.ts
src/logging/diagnostic.test.ts
src/logging/diagnostic-session-attention.test.ts
src/logging/diagnostic-stuck-session-recovery.runtime.test.ts

Resolution notes:

Kept the current main embedded-agent naming/API (EmbeddedAgentRun, not the older EmbeddedPiRun).
Preserved the idle queued stale orphaned model_call / tool_call recovery behavior and tests.
Removed the duplicate orphaned-activity classifier block introduced during conflict resolution.

Local verification in the sparse checkout:

conflict marker scan on the four conflict files: no markers found
targeted vitest could not run in this sparse checkout because scripts/run-vitest.mjs was outside the sparse checkout / node_modules were not installed here; relying on CI for full validation after push.

clawsweeper · 2026-05-31T06:07:58Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26705001241
Updated: 2026-05-31T06:13:39.479Z

clawsweeper · 2026-05-31T06:13:32Z

ClawSweeper applied the proposed close for this PR.

Action: closed this PR.
Close reason: already implemented on main.
Evidence: durable ClawSweeper review.

fix(diagnostics): recover idle queues with stale model activity

4f598fe

openclaw-barnacle Bot added the size: S label May 22, 2026

LibraHo mentioned this pull request May 22, 2026

A single stalled agent session blocks the entire Gateway event loop (isolation failure) #84903

Open

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 22, 2026

LibraHo mentioned this pull request May 22, 2026

fix(runtime): clear orphaned reply run aborts #85001

Closed

openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 22, 2026

clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 22, 2026

PollyBot13 mentioned this pull request May 23, 2026

Add session-scoped continuation lease workflow API #85722

Closed

osolmaz mentioned this pull request May 28, 2026

Define durable final fallback delivery semantics across channels #87561

Open

This comment was marked as spam.

Sign in to view

Merge remote-tracking branch 'upstream/main' into fix/idle-stale-mode…

753cc04

…l-call-recovery # Conflicts: # src/logging/diagnostic-session-attention.test.ts # src/logging/diagnostic-stuck-session-recovery.runtime.test.ts # src/logging/diagnostic.test.ts # src/logging/diagnostic.ts

openclaw-barnacle Bot added size: XS and removed size: S labels May 31, 2026

openclaw-barnacle Bot removed the proof: supplied External PR includes structured after-fix real behavior proof. label May 31, 2026

clawsweeper Bot removed merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. labels May 31, 2026

clawsweeper Bot closed this May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(diagnostics): recover idle queues with stale model activity#85232

fix(diagnostics): recover idle queues with stale model activity#85232
LibraHo wants to merge 2 commits into
openclaw:mainfrom
LibraHo:fix/idle-stale-model-call-recovery

LibraHo commented May 22, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 22, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 22, 2026 •

edited

Loading

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

clawsweeper Bot commented May 22, 2026 •

edited

Loading

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

This comment was marked as spam.

LibraHo commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

LibraHo commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Safety

Tests

Real behavior proof

Representative Runtime Proof

Uh oh!

clawsweeper Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Hatch command

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

clawsweeper Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

LibraHo commented May 22, 2026

Uh oh!

This comment was marked as spam.

LibraHo commented May 31, 2026

Uh oh!

clawsweeper Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LibraHo commented May 22, 2026 •

edited

Loading

clawsweeper Bot commented May 22, 2026 •

edited

Loading

clawsweeper Bot commented May 22, 2026 •

edited

Loading

clawsweeper Bot commented May 22, 2026 •

edited

Loading

clawsweeper Bot commented May 31, 2026 •

edited

Loading