Stale diagnostic tool_call activity can survive recovery/reset and re-block sessions as blocked_tool_call #87310

@shbernal

Summary

OpenClaw diagnostics can retain stale native tool-call activity for a session after stuck-session recovery, session reset, or session replacement. Future work on the same sessionKey can then be classified as blocked_tool_call even when the original tool process is no longer running and the affected session transcript has already been reset/archived.

This is narrower than the existing stuck-session umbrella issues: the suspected defect is not just that a session stalls, but that the diagnostic activity tracker can keep an orphaned activeWorkKind=tool_call / activeTool=bash entry long enough to poison later recovery decisions for the same session key.

Environment

  • OpenClaw: 2026.5.20 (e510042)
  • OS: Linux 7.0.10-arch1-1 x86_64
  • Node: v24.11.1
  • Gateway: systemd user service, loopback gateway
  • Runtime observed: embedded Codex app-server / native tool execution

Observed behavior

A local gateway repeatedly emitted stalled-session diagnostics with this shape, redacted:

stalled session: sessionId=<redacted> sessionKey=<redacted>
  state=processing
  reason=blocked_tool_call
  classification=blocked_tool_call
  activeWorkKind=tool_call
  activeTool=bash
  activeToolAgeMs=<very large>
  lastProgress=<old tool start>
  recovery=checking

The reported active tool ages were on the order of many hours. The two stale tool ids found in local state resolved only to reset/trajectory archives, not to live sessions. One stale tool was a long-running foreground bash command that had started a dev server; the other was an rg command. At investigation time there was no matching live process for the dev-server case, and the current gateway stability view showed no active work after recovery.

Recovery later emitted an abort/drain outcome for an embedded run, for example:

stuck session recovery outcome: status=aborted action=abort_embedded_run
  activeWorkKind=embedded_run
  aborted=true drained=true forceCleared=false released=0

The concerning part is that stale native tool activity and embedded-run/reply-run recovery appear to be tracked by separate state paths. If the native tool never emits the matching completion event, or if recovery/reset replaces the session before the completion event is reconciled, the diagnostic tracker can continue to report tool_call as active and drive blocked_tool_call classification for future turns.

Expected behavior

Stuck-session recovery, session reset, and session replacement should clear or reconcile diagnostic active-work state for the affected sessionId/sessionKey, including native tool calls.

After a recovery aborts/drains an embedded run or a session is reset/replaced:

  1. stale activeTools entries for the old session/run should not continue to classify later turns as blocked_tool_call;
  2. blocked_tool_call should mean there is still an owned active native tool, not just an orphaned diagnostic record;
  3. if the original tool cannot be cancelled or observed, the diagnostic state should be explicitly evicted/quarantined with a structured recovery event.
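
As a rough illustration of points 2 and 3, classification could separate owned tool records from orphaned ones before deciding that a session is blocked. This is a minimal sketch under assumptions: `ToolRecord`, `classifyToolActivity`, the `isToolOwned` callback, and the `stale_tool_evicted` label are hypothetical names introduced here, not part of OpenClaw.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw.
interface ToolRecord {
  toolCallId: string;
  sessionId: string;
  startedAtMs: number;
}

type ToolClassification = "blocked_tool_call" | "stale_tool_evicted" | "idle";

// A record only counts as blocking when it belongs to the live session
// AND something still owns the tool call (for example, a running process).
function classifyToolActivity(
  records: ToolRecord[],
  liveSessionId: string,
  isToolOwned: (toolCallId: string) => boolean,
): { classification: ToolClassification; evicted: ToolRecord[] } {
  const owned: ToolRecord[] = [];
  const evicted: ToolRecord[] = [];
  for (const record of records) {
    if (record.sessionId === liveSessionId && isToolOwned(record.toolCallId)) {
      owned.push(record);
    } else {
      evicted.push(record);
    }
  }
  if (owned.length > 0) {
    return { classification: "blocked_tool_call", evicted };
  }
  return {
    classification: evicted.length > 0 ? "stale_tool_evicted" : "idle",
    evicted,
  };
}
```

With this shape, a record left behind by a superseded session is reported as evicted rather than blocking, while a tool that is still owned by the live session keeps the blocked_tool_call classification.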

Actual behavior

A stale activeTool=bash record can remain associated with a session key for many hours, repeatedly producing session.stalled with classification=blocked_tool_call. Recovery can abort/drain the embedded run but does not clearly guarantee that stale native tool activity for the session key is cleared.

Source-level suspect

Relevant current source paths:

  • src/logging/diagnostic-run-activity.ts
    • recordToolStarted adds native tools to activeTools.
    • recordToolEnded removes them.
    • recordRunCompleted clears active tools/model calls/embedded runs.
    • markDiagnosticEmbeddedRunEnded can clear run activity, but callers can opt out.
  • src/auto-reply/reply/reply-run-registry.ts
    • markReplyRunDiagnosticWorkEnded calls markDiagnosticEmbeddedRunEnded(..., clearRunActivity: false).
  • src/logging/diagnostic-session-attention.ts
    • stale tool_call activity is classified as blocked_tool_call.
  • src/logging/diagnostic.ts
    • isBlockedToolCallRecoveryEligible allows recovery once the blocked tool call crosses the abort threshold.
  • src/logging/diagnostic-stuck-session-recovery.runtime.ts
    • recovery can abort active embedded work, but the cleanup contract for orphaned native tool activity is not obvious from the observed outcome.
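
The suspected interaction between these paths can be shown with a toy model. It is not OpenClaw's source: the class, its storage layout (keyed by session key), and the method signatures are assumptions that only borrow names from the list above.

```typescript
// Toy model only. The class and method shapes are hypothetical; they borrow
// names from this issue but do not reproduce OpenClaw's source.
type WorkKind = "tool_call" | "embedded_run";

class ToyRunActivity {
  private activeTools = new Map<string, Set<string>>(); // sessionKey -> tool call ids
  private embeddedRuns = new Set<string>(); // sessionKeys with an active run

  recordToolStarted(sessionKey: string, toolCallId: string): void {
    const tools = this.activeTools.get(sessionKey) ?? new Set<string>();
    tools.add(toolCallId);
    this.activeTools.set(sessionKey, tools);
  }

  recordToolEnded(sessionKey: string, toolCallId: string): void {
    const tools = this.activeTools.get(sessionKey);
    if (!tools) return;
    tools.delete(toolCallId);
    if (tools.size === 0) this.activeTools.delete(sessionKey);
  }

  recordEmbeddedRunStarted(sessionKey: string): void {
    this.embeddedRuns.add(sessionKey);
  }

  // Callers may opt out of clearing tool activity, as this issue reports
  // for markReplyRunDiagnosticWorkEnded.
  markEmbeddedRunEnded(sessionKey: string, opts: { clearRunActivity: boolean }): void {
    this.embeddedRuns.delete(sessionKey);
    if (opts.clearRunActivity) this.activeTools.delete(sessionKey);
  }

  activeWorkKind(sessionKey: string): WorkKind | undefined {
    if (this.activeTools.has(sessionKey)) return "tool_call";
    if (this.embeddedRuns.has(sessionKey)) return "embedded_run";
    return undefined;
  }
}

// A bash tool starts inside an embedded run and never reports completion.
const toy = new ToyRunActivity();
toy.recordEmbeddedRunStarted("key-1");
toy.recordToolStarted("key-1", "tool-bash-1");

// Recovery aborts and drains the run without clearing run activity.
toy.markEmbeddedRunEnded("key-1", { clearRunActivity: false });

// The orphaned tool record still marks the session key as tool_call.
const kindAfterRecovery = toy.activeWorkKind("key-1");
```

In this model the embedded run is gone after recovery, yet `kindAfterRecovery` is still `"tool_call"` because nothing ever called `recordToolEnded` for the orphaned entry.
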

The missing primitive may be something like a targeted diagnostic cleanup/reconciliation path, for example clearDiagnosticSessionActivity({ sessionId, sessionKey, reason }), called when recovery aborts/drains a run, when a session is reset/replaced, and when a diagnostic tool-call owner is no longer present.
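
One possible shape for such a primitive is sketched below. Only the function name and its `{ sessionId, sessionKey, reason }` parameters come from the suggestion above; the in-memory `activityStore`, the `ActivityEntry` type, and the return value are assumptions made for illustration.

```typescript
// Sketch only: the store layout and return shape are assumptions.
type ActivityKind = "tool_call" | "model_call" | "embedded_run";

interface ActivityEntry {
  kind: ActivityKind;
  id: string;
  sessionId: string;
  sessionKey: string;
}

interface ClearResult {
  reason: string;
  cleared: ActivityEntry[];
}

// Stand-in for whatever state the real diagnostic tracker holds.
const activityStore: ActivityEntry[] = [];

function clearDiagnosticSessionActivity(params: {
  sessionId?: string;
  sessionKey?: string;
  reason: string;
}): ClearResult {
  const { sessionId, sessionKey, reason } = params;
  const cleared: ActivityEntry[] = [];
  // Refuse to clear everything when no selector is supplied.
  if (sessionId === undefined && sessionKey === undefined) {
    return { reason, cleared };
  }
  const kept: ActivityEntry[] = [];
  for (const entry of activityStore) {
    // When both selectors are given, an entry must match both of them.
    const matchesId = sessionId === undefined || entry.sessionId === sessionId;
    const matchesKey = sessionKey === undefined || entry.sessionKey === sessionKey;
    if (matchesId && matchesKey) cleared.push(entry);
    else kept.push(entry);
  }
  activityStore.length = 0;
  activityStore.push(...kept);
  return { reason, cleared };
}
```

Requiring both selectors to match when both are supplied lets a reset path clear only the superseded sessionId while leaving activity from a replacement session on the same sessionKey untouched. Returning the cleared entries gives the caller enough detail to log what was removed and why.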

Suggested fix shape

  1. Add a diagnostic activity cleanup primitive that removes or quarantines all active tool/model/embedded-run state for a given sessionId and/or sessionKey.
  2. Call it from stuck-session recovery when reason=blocked_tool_call recovery aborts/drains a run or determines the owner is gone.
  3. Call it from session reset/replacement paths after the old session is archived or superseded.
  4. Add regression coverage for:
    • native tool_call starts and never emits completion;
    • session is reset/replaced or embedded recovery aborts/drains;
    • later work on the same sessionKey is not classified as blocked_tool_call from the old tool;
    • genuine active native tools still classify as blocked until completion/abort.
  5. Emit a structured diagnostic event when stale tool activity is evicted, so operators can distinguish a real running tool from recovered stale state.
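
The regression cases in item 4 and the event in item 5 could be exercised along the following lines. This is a sketch under assumptions: `EvictingTracker`, the `diagnostic.stale_tool_evicted` event name, and every field name are hypothetical and stand in for whatever the real tracker and event schema would use.

```typescript
// Hypothetical event shape and tracker; all names are assumptions.
interface StaleToolEvictedEvent {
  type: "diagnostic.stale_tool_evicted"; // hypothetical event name
  sessionKey: string;
  sessionId: string;
  toolCallId: string;
  toolName: string;
  ageMs: number;
  reason: string;
}

interface TrackedTool {
  toolCallId: string;
  toolName: string;
  sessionId: string;
  startedAtMs: number;
}

class EvictingTracker {
  private tools = new Map<string, TrackedTool[]>();
  readonly events: StaleToolEvictedEvent[] = [];

  start(sessionKey: string, tool: TrackedTool): void {
    this.tools.set(sessionKey, [...(this.tools.get(sessionKey) ?? []), tool]);
  }

  end(sessionKey: string, toolCallId: string): void {
    this.store(sessionKey, (this.tools.get(sessionKey) ?? []).filter((t) => t.toolCallId !== toolCallId));
  }

  // Evict every tool owned by the superseded sessionId, one event per eviction.
  evictSession(sessionKey: string, oldSessionId: string, reason: string, nowMs: number): void {
    const all = this.tools.get(sessionKey) ?? [];
    for (const t of all) {
      if (t.sessionId !== oldSessionId) continue;
      this.events.push({
        type: "diagnostic.stale_tool_evicted",
        sessionKey,
        sessionId: t.sessionId,
        toolCallId: t.toolCallId,
        toolName: t.toolName,
        ageMs: nowMs - t.startedAtMs,
        reason,
      });
    }
    this.store(sessionKey, all.filter((t) => t.sessionId !== oldSessionId));
  }

  isBlocked(sessionKey: string): boolean {
    return (this.tools.get(sessionKey)?.length ?? 0) > 0;
  }

  private store(sessionKey: string, remaining: TrackedTool[]): void {
    if (remaining.length > 0) this.tools.set(sessionKey, remaining);
    else this.tools.delete(sessionKey);
  }
}

// Old session starts bash and never completes; a new session starts rg.
const tracker = new EvictingTracker();
tracker.start("key-1", { toolCallId: "tool-bash-1", toolName: "bash", sessionId: "session-old", startedAtMs: 1_000 });
tracker.start("key-1", { toolCallId: "tool-rg-2", toolName: "rg", sessionId: "session-new", startedAtMs: 5_000 });

// Session reset evicts only the old session's tool and records an event.
tracker.evictSession("key-1", "session-old", "session_reset", 61_000);
const blockedByGenuineTool = tracker.isBlocked("key-1"); // rg is still running

// The genuine tool completes normally, which unblocks the session key.
tracker.end("key-1", "tool-rg-2");
const blockedAfterCompletion = tracker.isBlocked("key-1");
```

The recorded event carries the tool name, age, and reason, which is the information an operator would need to tell recovered stale state apart from a tool that is really still running.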

Related issues

This is a narrow follow-up to prior stuck-session/recovery work rather than a duplicate:

Impact

A single orphaned native tool diagnostic record can keep a session lane looking blocked long after the original command is gone. The user-visible effect is delayed or lost replies, repeated stalled-session logs, and confusing recovery outcomes that make the gateway appear to still be blocked by bash when there is no corresponding live tool process.

Metadata

    Labels

    • P1: High-priority user-facing bug, regression, or broken workflow.
    • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
    • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
    • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
    • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
    • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
    • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
