Skip to content

fix(diagnostics): evict orphaned tool/model activity on owner-less run end#90750

Open
849261680 wants to merge 4 commits into
openclaw:mainfrom
849261680:fix/87310-diagnostic-stale-tool-activity
Open

fix(diagnostics): evict orphaned tool/model activity on owner-less run end#90750
849261680 wants to merge 4 commits into
openclaw:mainfrom
849261680:fix/87310-diagnostic-stale-tool-activity

Conversation

@849261680

@849261680 849261680 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Stale native tool_call diagnostic activity could survive a clean run teardown and re-block later turns on the same sessionKey as blocked_tool_call (#87310).

markDiagnosticEmbeddedRunEnded({ clearRunActivity: false }) — used by the reply-run wrapper so it does not clobber a still-draining inner embedded run — removed only the embedded-run marker and left tool/model markers untouched. When a native tool never emitted its matching tool.execution.completed event (e.g. the embedded Codex app-server tool whose owner is torn down first), that marker leaked into the per-session activity (keyed by sessionKey). A later turn then saw activeWorkKind=tool_call with a very large activeToolAgeMs, was classified blocked_tool_call, and produced repeated session.stalled diagnostics — and recovery could even abort the fresh, healthy run.

Fix

  • src/logging/diagnostic-run-activity.ts: when an embedded/reply run ends and no embedded-run owner remains for the session, leftover tool/model markers are now evicted (they can never report completion). A still-active inner run keeps its markers, preserving the original reason clearRunActivity: false existed.
  • New structured session.activity.evicted event (reason=orphaned_no_owner, evictedTools/evictedModelCalls) so operators can distinguish recovered stale state from a real active tool (issue items 3 & 5). Recorded in the stability ring buffer and exported as the openclaw.session.activity.evicted (OTel) / openclaw_session_activity_evicted_total (Prometheus) counter.
  • Docs updated in docs/gateway/opentelemetry.md and docs/gateway/prometheus.md.

Recovery-driven cleanup (clearDiagnosticEmbeddedRunActivityForSession) is unchanged and remains observable via the existing session.recovery.completed event; this change targets the non-recovery clean-teardown path that previously evicted nothing and leaked silently.

Real behavior proof (runtime diagnostics)

  • Behavior addressed: Orphaned native tool_call (and model_call) diagnostic activity surviving a clean reply-run completion and re-blocking the next turn on the same sessionKey as blocked_tool_call.
  • Real environment tested: An in-process runtime harness exercising the real diagnostic runtime — startDiagnosticHeartbeat (30s warn / 60s abort) wired to the real recoverStuckDiagnosticSession from diagnostic-stuck-session-recovery.runtime.ts and the real reply-run-registry — on macOS, Node 22.22.2. The diagnosticLogger.warn output below is the actual emitted runtime log, not a mocked assertion.
  • Exact steps or command run after this patch: Drove the real heartbeat: turn 1 opened a reply operation on agent:main:slack:channel:REDACTED, started a native bash tool that never emits tool.execution.completed, then completed the reply run and marked the session idle; turn 2 queued a new message on the same session key and went processing; advanced 8 heartbeat cycles past the 60s abort threshold and captured diagnosticLogger.warn. The same harness was run once with the patch reverted to capture the "before" lines.
  • Observed result after fix: The reason=blocked_tool_call classification=blocked_tool_call line no longer appears for the follow-up turn; the orphaned bash marker is evicted at reply completion and the follow-up turn is handled as a recoverable stale lane (release_lane), so the session converges instead of repeatedly re-blocking.
  • Evidence after fix: Redacted runtime diagnostic logger output captured from the real diagnostic heartbeat + stuck-session recovery runtime (synthetic session ids):
BEFORE (patch reverted) — turn 2 is misclassified off the orphaned tool from turn 1:
stalled session: sessionId=SESSION-2 sessionKey=agent:main:slack:channel:REDACTED state=processing age=60s queueDepth=1 reason=blocked_tool_call classification=blocked_tool_call activeWorkKind=tool_call lastProgress=embedded_run:ended lastProgressAge=60s activeTool=bash activeToolCallId=T1 activeToolAge=60s terminalProgressStale=true recovery=checking

AFTER (this patch) — the orphaned bash marker is evicted on reply completion; turn 2 is a recoverable stale lane, never blocked_tool_call:
stuck session: sessionId=SESSION-2 sessionKey=agent:main:slack:channel:REDACTED state=processing age=60s queueDepth=1 reason=queued_work_without_active_run classification=stale_session_state lastProgress=embedded_run:ended lastProgressAge=60s terminalProgressStale=true recovery=checking
stuck session recovery: sessionId=SESSION-2 sessionKey=agent:main:slack:channel:REDACTED age=60s action=release_lane aborted=false drained=true released=0
stuck session recovery outcome: status=released action=release_lane sessionId=SESSION-2 sessionKey=agent:main:slack:channel:REDACTED lane=session:agent:main:slack:channel:REDACTED released=0
  • What was not tested: A live external systemd gateway + Slack socket-mode deployment was not stood up; the timing-sensitive hung-native-tool condition is non-deterministic in production and was reproduced deterministically through the real in-process runtime path. OTel/Prometheus exporter wiring is covered by exporter tests rather than a scraped live collector.

Verification

Suites run after the patch: diagnostic, reply-run-registry, diagnostic-stability, diagnostic-stuck-session-recovery.runtime/.integration, diagnostic-session-attention, diagnostics-otel, diagnostics-prometheus — all green. tsgo core + core-test + extensions + extensions-test clean; oxlint/oxfmt clean. New regression tests:

  • reply-run-registry: evicts an orphaned native tool when a reply run completes without its completion.
  • diagnostic: evicts orphaned tool/model markers when the last owner ends without clearing run activity; keeps tool markers when an inner embedded run is still active.
  • diagnostics-otel / diagnostics-prometheus: export the session.activity.evicted counter by total markers without leaking raw ids.

Closes #87310

@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. docs Improvements or additions to documentation gateway Gateway runtime size: M labels Jun 5, 2026
@clawsweeper

clawsweeper Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 6, 2026, 2:05 AM ET / 06:05 UTC.

Summary
The PR evicts owner-less diagnostic tool/model activity at reply-run teardown, emits and exports a session.activity.evicted telemetry event, and updates focused regression/exporter docs and tests.

PR surface: Source +135, Tests +224, Docs +11. Total +370 across 11 files.

Reproducibility: yes. Current main leaves tool/model markers untouched on the clearRunActivity: false reply-run path, and the PR body supplies before/after runtime logs showing the stale blocked_tool_call classification disappearing after eviction.

Review metrics: 1 noteworthy metric.

  • Diagnostic telemetry surfaces: 1 event added, 2 exporter counters added, 2 docs updated. The PR adds an operator-visible diagnostics contract, so maintainers should review the event and metric names before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] The patch changes active-work cleanup at reply-run teardown; a wrong owner-boundary decision could either drop genuine active tool/model markers or leave stale markers that continue to affect stuck-session recovery and queued replies.
  • [P1] The supplied proof is strong in-process runtime evidence, but it is not a live external Slack/systemd gateway reproduction of the timing-sensitive production condition.

Maintainer options:

  1. Land with focused diagnostics proof (recommended)
    Accept the targeted session-state risk once the focused diagnostics, reply-run registry, exporter, type, lint, and format checks are confirmed for head 83c3212755ca6fe6e6c93464ad56dab6dd0283fb.
  2. Ask for live gateway proof
    Pause merge and request a live gateway or transport-level artifact if maintainers want production-shaped proof beyond the deterministic in-process runtime harness.
  3. Close in favor of a broader recovery redesign
    Close or defer this PR only if maintainers decide owner-less diagnostic cleanup should be redesigned together with the wider stuck-session recovery surface.

Next step before merge

  • [P2] No narrow automated repair remains; maintainers should review/land the open fix PR or request stronger live gateway proof before merge.

Security
Cleared: No concrete security or supply-chain regression was found; the diff changes diagnostics state, telemetry exporters, docs, and tests without dependency, workflow, secret, or install-script changes.

Review details

Best possible solution:

Land the centralized diagnostic activity cleanup after maintainer acceptance of focused CI/proof, keeping the behavior in diagnostic-run-activity instead of adding channel-specific recovery logic.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main leaves tool/model markers untouched on the clearRunActivity: false reply-run path, and the PR body supplies before/after runtime logs showing the stale blocked_tool_call classification disappearing after eviction.

Is this the best way to solve the issue?

Yes, this appears to be the best focused fix. The cleanup is placed in the central diagnostic activity owner, preserves active inner embedded runs, and fences late async tool/model starts; a recovery-only fix would miss the clean reply-run teardown path.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 743051d400ff.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted before/after runtime diagnostic logs from the real heartbeat and stuck-session recovery path, which is sufficient proof for this non-visual diagnostics fix.

Label justifications:

  • P1: The linked behavior can leave later turns on the same session key misclassified as blocked, affecting real agent/channel replies.
  • merge-risk: 🚨 session-state: The diff changes how per-session diagnostic active-work state is cleared and fenced across reply-run teardown.
  • merge-risk: 🚨 message-delivery: If the owner boundary is wrong, queued follow-up replies could still be blocked or fresh work could be recovered incorrectly.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (logs): The PR body includes redacted before/after runtime diagnostic logs from the real heartbeat and stuck-session recovery path, which is sufficient proof for this non-visual diagnostics fix.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted before/after runtime diagnostic logs from the real heartbeat and stuck-session recovery path, which is sufficient proof for this non-visual diagnostics fix.
Evidence reviewed

PR surface:

Source +135, Tests +224, Docs +11. Total +370 across 11 files.

View PR surface stats
Area Files Added Removed Net
Source 5 138 3 +135
Tests 4 225 1 +224
Docs 2 11 0 +11
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 11 374 4 +370

What I checked:

  • Current-main leak path: On current main, markDiagnosticEmbeddedRunEnded removes the embedded-run marker but leaves tool/model markers intact whenever clearRunActivity is false, which is the central stale-activity path this PR targets. (src/logging/diagnostic-run-activity.ts:324, 743051d400ff)
  • Reply-run owner boundary: The reply-run registry intentionally ends its diagnostic wrapper with clearRunActivity: false, so the owner-less cleanup belongs in the shared diagnostic activity tracker rather than a channel-specific path. (src/auto-reply/reply/reply-run-registry.ts:221, 83c3212755ca)
  • Patch cleanup behavior: The PR keeps the existing default full clear, but when a clearRunActivity: false wrapper ends and no embedded owner remains, it fences late start events at the current diagnostic sequence and clears orphaned tool/model markers with an eviction event. (src/logging/diagnostic-run-activity.ts:327, 83c3212755ca)
  • Regression tests: The PR adds reply-run and diagnostic tests for orphan eviction, preserving markers while an inner embedded run is still active, and ignoring a queued tool start that drains after owner-less reply completion. (src/auto-reply/reply/reply-run-registry.test.ts:101, 83c3212755ca)
  • Exporter/docs coverage: The new event is wired into OTel and Prometheus counters without exporting raw session ids/keys, and the public gateway telemetry docs list the new metric. (extensions/diagnostics-otel/src/service.ts:1428, 83c3212755ca)
  • Codex/OpenClaw tool diagnostics contract: OpenClaw's Codex app-server projector emits trusted native tool start and terminal events with run/session identity, and sibling Codex source shows command completion events can arrive later by turn id, matching the PR's late-drain concern. (extensions/codex/src/app-server/event-projector.ts:1182, 83c3212755ca)

Likely related people:

  • shakkernerd: Git blame and git log -S clearRunActivity point the current diagnostic activity teardown and reply-run wrapper behavior to Shakker's commit a0840cad8fb3cae023fd4652b58c1a83d8e67815. (role: introduced behavior and recent area contributor; confidence: high; commits: a0840cad8fb3; files: src/logging/diagnostic-run-activity.ts, src/auto-reply/reply/reply-run-registry.ts, extensions/diagnostics-otel/src/service.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. labels Jun 5, 2026
@openclaw-barnacle openclaw-barnacle Bot added the extensions: diagnostics-otel Extension: diagnostics-otel label Jun 5, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 5, 2026
@openclaw-barnacle openclaw-barnacle Bot added extensions: diagnostics-prometheus proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 5, 2026
@849261680

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

The previous review evaluated an earlier head (7 files) and flagged a "no-dedicated-metric OTel design". That has since been addressed at the current head a02a88e8172b:

  • session.activity.evicted is now exported as a dedicated counter in both exporters — openclaw.session.activity.evicted (diagnostics-otel) and openclaw_session_activity_evicted_total (diagnostics-prometheus), incremented by total evicted markers with a reason attribute and no raw session ids — with exporter tests and metric docs on both pages.
  • The PR body now includes redacted before/after runtime diagnostic-logger output from the real diagnostic heartbeat + recoverStuckDiagnosticSession runtime: the blocked_tool_call line is present before the patch and gone after (the orphaned bash marker is evicted at reply completion, follow-up turn is a recoverable release_lane). The Real behavior proof CI gate is green (proof: supplied).

A fully live external Slack/systemd gateway repro of the timing-sensitive hung-native-tool condition is noted under "What was not tested"; the runtime path is exercised deterministically in-process.

@clawsweeper

clawsweeper Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 5, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 6, 2026
849261680 added 4 commits June 6, 2026 13:47
…n end

Stale native tool_call activity could survive a clean run teardown and
re-block later turns on the same sessionKey as blocked_tool_call.

markDiagnosticEmbeddedRunEnded({ clearRunActivity: false }) (used by the
reply-run wrapper so it does not clobber a still-draining inner run) removed
only the embedded-run marker and left tool/model markers untouched. When a
native tool never emitted its completion event, that marker leaked: a future
turn on the same session key saw activeWorkKind=tool_call with a very large
age and was misclassified as blocked_tool_call, repeatedly triggering stalled
diagnostics and even aborting fresh work.

Now, once no embedded-run owner remains for the session, leftover tool/model
markers are evicted (they can never report completion). A still-active inner
run keeps its markers. Eviction emits a structured session.activity.evicted
event (reason=orphaned_no_owner, evictedTools/evictedModelCalls) so operators
can distinguish recovered stale state from a real active tool.

Recovery-driven cleanup is unchanged and remains observable via
session.recovery.completed.

Closes openclaw#87310
The OTel exporter event switch is exhaustiveness-checked (oxlint). Add an
explicit no-op case for the new session.activity.evicted event, matching the
session.long_running / session.stalled precedent (eviction is captured by
stability records and recovery counters; no dedicated OTel metric).
…unter

Address the docs/exporter mismatch: the OpenTelemetry page advertised the new
session.activity.evicted event, but the exporters did not emit it. Wire both
exporters to emit a counter incremented by the total evicted markers
(openclaw.session.activity.evicted / openclaw_session_activity_evicted_total,
attr reason), document the metrics on the OTel and Prometheus pages, and add
exporter tests. No raw session ids/keys are exported.
…down

Address review: tool/model start events are async-queued, so a start emitted
before an owner-less reply-run teardown could drain after the eviction and
re-arm an owner-less marker, restoring the blocked_tool_call leak. The
owner-less eviction now records a start-event sequence cutoff for the session
owner refs at the current event sequence (the same mechanism stuck-session
recovery uses via recoveredOwnerStartEventCutoffs), so a late-draining start is
ignored instead of recreating the marker. Adds an async-drain regression test.
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 6, 2026
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 6, 2026
@849261680 849261680 force-pushed the fix/87310-diagnostic-stale-tool-activity branch from 3563d76 to 83c3212 Compare June 6, 2026 05:56
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 6, 2026
@849261680

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Addressed both findings from the last review and rebased onto current main (head 83c3212755, now mergeable):

  • [P1] late queued starts: owner-less reply-run teardown now records a start-event sequence cutoff for the session owner refs at the current event sequence (the same recoveredOwnerStartEventCutoffs mechanism stuck-session recovery uses), so a tool.execution.started / model.call.started emitted before teardown but drained afterward is ignored instead of re-arming an owner-less marker. Added an async-drain regression test (waitForDiagnosticEventsDrained) that fails without the cutoff.
  • [P2] dirty branch: rebased onto current upstream main; PR is mergeable again.

All affected suites + tsgo core/extensions/test configs + oxlint/oxfmt are green on the rebased base.

@clawsweeper

clawsweeper Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation extensions: diagnostics-otel Extension: diagnostics-otel extensions: diagnostics-prometheus gateway Gateway runtime merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stale diagnostic tool_call activity can survive recovery/reset and re-block sessions as blocked_tool_call

1 participant