fix(diagnostics): evict orphaned tool/model activity on owner-less run end#90750
fix(diagnostics): evict orphaned tool/model activity on owner-less run end#90750849261680 wants to merge 4 commits into
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 6, 2026, 2:05 AM ET / 06:05 UTC. Summary PR surface: Source +135, Tests +224, Docs +11. Total +370 across 11 files. Reproducibility: yes. Current main leaves tool/model markers untouched on the Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the centralized diagnostic activity cleanup after maintainer acceptance of focused CI/proof, keeping the behavior in Do we have a high-confidence way to reproduce the issue? Yes. Current main leaves tool/model markers untouched on the Is this the best way to solve the issue? Yes, this appears to be the best focused fix. The cleanup is placed in the central diagnostic activity owner, preserves active inner embedded runs, and fences late async tool/model starts; a recovery-only fix would miss the clean reply-run teardown path. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 743051d400ff. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +135, Tests +224, Docs +11. Total +370 across 11 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
@clawsweeper re-review The previous review evaluated an earlier head (7 files) and flagged a "no-dedicated-metric OTel design". That has since been addressed at the current head
A fully live external Slack/systemd gateway repro of the timing-sensitive hung-native-tool condition is noted under "What was not tested"; the runtime path is exercised deterministically in-process. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…n end
Stale native tool_call activity could survive a clean run teardown and
re-block later turns on the same sessionKey as blocked_tool_call.
markDiagnosticEmbeddedRunEnded({ clearRunActivity: false }) (used by the
reply-run wrapper so it does not clobber a still-draining inner run) removed
only the embedded-run marker and left tool/model markers untouched. When a
native tool never emitted its completion event, that marker leaked: a future
turn on the same session key saw activeWorkKind=tool_call with a very large
age and was misclassified as blocked_tool_call, repeatedly triggering stalled
diagnostics and even aborting fresh work.
Now, once no embedded-run owner remains for the session, leftover tool/model
markers are evicted (they can never report completion). A still-active inner
run keeps its markers. Eviction emits a structured session.activity.evicted
event (reason=orphaned_no_owner, evictedTools/evictedModelCalls) so operators
can distinguish recovered stale state from a real active tool.
Recovery-driven cleanup is unchanged and remains observable via
session.recovery.completed.
Closes openclaw#87310
The OTel exporter event switch is exhaustiveness-checked (oxlint). Add an explicit no-op case for the new session.activity.evicted event, matching the session.long_running / session.stalled precedent (eviction is captured by stability records and recovery counters; no dedicated OTel metric).
…unter Address the docs/exporter mismatch: the OpenTelemetry page advertised the new session.activity.evicted event, but the exporters did not emit it. Wire both exporters to emit a counter incremented by the total evicted markers (openclaw.session.activity.evicted / openclaw_session_activity_evicted_total, attr reason), document the metrics on the OTel and Prometheus pages, and add exporter tests. No raw session ids/keys are exported.
…down Address review: tool/model start events are async-queued, so a start emitted before an owner-less reply-run teardown could drain after the eviction and re-arm an owner-less marker, restoring the blocked_tool_call leak. The owner-less eviction now records a start-event sequence cutoff for the session owner refs at the current event sequence (the same mechanism stuck-session recovery uses via recoveredOwnerStartEventCutoffs), so a late-draining start is ignored instead of recreating the marker. Adds an async-drain regression test.
3563d76 to
83c3212
Compare
|
@clawsweeper re-review Addressed both findings from the last review and rebased onto current
All affected suites + tsgo core/extensions/test configs + oxlint/oxfmt are green on the rebased base. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
Summary
Stale native
tool_calldiagnostic activity could survive a clean run teardown and re-block later turns on the samesessionKeyasblocked_tool_call(#87310).markDiagnosticEmbeddedRunEnded({ clearRunActivity: false })— used by the reply-run wrapper so it does not clobber a still-draining inner embedded run — removed only the embedded-run marker and left tool/model markers untouched. When a native tool never emitted its matchingtool.execution.completedevent (e.g. the embedded Codex app-server tool whose owner is torn down first), that marker leaked into the per-session activity (keyed bysessionKey). A later turn then sawactiveWorkKind=tool_callwith a very largeactiveToolAgeMs, was classifiedblocked_tool_call, and produced repeatedsession.stalleddiagnostics — and recovery could even abort the fresh, healthy run.Fix
src/logging/diagnostic-run-activity.ts: when an embedded/reply run ends and no embedded-run owner remains for the session, leftover tool/model markers are now evicted (they can never report completion). A still-active inner run keeps its markers, preserving the original reasonclearRunActivity: falseexisted.session.activity.evictedevent (reason=orphaned_no_owner,evictedTools/evictedModelCalls) so operators can distinguish recovered stale state from a real active tool (issue items 3 & 5). Recorded in the stability ring buffer and exported as theopenclaw.session.activity.evicted(OTel) /openclaw_session_activity_evicted_total(Prometheus) counter.docs/gateway/opentelemetry.mdanddocs/gateway/prometheus.md.Recovery-driven cleanup (
clearDiagnosticEmbeddedRunActivityForSession) is unchanged and remains observable via the existingsession.recovery.completedevent; this change targets the non-recovery clean-teardown path that previously evicted nothing and leaked silently.Real behavior proof (runtime diagnostics)
tool_call(andmodel_call) diagnostic activity surviving a clean reply-run completion and re-blocking the next turn on the samesessionKeyasblocked_tool_call.startDiagnosticHeartbeat(30s warn / 60s abort) wired to the realrecoverStuckDiagnosticSessionfromdiagnostic-stuck-session-recovery.runtime.tsand the realreply-run-registry— on macOS, Node 22.22.2. ThediagnosticLogger.warnoutput below is the actual emitted runtime log, not a mocked assertion.agent:main:slack:channel:REDACTED, started a nativebashtool that never emitstool.execution.completed, then completed the reply run and marked the session idle; turn 2 queued a new message on the same session key and wentprocessing; advanced 8 heartbeat cycles past the 60s abort threshold and captureddiagnosticLogger.warn. The same harness was run once with the patch reverted to capture the "before" lines.reason=blocked_tool_call classification=blocked_tool_callline no longer appears for the follow-up turn; the orphanedbashmarker is evicted at reply completion and the follow-up turn is handled as a recoverable stale lane (release_lane), so the session converges instead of repeatedly re-blocking.Verification
Suites run after the patch:
diagnostic,reply-run-registry,diagnostic-stability,diagnostic-stuck-session-recovery.runtime/.integration,diagnostic-session-attention,diagnostics-otel,diagnostics-prometheus— all green.tsgocore + core-test + extensions + extensions-test clean; oxlint/oxfmt clean. New regression tests:reply-run-registry: evicts an orphaned native tool when a reply run completes without its completion.diagnostic: evicts orphaned tool/model markers when the last owner ends without clearing run activity; keeps tool markers when an inner embedded run is still active.diagnostics-otel/diagnostics-prometheus: export thesession.activity.evictedcounter by total markers without leaking raw ids.Closes #87310