Session lock auto-cleanup on staleness detection #87779
Open
Labels
- `P1`: High-priority user-facing bug, regression, or broken workflow.
- `clawsweeper:needs-live-repro`: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
- `clawsweeper:needs-maintainer-review`: ClawSweeper marked this issue as needing maintainer review before automation.
- `clawsweeper:needs-product-decision`: ClawSweeper marked this issue as needing a product or behavior decision.
- `clawsweeper:no-new-fix-pr`: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- `impact:message-loss`: Channel message delivery can be lost, duplicated, or misrouted.
- `impact:session-state`: Session, memory, transcript, context, or agent state can drift or corrupt.
- `issue-rating: 🐚 platinum hermit`: Good issue quality with a plausible reproduction path needing some confirmation.
Summary
Session JSONL lock files (`.jsonl.lock`) can become stale even when the gateway PID is alive and actively holding them, causing `file lock stale` errors in the `sessions_send` / `sessions_spawn` paths. This requires manual `openclaw sessions cleanup` intervention.

Background
Over the past 3 days (2026-05-26 to 2026-05-28), the OpenClaw gateway has experienced recurring "file lock stale" substrate failures affecting multiple agents simultaneously. The pattern:
- `.lock` files exist on disk at failure time
- `EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released`

This appears to be a race condition in embedded-session lock management during concurrent `sessions_send` / sub-agent spawn operations, particularly affecting high-volume agents (woodhouse, fleet-ops, PMs).

Evidence:
- `openclaw sessions cleanup` clears the issue temporarily

Full forensic: [internal audit doc available on request]
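
To make the suspected race window concrete, here is a minimal Python sketch of the kind of check the error message implies: a holder records the session file's state before releasing the lock and refuses to continue if the file has changed by the time it comes back. This is an illustration only. The names `Fingerprint`, `check_unchanged`, and `SessionChangedError` are hypothetical, and nothing here is taken from the OpenClaw source.

```python
import os
from typing import NamedTuple


class Fingerprint(NamedTuple):
    """Cheap snapshot of a file's state (hypothetical helper)."""
    size: int
    mtime_ns: int


class SessionChangedError(RuntimeError):
    """Hypothetical stand-in for EmbeddedAttemptSessionTakeoverError."""


def fingerprint(path: str) -> Fingerprint:
    # Size plus nanosecond mtime is enough to notice an append by another writer.
    st = os.stat(path)
    return Fingerprint(st.st_size, st.st_mtime_ns)


def check_unchanged(path: str, before: Fingerprint) -> None:
    # Called after re-acquiring the lock: any difference means another
    # writer touched the session file while the lock was released.
    if fingerprint(path) != before:
        raise SessionChangedError(
            "session file changed while lock was released"
        )
```

If two concurrent operations both release and re-acquire around the same session file, whichever one returns second would trip this check, which would match the pattern of simultaneous failures on high-volume agents.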
Current Workaround
We've deployed a cron-based auto-cleanup sniffer:
- `openclaw sessions cleanup` on P1/P0 alerts

This mitigates the immediate impact but does not address the root cause.
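
For readers who want to reproduce the workaround, a sniffer of this shape could look roughly like the Python sketch below. It assumes the alert surfaces as a log line containing the `file lock stale` text quoted above; the function names and the log-scanning approach are illustrative assumptions, not a description of our actual cron job. The only external command used is `openclaw sessions cleanup`, with no extra flags.

```python
import subprocess
from typing import Callable, Iterable

STALE_MARKER = "file lock stale"                   # error text quoted in this issue
CLEANUP_CMD = ["openclaw", "sessions", "cleanup"]  # command named in this issue


def saw_stale_lock(log_lines: Iterable[str]) -> bool:
    """Return True if any log line mentions the stale-lock error."""
    return any(STALE_MARKER in line for line in log_lines)


def sniff_and_cleanup(
    log_lines: Iterable[str],
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Run the cleanup command once if a stale-lock line is present.

    `run` is injectable so the logic can be exercised without the CLI.
    Returns True when cleanup was invoked.
    """
    if not saw_stale_lock(log_lines):
        return False
    run(CLEANUP_CMD, check=True)  # raises CalledProcessError on non-zero exit
    return True
```

Invoked from cron with the most recent gateway log lines, this runs cleanup at most once per tick, which keeps the workaround from hammering the session store.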
Requested Feature
Auto-cleanup on lock staleness detection:
- `openclaw sessions cleanup`) before retrying the operation

Benefits:
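
As a rough illustration of the requested cleanup-then-retry behavior, the Python sketch below wraps an operation so that a stale-lock failure triggers one cleanup pass before the retry. `StaleLockError`, `with_auto_cleanup`, and the `max_cleanups` bound are hypothetical names chosen for this example; how the runtime actually signals staleness is for maintainers to decide.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class StaleLockError(RuntimeError):
    """Hypothetical stand-in for the runtime's 'file lock stale' failure."""


def with_auto_cleanup(
    operation: Callable[[], T],
    cleanup: Callable[[], None],
    max_cleanups: int = 1,
) -> T:
    """Run `operation`; on a stale-lock failure, clean up and retry.

    At most `max_cleanups` cleanup passes are attempted, so a lock that
    keeps reporting stale still surfaces as an error instead of looping.
    """
    cleanups = 0
    while True:
        try:
            return operation()
        except StaleLockError:
            if cleanups >= max_cleanups:
                raise
            cleanup()
            cleanups += 1
```

Bounding the number of cleanup passes matters here: if the staleness report is itself wrong, unbounded retries would hide the underlying bug rather than expose it.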
Alternative: Root Cause Fix
If the lock-staleness detection itself is buggy (i.e., the lock is NOT actually stale, but the runtime incorrectly thinks it is), then the root cause is in the lock validation logic. In that case:
- `EmbeddedAttemptSessionTakeoverError` path
- `sessions_send` has a race window

We're happy to provide additional forensic data (logs, stack traces, timing) if that helps diagnose the root cause.
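
To illustrate what a stricter validation could look like, the Python sketch below treats a lock as stale only when its owner process is gone, and falls back to an age threshold only when no owner PID is recorded. The function names, the parameters, and the idea that the lock records an owner PID are assumptions made for this example. The liveness probe uses `os.kill(pid, 0)`, which is POSIX-only.

```python
import os
from typing import Callable, Optional


def pid_is_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def lock_is_stale(
    owner_pid: Optional[int],
    age_seconds: float,
    max_age_seconds: float,
    is_alive: Callable[[int], bool] = pid_is_alive,
) -> bool:
    """Decide staleness from owner liveness first, age second.

    A lock held by a live process is never reported stale, however old.
    Age is consulted only when the owner PID is unknown.
    """
    if owner_pid is not None:
        return not is_alive(owner_pid)
    return age_seconds > max_age_seconds
```

A PID check alone cannot distinguish the original owner from an unrelated process that reused the same PID, so a real implementation would likely need an extra identity token as well.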
Impact
Severity: P1 (fleet-degrading, not P0 because manual mitigation exists)
Frequency: 3-4 instances in 3 days across 10+ agents
Affected workflows:
- `sessions_send`)
- `sessions_spawn`)

Environment
Note: This issue is filed in parallel with our cron-based workaround deployment (Item A, Path 2 from our internal substrate repair commission). We're requesting the upstream feature so that the cron workaround can be removed once the feature is stable.