Session lock auto-cleanup on staleness detection #87779

@todd-chisel

Summary

Session JSONL lock files (.jsonl.lock) can be flagged as stale even while the gateway process that holds them is alive and actively working, causing "file lock stale" errors in the sessions_send / sessions_spawn paths. Recovery currently requires running openclaw sessions cleanup by hand.

Background

Over the past 3 days (2026-05-26 to 2026-05-28), the OpenClaw gateway has experienced recurring "file lock stale" substrate failures affecting multiple agents simultaneously. The pattern:

  • Lock files for session JSONL files are flagged as "stale" by the runtime
  • The gateway PID is alive and actively working
  • No orphaned .lock files exist on disk at failure time
  • Logs show EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released

This appears to be a race condition in embedded-session lock management during concurrent sessions_send / sub-agent spawn operations, particularly affecting high-volume agents (woodhouse, fleet-ops, PMs).

Evidence:

Full forensic: [internal audit doc available on request]

Current Workaround

We've deployed a cron-based auto-cleanup sniffer:

  • Runs every 5 minutes
  • Scans for locks >5 minutes old with no active process
  • Auto-executes openclaw sessions cleanup on P1/P0 alerts
  • Alerts operators via inbox

This mitigates the immediate impact but does not address the root cause.
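
For reference, the scan step of the sniffer can be sketched roughly as below. This is a minimal illustration, not our production script: the function names are ours, and it ASSUMES each lock file body is the holder's PID as decimal text, which we have not confirmed against the OpenClaw lock format. The liveness probe is POSIX-only.

```python
import os
import time
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX signal-0 probe)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def find_stale_locks(root, max_age_s=300.0, now=None, is_alive=pid_is_alive):
    """Return lock files older than max_age_s whose recorded holder is gone."""
    now = time.time() if now is None else now
    stale = []
    for lock in sorted(Path(root).rglob("*.jsonl.lock")):
        try:
            age = now - lock.stat().st_mtime
            text = lock.read_text().strip()
        except FileNotFoundError:
            continue  # lock was released between listing and inspection
        if age <= max_age_s:
            continue  # too young to be considered
        # ASSUMPTION: the lock file body is the holder's PID as decimal text.
        holder = int(text) if text.isdigit() and int(text) > 0 else None
        if holder is None or not is_alive(holder):
            stale.append(lock)
    return stale
```

A lock held by a live PID is never reported, regardless of age, which is why this scan alone does not catch the failures described above.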

Requested Feature

Auto-cleanup on lock staleness detection:

  1. When the runtime detects a stale lock internally, automatically run cleanup (equivalent to openclaw sessions cleanup) before retrying the operation
  2. Log the auto-cleanup event for forensics
  3. Optional: expose a config flag to disable auto-cleanup if operators want manual control
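
To make the request concrete, here is a rough sketch of the behavior we have in mind. Everything in it is hypothetical: StaleLockError stands in for whatever the runtime raises internally, cleanup stands in for the equivalent of openclaw sessions cleanup, and auto_cleanup is the proposed config flag. It is meant to show the shape of the retry, not to suggest how the runtime is structured.

```python
import logging

log = logging.getLogger("session_lock")


class StaleLockError(RuntimeError):
    """Hypothetical stand-in for the runtime's 'file lock stale' failure."""


def with_auto_cleanup(operation, cleanup, *, auto_cleanup=True, max_retries=1):
    """Run operation(); on a stale-lock failure, clean up and retry.

    operation    -- zero-argument callable, e.g. a wrapped sessions_send
    cleanup      -- zero-argument callable performing the session cleanup
    auto_cleanup -- proposed config flag; False keeps today's manual flow
    max_retries  -- bound on cleanup-and-retry cycles to avoid looping
    """
    attempt = 0
    while True:
        try:
            return operation()
        except StaleLockError as exc:
            if not auto_cleanup or attempt >= max_retries:
                raise  # operator wants manual control, or retries exhausted
            attempt += 1
            # Item 2 of the request: leave a trace for forensics.
            log.warning(
                "stale lock detected (%s); auto-cleanup, retry %d/%d",
                exc, attempt, max_retries,
            )
            cleanup()
```

Bounding the retries matters here: if the staleness verdict is itself wrong (see the alternative below), an unbounded loop would just hide the bug.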

Benefits:

  • Eliminates the 5-15 minute window between lock staleness and cron-driven cleanup
  • Reduces fleet-wide stuck-but-alive agent occurrences
  • Provides immediate self-healing for transient lock races

Alternative: Root Cause Fix

If the lock-staleness detection itself is buggy (i.e., the lock is NOT actually stale, but the runtime incorrectly thinks it is), then the root cause is in the lock validation logic. In that case:

  1. Review the EmbeddedAttemptSessionTakeoverError path
  2. Check if lock acquisition/release during sessions_send has a race window
  3. Ensure lock validation accounts for concurrent operations on the same session file
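
As a starting point for step 2, the error text suggests the runtime compares the session file before and after a window in which the lock is released. The sketch below shows one way such a check could work, so the race is easier to discuss. It is our guess at the pattern, not OpenClaw's code, and all names are hypothetical.

```python
import os
from typing import NamedTuple


class Fingerprint(NamedTuple):
    """Cheap snapshot of a session file's on-disk state."""
    size: int
    mtime_ns: int


def fingerprint(path) -> Fingerprint:
    """Capture size and modification time of the session file."""
    st = os.stat(path)
    return Fingerprint(st.st_size, st.st_mtime_ns)


def changed_while_unlocked(before: Fingerprint, path) -> bool:
    """True if the file differs from the snapshot taken before release."""
    return fingerprint(path) != before
```

If the check looks like this, a concurrent sessions_send that appends to the same JSONL file during the unlocked window would trip it even though no lock is actually stale. In that case re-reading the file and continuing may be safer than reporting a takeover.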

We're happy to provide additional forensic data (logs, stack traces, timing) if that helps diagnose the root cause.

Impact

Severity: P1 (fleet-degrading, not P0 because manual mitigation exists)

Frequency: 3-4 instances in 3 days across 10+ agents

Affected workflows:

  • Inter-agent messaging (sessions_send)
  • Sub-agent spawning (sessions_spawn)
  • Heartbeat coordination (agents appear "stuck but alive")

Environment

  • OpenClaw version: [current production, can provide specific version if needed]
  • Gateway uptime at first occurrence: 347h
  • Gateway uptime at recurrence: <2h (a restart does not permanently resolve the issue)
  • Affected session types: DM sessions (high concurrency), heartbeat sessions

Note: This issue is filed in parallel with our cron-based workaround deployment (Item A, Path 2 from internal substrate repair commission). We're requesting the upstream feature to enable eventual removal of the cron workaround once the feature is stable.

Metadata

Labels

  • P1 - High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-live-repro - ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review - ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit - Good issue quality with a plausible reproduction path needing some confirmation.
