Session write lock timeout freezes gateway when session file is large (>300KB) #85762
Closed
Labels
- P1: High-priority user-facing bug, regression, or broken workflow.
- clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
- clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
- impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
- impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
- issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
Summary
When a session transcript grows large (300KB+), writing to the session file can take longer than the default lock acquisition timeout (60 seconds). This causes repeated `SessionWriteLockTimeoutError` failures, which leave the gateway unresponsive until it is manually restarted.
Environment
Steps to Reproduce
Expected Behavior
If a process holds a session write lock for longer than `maxHoldMs` (currently 5 minutes), other processes should be able to reclaim the lock during acquisition.
Actual Behavior
The `shouldReclaim` callback in `acquireSessionWriteLock` checks six stale conditions (missing PID, dead PID, recycled PID, too old per `staleMs`, non-openclaw owner, orphan self-lock) but does not check `maxHoldMs`. So even when a live process holds the lock for well over 5 minutes, other processes wait 60 seconds and then throw `SessionWriteLockTimeoutError`.
Log evidence:
The watchdog timer (`runLockWatchdogCheck`) does enforce `maxHoldMs`, but it only runs every 60 seconds and is a separate mechanism from lock acquisition. The acquisition timeout fires before the watchdog can act.
Root Cause
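The timing gap between the acquisition timeout and the watchdog can be illustrated with a rough model. This is a sketch only, not the project's code: the type and function names are hypothetical, and it assumes watchdog ticks fire at multiples of the interval measured from lock creation and that a waiter can proceed once the watchdog has enforced `maxHoldMs`. The numbers are the defaults quoted in this issue.

```typescript
// Simplified timing model. All names are hypothetical, not the project's API.
interface LockTimings {
  acquireTimeoutMs: number;   // how long a waiter retries before giving up
  maxHoldMs: number;          // hold limit, enforced only by the watchdog
  watchdogIntervalMs: number; // how often the watchdog check runs
}

// Earliest watchdog tick at which a lock created at t=0 has reached maxHoldMs.
// Assumption: ticks fire at multiples of the interval, starting at lock creation.
function earliestWatchdogReleaseMs(t: LockTimings): number {
  return Math.ceil(t.maxHoldMs / t.watchdogIntervalMs) * t.watchdogIntervalMs;
}

// A waiter that starts at waiterStartMs (relative to lock creation) times out
// if its deadline arrives before the watchdog has had a chance to act.
function waiterTimesOut(t: LockTimings, waiterStartMs: number): boolean {
  return waiterStartMs + t.acquireTimeoutMs < earliestWatchdogReleaseMs(t);
}

const defaults: LockTimings = {
  acquireTimeoutMs: 60_000,
  maxHoldMs: 5 * 60_000,
  watchdogIntervalMs: 60_000,
};

console.log(waiterTimesOut(defaults, 0));       // true: deadline at 60s, watchdog acts at 300s
console.log(waiterTimesOut(defaults, 200_000)); // true: deadline at 260s
console.log(waiterTimesOut(defaults, 250_000)); // false: deadline at 310s
```

Under these assumptions, any waiter that starts within roughly the first four minutes of a long hold gives up before the watchdog can release the lock.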
In `src/agents/session-write-lock.ts`, the `shouldReclaim` callback inside `acquireSessionWriteLock` never consults `maxHoldMs`. The `maxHoldMs` value is stored as metadata but is used only by the background watchdog, not during lock acquisition.
The relevant `shouldReclaim` logic uses only `staleMs` (30-minute default), never `maxHoldMs` (5-minute default). A lock could be held for 10 minutes by a live process, and `shouldReclaim` would still return `false` because the PID is alive and the lock's age has not exceeded `staleMs`.
Workaround
Set environment variables to increase timeout and reduce staleness threshold:
Proposed Fix
In `shouldReclaim`, also enforce `maxHoldMs`: if `nowMs - createdAt` exceeds `maxHoldMs`, treat the lock as reclaimable (add "hold-exceeded" to `staleReasons`). This makes the acquisition path respect the same `maxHoldMs` constraint that the watchdog already enforces.
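A minimal sketch of what this check might look like follows. It is not the actual code in `src/agents/session-write-lock.ts`: the `LockInfo` and `ReclaimOptions` shapes, the `collectStaleReasons` helper, and every reason string other than "hold-exceeded" are assumptions made for illustration, and only three of the six existing stale conditions are modeled.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface LockInfo {
  pid: number | null;  // owner PID recorded in the lock file, if any
  createdAtMs: number; // when the lock was acquired
}

interface ReclaimOptions {
  staleMs: number;                      // 30-minute default per the issue
  maxHoldMs: number;                    // 5-minute default per the issue
  isPidAlive: (pid: number) => boolean; // injected so the sketch is testable
}

function collectStaleReasons(
  lock: LockInfo,
  nowMs: number,
  opts: ReclaimOptions,
): string[] {
  const staleReasons: string[] = [];
  const ageMs = nowMs - lock.createdAtMs;

  // A subset of the existing conditions (reason names are assumed).
  if (lock.pid === null) staleReasons.push("missing-pid");
  else if (!opts.isPidAlive(lock.pid)) staleReasons.push("dead-pid");
  if (ageMs > opts.staleMs) staleReasons.push("too-old");

  // Proposed addition: a live owner that has held the lock past maxHoldMs.
  if (ageMs > opts.maxHoldMs) staleReasons.push("hold-exceeded");

  return staleReasons;
}

function shouldReclaim(lock: LockInfo, nowMs: number, opts: ReclaimOptions): boolean {
  return collectStaleReasons(lock, nowMs, opts).length > 0;
}

const opts: ReclaimOptions = {
  staleMs: 30 * 60_000,
  maxHoldMs: 5 * 60_000,
  isPidAlive: () => true, // pretend the owner is always alive
};

const nowMs = 1_000_000_000;
const heldTenMinutes: LockInfo = { pid: 1234, createdAtMs: nowMs - 10 * 60_000 };
const heldOneMinute: LockInfo = { pid: 1234, createdAtMs: nowMs - 60_000 };

console.log(collectStaleReasons(heldTenMinutes, nowMs, opts)); // [ 'hold-exceeded' ]
console.log(shouldReclaim(heldOneMinute, nowMs, opts));        // false
```

With the extra condition, the 10-minute hold by a live process from the Root Cause section becomes reclaimable at acquisition time, while a 1-minute hold is left alone.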