Bug Report
Version: 2026.5.22 (also present in 2026.5.20)
Platform: Docker (Debian-based), init: true (tini PID 1), Node.js v24.14.0
Summary
The gateway acquires a write lock on a session .jsonl file during an embedded agent run. If the run times out or fails, the lock is never released. All subsequent requests to that session block for 60 seconds waiting on the lock, then fail with SessionWriteLockTimeoutError. The retained session context leaks memory, contributing to monotonic RSS growth.
Reproduction Steps
- Configure an agent with timeoutSeconds: 300 (or any finite timeout)
- Trigger an embedded agent run (e.g., via cron agentTurn payload, or subagent spawn)
- Ensure the run exceeds the timeout, or a tool call inside it errors/stalls
- Observe that the .lock file on the session .jsonl persists indefinitely
- Any subsequent request to the same session fails with SessionWriteLockTimeoutError
Observed Behavior
The lock file remains on disk with pid matching the gateway process. The gateway itself holds the lock — it is not an orphaned child process. The lock's maxHoldMs grows unboundedly.
Log Evidence
2026-05-24T10:10:19.763+00:00 [agent/embedded] embedded run timeout: runId=e295c559-441d-459e-aea1-a5f2268a8a10 sessionId=555b0189-2ff8-483b-87f5-ebab41995342 timeoutMs=300000
2026-05-24T10:11:51.382+00:00 [ws] ⇄ res ✗ agent errorCode=UNAVAILABLE errorMessage=SessionWriteLockTimeoutError: session file locked (timeout 60000ms): pid=7 /root/.openclaw/agents/main/sessions/555b0189-2ff8-483b-87f5-ebab41995342.jsonl.lock: code=OPENCLAW_SESSION_WRITE_LOCK_TIMEOUT
2026-05-24T10:14:01.488+00:00 Embedded agent failed before reply: session file locked (timeout 60000ms): pid=7 /root/.openclaw/agents/main/sessions/555b0189-2ff8-483b-87f5-ebab41995342.jsonl.lock
Lock File Contents (19 minutes after timeout)
{
  "pid": 7,
  "createdAt": "2026-05-24T10:29:15.171Z",
  "maxHoldMs": 1020000,
  "starttime": 76698844
}
Note: pid: 7 is the gateway process itself (PID 1 is tini). The lock is held by the gateway, not an orphaned child.
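For anyone triaging a similar lock, the check above can be scripted. The following is a minimal sketch, assuming the lock file is plain JSON with the four fields shown in this report; the helper names (parseLock, isProcessAlive, lockAgeMs, inspectLockFile) are hypothetical and not part of the gateway, and the meaning of starttime is not documented here, so it is carried along but not interpreted.

```typescript
import { readFileSync } from "node:fs";

// Shape of the lock file as it appears in this report. Field semantics beyond
// the names themselves are assumptions.
interface SessionLock {
  pid: number;
  createdAt: string; // ISO-8601 timestamp
  maxHoldMs: number;
  starttime: number; // not interpreted in this sketch
}

// Hypothetical helper: parse lock file text and check the fields we rely on.
function parseLock(text: string): SessionLock {
  const raw = JSON.parse(text) as Partial<SessionLock>;
  if (typeof raw.pid !== "number" || typeof raw.createdAt !== "string") {
    throw new Error("not a recognizable session lock file");
  }
  return raw as SessionLock;
}

// Signal 0 delivers nothing; it only checks whether the pid can be signalled.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Milliseconds elapsed since the lock's createdAt timestamp.
function lockAgeMs(lock: SessionLock, now: number = Date.now()): number {
  return now - Date.parse(lock.createdAt);
}

// Read a lock file and summarize who holds it and for how long.
function inspectLockFile(path: string): { lock: SessionLock; alive: boolean; ageMs: number } {
  const lock = parseLock(readFileSync(path, "utf8"));
  return { lock, alive: isProcessAlive(lock.pid), ageMs: lockAgeMs(lock) };
}
```

A live holder pid combined with a large age is the signature described in this report: the lock is not orphaned, it is simply never released.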
Impact
- Memory leak: Each stuck lock holds the full session object graph in memory (message history, tool results, thinking blocks). With cron jobs spawning dozens of agentTurn sessions, RSS grows monotonically at 40–160 MB/min until OOM.
- Session unavailability: The locked session stays inaccessible until the container is restarted or the lock file is manually deleted.
- Cascading failures: With runRetries at default (max: 160), each retry attempt against the locked session adds more to the retained graph without releasing the previous attempt.
Root Cause Hypothesis
The embedded run timeout/error handler does not release the session file write lock in its cleanup path. The lock acquisition likely happens in the session write pipeline, and the timeout interrupts execution after the lock is acquired but before the finally block (or equivalent cleanup) runs.
Specifically:
- Gateway acquires write lock on {session}.jsonl.lock
- Embedded run starts (tool calls, model API calls)
- Run times out or tool call fails
- Error propagates but bypasses the lock release path
- Lock file persists on disk, gateway retains in-memory references
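One mechanism consistent with the steps above can be sketched as follows. This is not the gateway's code. It is a hypothetical illustration, with InMemoryLock standing in for the on-disk .lock file and withTimeout, leakyRun, and safeRun invented for the example. In JavaScript a timeout cannot preempt running code, so a finally block around `await runAgent()` only runs once that promise settles. If the timeout is implemented as a race and the run itself stalls, the caller sees a timeout error while the cleanup never executes.

```typescript
// Hypothetical stand-in for the session write lock.
class InMemoryLock {
  held = false;
  acquire(): void {
    if (this.held) throw new Error("session file locked");
    this.held = true;
  }
  release(): void {
    this.held = false;
  }
}

// Reject with a timeout error if `work` has not settled within `ms`.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("embedded run timeout")), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Leaky pattern: the release sits in a finally INSIDE the raced work.
// If runAgent() never settles, that finally never runs and the lock stays held.
async function leakyRun(lock: InMemoryLock, runAgent: () => Promise<void>, ms: number): Promise<void> {
  const guarded = (async () => {
    lock.acquire();
    try {
      await runAgent();
    } finally {
      lock.release();
    }
  })();
  await withTimeout(guarded, ms);
}

// Safer pattern: the finally wraps the race itself, so the lock is released
// whether the run completes, fails, or times out.
async function safeRun(lock: InMemoryLock, runAgent: () => Promise<void>, ms: number): Promise<void> {
  lock.acquire();
  try {
    await withTimeout(runAgent(), ms);
  } finally {
    lock.release();
  }
}
```

With a run that never settles, leakyRun rejects with the timeout error and leaves the lock held, so every retry fails on acquire. safeRun rejects with the same error but frees the lock. Note that releasing the lock does not cancel the stalled work or free what it references; that would need a separate cancellation path (for example an AbortSignal passed into the run).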
Environment Details
- Docker container with init: true (tini as PID 1 for zombie reaping)
- 74 cron jobs using agentTurn payload (each spawns a full LLM session)
- Multiple agents configured in single container
- Gateway PID: 7, Docker memory limit: 6GB
- Host: 2× Xeon Gold 6248 (40C/80T), 125GB RAM
Workaround
- Automated stale lock cleanup (delete .lock files older than 2 minutes via cron)
- Reduce runRetries.max from 160 to 32 (limits churn on stuck sessions)
- Set timeoutSeconds and subagents.runTimeoutSeconds to finite values (prevents unbounded runs)
- Container restart clears all locks and releases retained memory
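The stale lock cleanup in the first workaround item could look roughly like the sketch below. It assumes that staleness is judged by file modification time and that all lock files sit directly in one sessions directory; removeStaleLocks is a hypothetical helper, not a gateway API.

```typescript
import { readdirSync, statSync, unlinkSync } from "node:fs";
import { join } from "node:path";

// Delete *.lock files directly inside `dir` whose mtime is older than maxAgeMs.
// Returns the paths that were removed.
function removeStaleLocks(dir: string, maxAgeMs: number, now: number = Date.now()): string[] {
  const removed: string[] = [];
  for (const name of readdirSync(dir)) {
    if (!name.endsWith(".lock")) continue;
    const path = join(dir, name);
    try {
      if (now - statSync(path).mtimeMs > maxAgeMs) {
        unlinkSync(path);
        removed.push(path);
      }
    } catch (err) {
      // The gateway may release the lock between readdir and stat/unlink.
      if ((err as { code?: string }).code !== "ENOENT") throw err;
    }
  }
  return removed;
}
```

Run against the directory seen in the logs, this would be `removeStaleLocks("/root/.openclaw/agents/main/sessions", 2 * 60_000)`. One caveat: an age-only threshold of 2 minutes is shorter than the 300-second run timeout, so it can also delete the lock of a healthy in-flight run. Whether that is acceptable depends on how the gateway reacts to a lock file disappearing underneath it, which this report does not establish.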
Expected Behavior
When an embedded run times out or fails, the gateway should:
- Release the session file write lock immediately
- Release all in-memory references to the session context
- Log the timeout/failure cleanly
- Allow subsequent requests to the session to proceed normally