
[Bug]: Auto-compaction leaves session JSONL write lock held after timeout, blocking all later Discord turns #84193

@jkoopmann


Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

OpenClaw 2026.5.18 can finish an Anthropic/Opus Discord run, enter post-run auto-compaction, and then leave the session JSONL write lock held after the compaction path times out.

After that, every new request in the same Discord channel session waits 60000ms for the session file lock and fails before the agent can reply:

SessionWriteLockTimeoutError: session file locked (timeout 60000ms)

The only observed recovery was a Gateway restart, which removed the live lock state and allowed the channel to accept requests again.

This appears related to existing session-lock/event-loop/compaction reliability reports, but this reproduction is narrower: a successful Opus run is followed by auto-compaction that holds the same session JSONL lock long enough to make all subsequent channel turns fail with no useful in-channel recovery.

Steps to reproduce

  1. Run OpenClaw Gateway as a user systemd service with Discord enabled.
  2. Use a Discord channel session with Anthropic Opus as the active model.
  3. Start a larger file-producing task so the session crosses the auto-compaction threshold.
  4. Let the assistant finish the requested work.
  5. Observe post-run auto-compaction start for the same session.
  6. Send another user request in the same Discord channel while the compaction path is stuck.
  7. Observe the new request wait for the existing JSONL lock and fail after 60000ms.

Observed reproduction:

  • Discord channel: #mws
  • Session key: agent:main:discord:channel:1506258704541159484
  • Session file: /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl
  • Lock file: /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
  • Run id: 7d2170b5-3733-4439-8451-cad42efa577b

The final assistant answer for the original Opus run was written to the JSONL at about 2026-05-19T13:33:13.804Z. The session lock was created a few milliseconds later by the same Gateway process:

{
  "pid": 963591,
  "createdAt": "2026-05-19T13:33:13.808Z",
  "starttime": 335126537
}

Later user requests in the same channel failed at 13:45, 13:46, and 13:54 UTC while waiting for the same lock.
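To make the timeline concrete, the age of the lock at the first failed request can be worked out from the timestamps in this report. The sketch below only does that arithmetic; the failure timestamp is copied from the first diagnostic log line in the evidence section, and the function name is illustrative.

```typescript
// Timestamps copied from this report: the lock payload and the first
// "lane task error" diagnostic line.
const lockCreatedAt = "2026-05-19T13:33:13.808Z";
const firstFailureAt = "2026-05-19T13:45:17.707+00:00";
const lockWaitTimeoutMs = 60000; // from "timeout 60000ms" in the error text

// Age of the lock, in milliseconds, at a given instant.
function lockAgeMs(createdAt: string, at: string): number {
  return Date.parse(at) - Date.parse(createdAt);
}

const ageMs = lockAgeMs(lockCreatedAt, firstFailureAt);
console.log(ageMs); // 723899, roughly 12 minutes
console.log(ageMs > lockWaitTimeoutMs); // true
```

By the first failure the lock had been held for about twelve times the 60000ms acquisition timeout, so waiting longer on the requesting side would not have helped.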

Expected behavior

Post-run auto-compaction should not leave a live session write lock behind after a timeout or abort. Specifically:

  • compaction releases the session JSONL lock on success, failure, timeout, or cancellation
  • subsequent user turns in the same Discord channel are not blocked by stale in-process compaction state
  • if compaction cannot complete, OpenClaw surfaces a recoverable channel/session error
  • a Gateway restart should not be required to make the channel usable again
  • the stale-lock check should consider both PID and process start time, and should have a cleanup path for locks left by failed compaction
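A minimal sketch of the PID plus start-time check described in the last bullet, assuming the lock payload has the three fields shown earlier in this report. The `StartTimeLookup` callback and the verdict names are hypothetical and do not come from OpenClaw; a real implementation would have to read the process start time from the operating system.

```typescript
// Shape of the lock payload as it appears in this report.
interface LockPayload {
  pid: number;
  createdAt: string;
  starttime: number; // owner's process start time as recorded in the lock
}

// Hypothetical lookup: returns the current start time of a PID,
// or undefined if no such process exists.
type StartTimeLookup = (pid: number) => number | undefined;

type LockVerdict = "owner-gone" | "pid-reused" | "owner-alive";

function classifyLock(lock: LockPayload, lookup: StartTimeLookup): LockVerdict {
  const current = lookup(lock.pid);
  if (current === undefined) return "owner-gone"; // no process has this PID
  if (current !== lock.starttime) return "pid-reused"; // PID now belongs to another process
  return "owner-alive"; // same process still exists
}

// The lock from this report, checked while the Gateway is still running.
const reported: LockPayload = {
  pid: 963591,
  createdAt: "2026-05-19T13:33:13.808Z",
  starttime: 335126537,
};
console.log(classifyLock(reported, () => 335126537)); // "owner-alive"
```

In this incident the owner was the live Gateway process, so a check like this reports `owner-alive`. That is why the bullet also asks for a cleanup path inside the process: a staleness check based only on the owner being gone would not have released this lock.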

Actual behavior

The original Opus run completed useful work and wrote its final assistant output to the session JSONL.

Immediately afterward, auto-compaction acquired the session write lock. The compaction path timed out, but the live Gateway process kept holding the lock. New Discord requests in the same channel then failed before an embedded agent could start or reply.

User-visible result:

  • Discord typing/traffic stops
  • no final or error answer reaches the channel for later requests
  • each new request waits about 60 seconds and fails
  • the channel remains unusable until Gateway restart

OpenClaw version

OpenClaw 2026.5.18

Operating system

Ubuntu

Install method

npm global

Model

claude-opus-4-7

Provider / routing chain

anthropic/claude-opus-4-7 -> OpenClaw embedded run -> Discord channel session -> post-run auto-compaction

Additional provider/model setup details

Anthropic was used through the normal OpenClaw embedded runner path.

The incident happened after Discord group visible replies were switched to automatic delivery to work around a separate message tool argument issue. The write-lock failure is independent of that delivery setting: the failing path is session persistence/auto-compaction, which runs before any later assistant reply can be generated.

The same environment also has separate reports for:

  • Codex app-server turns stalling after item/completed
  • model-generated SendMessage arguments being rejected instead of being normalized to message

Those are distinct symptoms. This report is specifically about the session JSONL lock left behind by post-run auto-compaction.

Logs, screenshots, and evidence

Original compaction timeout signal:


May 19 13:43:45 casper node[963591]:
2026-05-19T13:43:45.077+00:00 [agent/embedded]
embedded run timeout reached during compaction; extending deadline:
runId=7d2170b5-3733-4439-8451-cad42efa577b
sessionId=49e71c56-dcbc-40ab-be04-4a92fd2230be
extraMs=900000



May 19 13:44:14 casper node[963591]:
CommandLaneTaskTimeoutError: Command lane "main" task timed out after 930000ms


Subsequent requests failed waiting for the same session JSONL lock:


May 19 13:45:17 casper node[963591]:
2026-05-19T13:45:17.707+00:00 [diagnostic]
lane task error: lane=main durationMs=61155
error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"



May 19 13:45:17 casper node[963591]:
2026-05-19T13:45:17.712+00:00 [diagnostic]
lane task error: lane=session:agent:main:discord:channel:1506258704541159484 durationMs=61165
error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"



May 19 13:45:17 casper node[963591]:
Embedded agent failed before reply:
session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock


The same pattern repeated:


May 19 13:46:27 ... SessionWriteLockTimeoutError ... 49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
May 19 13:54:04 ... SessionWriteLockTimeoutError ... 49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
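For monitoring, the timeout, owning PID, and lock path can be pulled out of these error lines. The sketch below is written against the message text quoted in this report only; the function name and the regular expression are illustrative, and the exact message format in other OpenClaw versions is an assumption.

```typescript
interface LockTimeoutInfo {
  timeoutMs: number;
  pid: number;
  lockPath: string;
}

// Matches the error text quoted in this report. \s+ tolerates the line
// break between the timeout and the pid in the journal output.
const LOCK_TIMEOUT_RE =
  /session file locked \(timeout (\d+)ms\):\s+pid=(\d+)\s+(\S+\.lock)/;

function parseLockTimeout(text: string): LockTimeoutInfo | null {
  const m = LOCK_TIMEOUT_RE.exec(text);
  if (m === null) return null;
  return {
    timeoutMs: Number(m[1]),
    pid: Number(m[2]),
    lockPath: m[3] ?? "",
  };
}

// The first diagnostic error from this report.
const sample =
  'error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):\n' +
  'pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"';

const info = parseLockTimeout(sample);
console.log(info?.pid, info?.timeoutMs); // 963591 60000
```

Grouping parsed entries by `lockPath` would show that all three failures point at the same lock file and the same owning PID.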


Lock file observed before Gateway restart:


/home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
mtime: 2026-05-19 13:33:13.807249173 +0000
pid: 963591
createdAt: 2026-05-19T13:33:13.808Z


After a Gateway restart, the lock file was gone and the channel could accept new work again:


LOCK_GONE


Related public issues found:

- https://github.com/openclaw/openclaw/issues/43367 mentions session lock timeouts and detached background work in multi-agent orchestration.
- https://github.com/openclaw/openclaw/issues/75882 mentions gateway stalls, lane waits, file lock timeouts, and missed replies.

Neither is an exact match for this post-run auto-compaction lock leak in a single Discord channel session.

Impact and severity

Severity: High / work-blocking.

Impact:

  • the affected Discord channel session becomes unusable
  • every new request waits about 60 seconds and fails before reply
  • users see no actionable recovery message in the channel
  • completed work may exist on disk, but the user receives no reliable completion signal
  • the only practical recovery observed is a Gateway restart

Additional information

Immediate workaround:

  1. Restart the Gateway cleanly.
  2. Verify the affected lock file is gone.
  3. Retry work in the channel only after the lock is cleared.

Operational workarounds until fixed:

  • keep high-context Discord sessions short
  • use fresh channel/session context for large site/build tasks before auto-compaction is likely
  • split large tasks into smaller turns
  • avoid continuing work in a session that is close to compaction/context limits
  • monitor for old *.jsonl.lock files in active session directories
  • do not manually delete a lock while its owning Gateway PID is still alive unless there is strong evidence the lock is stale and the process is no longer using it
  • if the lock owner is the live Gateway process and the channel is blocked, prefer a clean Gateway restart over deleting the lock file
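The last three workarounds can be sketched as a small decision helper. It assumes the caller has already collected each lock file's path, its modification time, and whether the PID recorded in the lock is still alive; the type names, the advice labels, and the five-minute threshold are all illustrative choices, not OpenClaw settings.

```typescript
interface LockFileInfo {
  path: string;
  mtimeMs: number; // file modification time, epoch milliseconds
  ownerAlive: boolean; // whether the PID recorded in the lock still exists
}

type Advice = "ignore" | "restart-gateway" | "review-for-removal";

function adviseOnLock(info: LockFileInfo, nowMs: number, maxAgeMs: number): Advice {
  if (!info.path.endsWith(".jsonl.lock")) return "ignore"; // not a session lock
  if (nowMs - info.mtimeMs <= maxAgeMs) return "ignore"; // young locks are normal
  // Old lock with a live owner: prefer a clean restart over deleting the file.
  // Old lock with no live owner: a human should still confirm it is stale.
  return info.ownerAlive ? "restart-gateway" : "review-for-removal";
}

// The lock from this report, evaluated at the time of the third failure.
const observed: LockFileInfo = {
  path: "/home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock",
  mtimeMs: Date.parse("2026-05-19T13:33:13.807Z"),
  ownerAlive: true,
};
const fiveMinutesMs = 5 * 60 * 1000; // assumed threshold
console.log(adviseOnLock(observed, Date.parse("2026-05-19T13:54:04Z"), fiveMinutesMs));
// "restart-gateway"
```

The helper deliberately never returns a "delete" verdict: even when the owner is gone, it only flags the lock for review, in line with the caution above about manual deletion.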

Suggested upstream fix areas:

  • ensure session write locks are released in finally blocks around compaction
  • add timeout/cancellation cleanup for compaction-held session locks
  • make lock diagnostics identify the owning operation, not only the owning PID
  • surface a user-visible recovery event when compaction blocks a later interactive turn
  • optionally isolate compaction writes from normal interactive turn acquisition so a failed compaction cannot starve new user turns indefinitely
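The first three suggestions can be illustrated with a small sketch: release in a `finally` block, bound the locked operation with a timeout, and record which operation owns the lock. Everything here is hypothetical. `InMemorySessionLock`, `withTimeout`, and `withSessionLock` are stand-ins written for this example and do not reflect OpenClaw's actual lock implementation.

```typescript
// Minimal in-memory stand-in for the session write lock (hypothetical).
class InMemorySessionLock {
  private owner: string | null = null;

  acquire(operation: string): void {
    if (this.owner !== null) {
      throw new Error(`session file locked by ${this.owner}`);
    }
    this.owner = operation; // record the owning operation for diagnostics
  }

  release(): void {
    this.owner = null;
  }

  get heldBy(): string | null {
    return this.owner;
  }
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Runs `op` under the lock. The finally block releases the lock whether
// `op` succeeds, throws, or exceeds the timeout.
async function withSessionLock<T>(
  lock: InMemorySessionLock,
  operation: string,
  timeoutMs: number,
  op: () => Promise<T>,
): Promise<T> {
  lock.acquire(operation);
  try {
    return await withTimeout(op(), timeoutMs);
  } finally {
    lock.release();
  }
}
```

One caveat with this shape: the timeout only stops waiting, it does not stop `op`. A real fix would also need to cancel or fence the timed-out compaction so it cannot keep writing to the JSONL after the lock is released.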

Metadata


    Labels

    • P1: High-priority user-facing bug, regression, or broken workflow.
    • bug: Something isn't working
    • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
    • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
    • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
    • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
