[Bug]: Auto-compaction leaves session JSONL write lock held after timeout, blocking all later Discord turns #84193

### Bug type

Regression (worked before, now fails)

### Beta release blocker

No

### Summary

OpenClaw `2026.5.18` can finish an Anthropic/Opus Discord run, enter post-run auto-compaction, and then leave the session JSONL write lock held after the compaction path times out.

After that, every new request in the same Discord channel session waits `60000ms` for the session file lock and fails before the agent can reply:

```text
SessionWriteLockTimeoutError: session file locked (timeout 60000ms)
```

The only observed recovery was a Gateway restart, which removed the live lock state and allowed the channel to accept requests again.

This appears related to existing session-lock/event-loop/compaction reliability reports, but this reproduction is narrower: a successful Opus run is followed by auto-compaction that holds the same session JSONL lock long enough to make all subsequent channel turns fail with no useful in-channel recovery.

### Steps to reproduce

1. Run OpenClaw Gateway as a user systemd service with Discord enabled.
2. Use a Discord channel session with Anthropic Opus as the active model.
3. Start a larger file-producing task so the session crosses the auto-compaction threshold.
4. Let the assistant finish the requested work.
5. Observe post-run auto-compaction start for the same session.
6. Send another user request in the same Discord channel while the compaction path is stuck.
7. Observe the new request wait for the existing JSONL lock and fail after `60000ms`.
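
Step 7 can be pictured as a bounded wait on the lock. The sketch below is a minimal illustration, not OpenClaw's implementation: `acquireWithTimeout`, the `isLocked` callback, and the polling interval are hypothetical, and the error class is only a stand-in named after the error in the logs. It shows why every waiter fails after the full timeout when the holder never releases.

```typescript
// Stand-in for the error seen in the logs (hypothetical class, not OpenClaw's).
class SessionWriteLockTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`session file locked (timeout ${timeoutMs}ms)`);
    this.name = "SessionWriteLockTimeoutError";
  }
}

// Polls `isLocked` until the lock is free or the deadline passes.
// `isLocked` and `pollMs` are assumptions made for this sketch.
async function acquireWithTimeout(
  isLocked: () => boolean,
  timeoutMs: number,
  pollMs = 5,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (isLocked()) {
    if (Date.now() >= deadline) {
      throw new SessionWriteLockTimeoutError(timeoutMs);
    }
    // Sleep briefly before checking the lock again.
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
}
```

If the holder never releases, as in this report, a call such as `acquireWithTimeout(() => true, 60000)` rejects only after the full 60 seconds, which matches the roughly 61-second lane durations in the logs.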

Observed reproduction:

- Discord channel: `#mws`
- Session key: `agent:main:discord:channel:1506258704541159484`
- Session file: `/home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl`
- Lock file: `/home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock`
- Run id: `7d2170b5-3733-4439-8451-cad42efa577b`

The final assistant answer for the original Opus run was written to the JSONL around `2026-05-19T13:33:13.804Z`. The session lock file was created a few milliseconds later by the same Gateway process:

```json
{
  "pid": 963591,
  "createdAt": "2026-05-19T13:33:13.808Z",
  "starttime": 335126537
}
```

Later user requests in the same channel failed at `13:45`, `13:46`, and `13:54` UTC while waiting for the same lock.

### Expected behavior

Post-run auto-compaction should not leave a live session write lock behind after timeout or abort.

Expected behavior:

- compaction releases the session JSONL lock on success, failure, timeout, or cancellation
- subsequent user turns in the same Discord channel are not blocked by stale in-process compaction state
- if compaction cannot complete, OpenClaw surfaces a recoverable channel/session error
- a Gateway restart should not be required to make the channel usable again
- the stale-lock check should consider both PID and process start time, and should have a cleanup path for locks left by failed compaction
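
The last point can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not OpenClaw's stale-lock logic: `classifyLock`, `StartTimeLookup`, and the verdict names are hypothetical, and it assumes the `starttime` field in the lock payload is the owner's process start time (on Linux, field 22 of `/proc/<pid>/stat`, in clock ticks since boot).

```typescript
// Shape of the lock payload shown in this report.
interface LockPayload {
  pid: number;
  createdAt: string; // ISO 8601 timestamp
  starttime: number; // assumed: process start time of the owner
}

// Hypothetical lookup: start time of the live process with this PID,
// or null when no such process exists.
type StartTimeLookup = (pid: number) => number | null;

type LockVerdict = "stale-dead-owner" | "stale-pid-reused" | "live-owner";

// Compares the recorded owner against the process table.
function classifyLock(lock: LockPayload, lookup: StartTimeLookup): LockVerdict {
  const liveStart = lookup(lock.pid);
  if (liveStart === null) {
    return "stale-dead-owner"; // the owning process has exited
  }
  if (liveStart !== lock.starttime) {
    return "stale-pid-reused"; // same PID, but a different process
  }
  return "live-owner"; // the recorded owner is still running
}
```

In this incident the verdict would be `live-owner`, because the Gateway process that created the lock was still running. A PID and start-time check alone therefore cannot clear this lock; the in-process cleanup path is what is missing.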

### Actual behavior

The original Opus run completed useful work and wrote its final assistant output to the session JSONL.

Immediately afterward, auto-compaction held the session write lock. The compaction path timed out, but the lock remained held by the live Gateway process. New Discord requests in the same channel then failed before an embedded agent could start or reply.

User-visible result:

- Discord typing/traffic stops
- no final or error answer reaches the channel for later requests
- each new request waits about 60 seconds and fails
- the channel remains unusable until Gateway restart

### OpenClaw version

OpenClaw 2026.5.18

### Operating system

Ubuntu

### Install method

npm global

### Model

claude-opus-4-7

### Provider / routing chain

anthropic/claude-opus-4-7 -> OpenClaw embedded run -> Discord channel session -> post-run auto-compaction

### Additional provider/model setup details

Anthropic was used through the normal OpenClaw embedded runner path.

The incident happened after changing Discord group visible replies to automatic delivery to work around a separate `message` tool argument issue. The write-lock failure is independent of that delivery setting: the failing path is session persistence/auto-compaction before any later assistant reply can be generated.

The same environment also has separate reports for:

- Codex app-server turns stalling after `item/completed`
- model-generated `SendMessage` arguments being rejected instead of normalized to `message`

Those are distinct symptoms. This report is specifically about the session JSONL lock left behind by post-run auto-compaction.

### Logs, screenshots, and evidence

Original compaction timeout signal:

```text
May 19 13:43:45 casper node[963591]:
2026-05-19T13:43:45.077+00:00 [agent/embedded]
embedded run timeout reached during compaction; extending deadline:
runId=7d2170b5-3733-4439-8451-cad42efa577b
sessionId=49e71c56-dcbc-40ab-be04-4a92fd2230be
extraMs=900000

May 19 13:44:14 casper node[963591]:
CommandLaneTaskTimeoutError: Command lane "main" task timed out after 930000ms
```

Subsequent requests failed waiting for the same session JSONL lock:

```text
May 19 13:45:17 casper node[963591]:
2026-05-19T13:45:17.707+00:00 [diagnostic]
lane task error: lane=main durationMs=61155
error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"

May 19 13:45:17 casper node[963591]:
2026-05-19T13:45:17.712+00:00 [diagnostic]
lane task error: lane=session:agent:main:discord:channel:1506258704541159484 durationMs=61165
error="SessionWriteLockTimeoutError: session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock"

May 19 13:45:17 casper node[963591]:
Embedded agent failed before reply:
session file locked (timeout 60000ms):
pid=963591 /home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
```

The same pattern repeated:

```text
May 19 13:46:27 ... SessionWriteLockTimeoutError ... 49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
May 19 13:54:04 ... SessionWriteLockTimeoutError ... 49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
```

Lock file observed before Gateway restart:

```text
/home/casper/.openclaw/agents/main/sessions/49e71c56-dcbc-40ab-be04-4a92fd2230be.jsonl.lock
mtime: 2026-05-19 13:33:13.807249173 +0000
pid: 963591
createdAt: 2026-05-19T13:33:13.808Z
```

After a Gateway restart, the lock file was gone and the channel could accept new work again:

```text
LOCK_GONE
```

Related public issues found:

- https://github.com/openclaw/openclaw/issues/43367 mentions session lock timeouts and detached background work in multi-agent orchestration.
- https://github.com/openclaw/openclaw/issues/75882 mentions gateway stalls, lane waits, file lock timeouts, and missed replies.

Neither is an exact match for this post-run auto-compaction lock leak in a single Discord channel session.

### Impact and severity

Severity: High / work-blocking.

Impact:

- the affected Discord channel session becomes unusable
- every new request waits about 60 seconds and fails before reply
- users see no actionable recovery message in the channel
- completed work may exist on disk, but the user receives no reliable completion signal
- the only practical recovery observed is a Gateway restart


### Additional information

Immediate workaround:

1. Restart the Gateway cleanly.
2. Verify the affected lock file is gone.
3. Retry work in the channel only after the lock is cleared.

Operational workaround until fixed:

- keep high-context Discord sessions short
- use fresh channel/session context for large site/build tasks before auto-compaction is likely
- split large tasks into smaller turns
- avoid continuing work in a session that is close to compaction/context limits
- monitor for old `*.jsonl.lock` files in active session directories
- do not manually delete a lock while its owning Gateway PID is still alive unless there is strong evidence the lock is stale and the process is no longer using it
- if the lock owner is the live Gateway process and the channel is blocked, prefer a clean Gateway restart over deleting the lock file
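
For the monitoring point, a lock's age can be derived from the `createdAt` field in its payload. The following is a minimal sketch: `lockAgeMs`, `isSuspiciouslyOld`, and the five-minute threshold are illustrative choices for this example, not OpenClaw settings. It only flags a lock for attention and deliberately does not delete anything.

```typescript
// Shape of the lock payload shown in this report.
interface LockPayload {
  pid: number;
  createdAt: string; // ISO 8601 timestamp
  starttime: number;
}

// Illustrative threshold chosen for this sketch, not an OpenClaw setting.
const MAX_LOCK_AGE_MS = 5 * 60 * 1000;

// Age of the lock in milliseconds relative to `nowMs` (epoch milliseconds).
function lockAgeMs(lock: LockPayload, nowMs: number): number {
  return nowMs - Date.parse(lock.createdAt);
}

// Flags a lock that has been held longer than the threshold.
function isSuspiciouslyOld(lock: LockPayload, nowMs: number): boolean {
  return lockAgeMs(lock, nowMs) > MAX_LOCK_AGE_MS;
}

// Values taken from this report: the lock payload and the first failed request.
const reportedLock: LockPayload = {
  pid: 963591,
  createdAt: "2026-05-19T13:33:13.808Z",
  starttime: 335126537,
};
const firstFailureMs = Date.parse("2026-05-19T13:45:17.707Z");

console.log(lockAgeMs(reportedLock, firstFailureMs)); // 723899 (about 12 minutes)
console.log(isSuspiciouslyOld(reportedLock, firstFailureMs)); // true
```

By the time the first later request failed, the lock had already been held for about 12 minutes, so a threshold of a few minutes would have flagged it before users hit the 60-second timeouts.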

Suggested upstream fix areas:

- ensure session write locks are released in `finally` blocks around compaction
- add timeout/cancellation cleanup for compaction-held session locks
- make lock diagnostics identify the owning operation, not only the owning PID
- surface a user-visible recovery event when compaction blocks a later interactive turn
- optionally isolate compaction writes from normal interactive turn acquisition so a failed compaction cannot starve new user turns indefinitely
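
The first three fix areas can be sketched together. This is a hedged illustration of the pattern, not OpenClaw's code: `SessionLock`, `withSessionLock`, and the in-memory lock are hypothetical stand-ins. The sketch records the owning operation, aborts the work on timeout, and releases the lock in `finally` so that success, failure, and timeout all take the same cleanup path.

```typescript
// Minimal in-memory stand-in for a per-session write lock (hypothetical).
class SessionLock {
  private owner: string | null = null;

  acquire(operation: string): void {
    if (this.owner !== null) {
      throw new Error(`session locked by ${this.owner}`);
    }
    this.owner = operation; // record the owning operation, not only a PID
  }

  release(): void {
    this.owner = null;
  }

  get heldBy(): string | null {
    return this.owner;
  }
}

// Runs `work` under the lock and always releases it, whether the work
// resolves, rejects, or exceeds `timeoutMs`.
async function withSessionLock<T>(
  lock: SessionLock,
  operation: string,
  timeoutMs: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  lock.acquire(operation);
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the work to stop writing
      reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
    lock.release(); // runs on success, failure, and timeout
  }
}
```

One caveat with this design: `Promise.race` does not stop the losing promise, so the work must honor the `AbortSignal` and stop touching the session file once aborted. Otherwise releasing the lock on timeout could allow overlapping writes.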
