pi-trajectory-flush: 50MB trajectory file blocks event loop for 25+ minutes after flush timeout #77124

## Summary

When a session accumulates a large trajectory file (50MB+), `pi-trajectory-flush` exceeds its 10s timeout. After the timeout, the flush keeps running in the background and the event loop stays at 100% utilization for 25+ minutes, leaving the gateway unresponsive to new messages.

## Environment

- OpenClaw: 2026.5.2 (installed via npm)
- OS: macOS 15.4, Apple Silicon (arm64)
- Node: v22
- Gateway: local, port 18789
- Model: deepseek/deepseek-v4-flash

## Root Cause Analysis

### The chain of failures

1. **Session accumulates massive trajectory**: A Feishu session processed a wiki reorganization task involving 6,743 files. Each tool output (file lists, directory structures) was recorded as a trajectory event, resulting in a **51MB trajectory file with 749 events**.

2. **Flush exceeds timeout**: At turn end, `pi-trajectory-flush` tries to drain the queued file writer. With 50MB+ of pending writes, it exceeds the **hardcoded 10s timeout** in `runAgentCleanupStep`.

3. **Timeout doesn't abort the flush**: The `Promise.race` in `runAgentCleanupStep` only logs a warning on timeout; the underlying `trajectoryRecorder.flush()` promise is not cancelled and keeps running in the background.

4. **Event loop saturation**: The `safeJsonStringify` serialization + async file write chain blocks the Node.js event loop at 100% utilization, with P99 delays reaching **34,728ms**.

5. **Gateway unresponsive**: New messages arrive as `queued` instead of `immediate`. The session lane remains occupied by cleanup maintenance. Total downtime: **25+ minutes** until forced restart (SIGKILL required).
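
Step 3 can be reproduced in isolation. The sketch below is not OpenClaw's code: `slowFlush`, `timeoutAfter`, and `raceDemo` are hypothetical stand-ins, and the durations are arbitrary. It only illustrates that losing a `Promise.race` does not cancel the losing promise's work.

```javascript
// Hypothetical stand-in for a slow flush; records when it actually finishes.
function slowFlush(state, ms) {
  return new Promise((resolve) => {
    setTimeout(() => {
      state.finished = true;
      resolve("flushed");
    }, ms);
  });
}

// Resolves with a marker value once the timeout elapses.
function timeoutAfter(ms) {
  return new Promise((resolve) => setTimeout(() => resolve("timeout"), ms));
}

async function raceDemo() {
  const state = { finished: false };

  // The timeout wins the race, but nothing tells slowFlush to stop.
  const winner = await Promise.race([slowFlush(state, 50), timeoutAfter(10)]);
  const finishedAtTimeout = state.finished;

  // Wait past the flush duration and observe that it completed anyway.
  await new Promise((resolve) => setTimeout(resolve, 80));
  return { winner, finishedAtTimeout, finishedLater: state.finished };
}
```

Calling `raceDemo()` should report the timeout as the winner while the flush still completes afterwards, which matches the behaviour described in step 3.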

### Code paths involved

- `runAgentCleanupStep` (`attempt.tool-run-context-B2TarhD3.js:440`): Hardcoded 10s timeout, no abort mechanism
- `QueuedFileWriter.flush()` (`runtime-qu4g1jFz.js`): Drains entire promise chain, no backpressure
- `safeJsonStringify` (`safe-json-DCDclho7.js:80`): Synchronous serialization of large event objects
- `createTrajectoryRuntimeRecorder` (`runtime-qu4g1jFz.js:143`): `maxFileBytes=52428800` (50MB cap exists but doesn't prevent large files)

### Key diagnostic logs

```
agent cleanup timed out: runId=ffdf596f-... sessionId=425f129b-... step=pi-trajectory-flush timeoutMs=10000
liveness warning: eventLoopDelayP99Ms=34728.8 eventLoopUtilization=1 active=0 waiting=0 queued=1
liveness warning: eventLoopDelayP99Ms=27799.8 eventLoopUtilization=1 active=0 waiting=0 queued=1  (25 min later, still stalled)
```

### Session data scale

```
agent/main/sessions/: 220MB total
  8f42a5aa-*.trajectory.jsonl: 51MB (749 events)
  7695a95f-*.trajectory.jsonl: 24MB (572 events)
  425f129b-*.trajectory.jsonl: 17MB (480 events)
```
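
For a rough sense of event size, the figures above can be turned into per-event averages. This is a back-of-the-envelope sketch: it assumes the listed sizes are in MiB (1,048,576 bytes), which the listing does not state, and the `sessions` array simply restates the three files above.

```javascript
// Sizes and event counts copied from the listing above.
// Assumption: "MB" in the listing means MiB.
const MIB = 1024 * 1024;
const sessions = [
  { file: "8f42a5aa-*.trajectory.jsonl", sizeMiB: 51, events: 749 },
  { file: "7695a95f-*.trajectory.jsonl", sizeMiB: 24, events: 572 },
  { file: "425f129b-*.trajectory.jsonl", sizeMiB: 17, events: 480 },
];

// Average serialized bytes per trajectory event for each file.
const averages = sessions.map((s) => ({
  file: s.file,
  avgBytes: (s.sizeMiB * MIB) / s.events,
}));

for (const a of averages) {
  console.log(`${a.file}: ~${Math.round(a.avgBytes / 1024)} KiB per event`);
}
```

Under that assumption the averages come out to about 70 KiB, 43 KiB, and 36 KiB per event, so each serialized event is a large object rather than a short log line.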

### Related issues

- #75839 — Same flush timeout, different perspective
- #76340 — Event loop regression tracking
- #77115 — Stuck session ghost with similar event loop symptoms
- #76421 — Gateway timeout after event loop stall

## Proposed solutions

1. **Make cleanup abortable**: Pass an `AbortSignal` to `runAgentCleanupStep` so the flush can be stopped after timeout, rather than continuing in the background.

2. **Streaming/batched writes for trajectory**: Replace per-event `appendFile` with a `WriteStream` that buffers writes and yields the event loop between batches.

3. **Dynamic timeout**: Scale cleanup timeout based on pending queue size (e.g., 10s base + 1s per 100 queued events).

4. **Trajectory rotation**: Start a new trajectory file when the current one exceeds N MB (e.g., 10MB), preventing any single file from growing too large.

5. **Short-term workaround**: Expose `OPENCLAW_TRAJECTORY_CLEANUP_TIMEOUT_MS` env var to allow users to increase the timeout for large sessions.
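
Proposals 1 and 2 could be combined roughly as follows. This is a minimal sketch under stated assumptions, not a patch against OpenClaw: `flushInBatches`, `runAbortableStep`, `yieldToLoop`, and the `sink.write(chunk)` interface are all hypothetical names, and the batch size is arbitrary. The only platform APIs used are `AbortController`, `setTimeout`, and Node's `setImmediate`.

```javascript
// Hypothetical sketch; none of these names are OpenClaw's actual API.

// Yield to the event loop so timers and I/O callbacks can run.
const yieldToLoop = () => new Promise((resolve) => setImmediate(resolve));

// Serialize and write events in batches, checking the abort signal
// between batches. `sink.write(chunk)` stands in for a real file writer.
async function flushInBatches(events, sink, { signal, batchSize = 16 } = {}) {
  let written = 0;
  for (let i = 0; i < events.length; i += batchSize) {
    if (signal?.aborted) break; // stop here instead of draining everything
    const batch = events.slice(i, i + batchSize);
    const chunk = batch.map((e) => JSON.stringify(e)).join("\n") + "\n";
    await sink.write(chunk);
    written += batch.length;
    await yieldToLoop();
  }
  return written; // number of events actually written
}

// Run a cleanup step with a timeout that aborts the step itself,
// rather than only abandoning the wait for it.
async function runAbortableStep(step, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await step(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

A call such as `runAbortableStep((signal) => flushInBatches(events, sink, { signal }), 10000)` would then return the number of events written before the deadline. The design is cooperative: the step must check the signal itself, so a single oversized event can still block for the duration of its own `JSON.stringify` call.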
