[Bug] Bloated session jsonl (444 MB) hangs gateway via String.prototype.replace — diagnose with sample+lsof

# [Bug] Bloated session JSONL (444 MB) blocks event loop via String.prototype.replace — gateway becomes unresponsive on agent message processing

## Summary

A single session jsonl file in `~/.openclaw/agents/main/sessions/` grew unbounded to **444 MB / 157,879 lines** over 2 days. Whenever the agent processed a new message (via Discord OR TUI), the gateway loaded this file's content into a JS string and ran `String.prototype.replace(/regex/, ...)` on it, blocking the main thread indefinitely. The gateway appeared "running" to launchd but was completely unresponsive — `openclaw gateway health` timed out at 10000ms, signals were ignored (event loop blocked), and the only recovery was `kill -9` + archiving the bloated file.

This bug pattern is **not currently filed** as far as I can find, despite likely affecting many users silently. I am submitting it because the diagnostic technique I used (`sample` + `lsof`) appears to be novel for this codebase and the root-cause finding is valuable.

## Environment

- **OpenClaw**: 2026.4.10 (44e5b62)
- **OS**: macOS 26.4 arm64 (Mac mini Apple Silicon, M2 Pro)
- **Node**: 25.6.1 (homebrew)
- **Install**: `/opt/homebrew/lib/node_modules/openclaw` via npm global
- **Service**: macOS LaunchAgent (`~/Library/LaunchAgents/ai.openclaw.gateway.plist`), bind=loopback
- **Channels**: Discord (2 bots), Telegram, WhatsApp (2 accounts)
- **Plugins**: 7 (discord, llm-task, lossless-claw, luca-approval, memory-core, telegram, whatsapp)
- **Memory backend**: was qmd, switched to builtin during debugging (separate issue, kept)
- **Active main agent session count**: 68 (after cleanup)

## Symptoms (the misleading ones that wasted hours of debugging)

The bloated session manifests as several apparently-unrelated symptoms that lead investigators down wrong paths:

1. **`openclaw gateway health` times out at 10000ms** with `Discord: failed (operation aborted)` while Telegram is fast (50ms)
2. **Process state `STAT=R, WCHAN=-`** — actively running on CPU, not blocked on I/O
3. **`SIGTERM` and `SIGINT` are ignored** — kill returns 0 but the process never exits (because the event loop is blocked, the signal handler can never run). Looks identical to a "broken signal trap" bug
4. **Discord bots show "logged in"** in earlier log lines but stop responding to new messages — looks like a Discord connection issue
5. **TUI sometimes works for a single message, then dies**: the visible failure is on Discord, so it looks like a Discord-specific bug, but the TUI is affected too
6. **`launchctl print` shows the process as `state = running`** with `runs = 1` and `last exit code = (never exited)` — looks healthy to launchd
7. **No errors in `gateway.err.log`** — the log just goes silent at the moment of the hang
8. **Health-monitor restart cycle every 5-7 minutes** — looks like a connection issue
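
A cheap pre-check for symptom 2, before reaching for heavier tools, is to poll `ps` a few times and see whether the process is runnable on every poll. This is a minimal sketch: the helper names `is_runnable_stat` and `runnable_polls` are hypothetical (not part of openclaw), and the poll count and one-second interval are arbitrary.

```bash
# Returns 0 if a ps STAT value starts with R (runnable), e.g. "R" or "R+".
is_runnable_stat() {
  case "$1" in
    R*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Polls `ps` POLLS times (default 10) and prints "<runnable>/<polls>".
# A process that is missing or not runnable counts as 0 for that poll.
runnable_polls() {
  local pid="$1" polls="${2:-10}" hits=0 i=0 stat
  while [ "$i" -lt "$polls" ]; do
    stat=$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ' || true)
    if is_runnable_stat "$stat"; then hits=$((hits + 1)); fi
    sleep 1
    i=$((i + 1))
  done
  echo "$hits/$polls"
}
```

A result like `10/10` from `runnable_polls <pid>` is consistent with the `STAT=R` symptom above; an idle gateway should mostly report `S`. It only says the process is busy, not why, so it complements the stack trace rather than replacing it.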

I spent ~6 hours chasing these symptoms down multiple wrong rabbit holes (mDNS cascade #64484, missing bundled deps #59820, QMD sidecar #64351, Discord race #57075, streaming partial races). Each of these is a *real bug*, but none of them was the actual cause of the hangs in this case.

## The smoking gun: `sample` + `lsof`

I want to flag this diagnostic technique because, as far as I can find, **nobody else in the openclaw issue tracker uses it**, and it would have saved me hours.

### Step 1: get a stack trace of the busy-loop process

```bash
sample <pid> 3 -mayDie 2>&1 | head -120
```

For my hung gateway pid (35626), all 2141 samples collected over the 3 seconds showed the same call stack:

```
node::SpinEventLoopInternal
  → uv_run → uv__io_poll → uv__async_io → uv__work_done
    → node::MakeLibuvRequestCallback<uv_fs_s, ...>
      → node::fs::FileHandle::ClosePromise()::$_0::__invoke   ← file just closed
        → node::InternalCallbackScope::Close
          → v8::internal::MicrotaskQueue::PerformCheckpointInternal
            → v8::internal::Execution::TryRunMicrotasks
              → Builtins_RunMicrotasks
                → Builtins_PromiseFulfillReactionJob
                  → Builtins_AsyncFunctionAwaitResolveClosure   ← async fn resumed
                    → Builtins_InterpreterEntryTrampoline       ← JS function
                      → Builtins_StringPrototypeReplace            (1725 / 2141 samples = 80%)
                        → Builtins_RegExpReplace                   (1498 samples)
                          → v8::internal::Runtime_FlattenString    (227 samples)
                            → String::WriteToFlat2 → memmove        (214 samples)
                            → Heap::CollectGarbage → Heap::Scavenge (13 samples — GC under pressure)
                      → Builtins_StringPrototypeLastIndexOf        (387 samples)
                        → String::LastIndexOf → StringMatchBackwards
```

Translation: **80% of CPU time is in `String.prototype.replace` with a regex on a huge string**, and most of the rest (387 samples, about 18%) is in `String.prototype.lastIndexOf`. Before the regex can run, V8 has to flatten the cons-string into one contiguous buffer (214 samples in `memmove`), and the GC fires under memory pressure while it does so. All of this happens immediately after a file-close callback resolves an awaited promise.
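
To check another hung gateway for the same signature without reading the whole report, the two key frames can be grepped out of a saved sample. This is a minimal sketch: `has_regex_hang_signature` is a hypothetical helper name, and the frame names are copied from the trace above, so a different Node/V8 build may spell them differently.

```bash
# Returns 0 if a saved `sample` report contains both frames seen in this
# incident, 1 otherwise.
# Save a report first, e.g.:
#   sample <pid> 3 -mayDie -file /tmp/gateway.sample.txt
has_regex_hang_signature() {
  local report="$1"
  # Both frames must be present: the JS-level replace builtin and the
  # regex replace it dispatches to.
  grep -q 'Builtins_StringPrototypeReplace' "$report" &&
    grep -q 'Builtins_RegExpReplace' "$report"
}
```

A match only says the frames appear somewhere in the report; the sample counts still need a look to confirm they dominate.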

### Step 2: find what file was just read

```bash
lsof -p <pid> | grep -E "sessions|jsonl|memory|workspace"
```

For my pid:
```
node 35626 ... 46r REG 1,17 465681629 ... /Users/.../06082bfe-...jsonl
                                ^^^^^^^^^
                                465 MB read handle on a session jsonl file
```

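**A session jsonl file open as a read handle, 465,681,629 bytes.** That is 465 MB in decimal units or 444 MiB, the same `444M` that `ls -lh` reports in Step 3. That's almost certainly the input string for the regex.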
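
If you don't know which path patterns to grep for, filtering the `lsof` output by size works too. The sketch below assumes the default `lsof` column order (`COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME`) and a command name without spaces; `large_open_files` is a hypothetical helper and the 50 MiB default is arbitrary.

```bash
# Reads `lsof -p <pid>` output on stdin and prints "<size_bytes> <path>"
# for every regular file at or above MIN_BYTES (default 50 MiB).
large_open_files() {
  local min_bytes="${1:-52428800}"
  awk -v min="$min_bytes" '
    $5 == "REG" && $7 + 0 >= min {
      # NAME may contain spaces, so rejoin fields 9..NF.
      name = $9
      for (i = 10; i <= NF; i++) name = name " " $i
      print $7, name
    }'
}
```

Usage: `lsof -p <pid> | large_open_files`. The header row is skipped automatically because its TYPE column is not `REG`.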

### Step 3: confirm with file size scan

```bash
ls -lhS ~/.openclaw/agents/main/sessions/*.jsonl | head -10
```

```
-rw------- 444M Apr 11 11:42 06082bfe-16d8-490e-ad96-3092962fc7ab.jsonl   ← MONSTER
-rw------- 9.9M Mar 26       1b5f4f73-...
-rw------- 8.3M Apr  2       08df5e6f-...
-rw------- 7.8M Apr 11 12:24 c23c07af-...                                  ← current active main session (normal size)
-rw------- 5.3M Apr  9       8082ca14-...
-rw------- 5.1M Apr 10       aded36b2-...
```

**One session file is about 45× larger than the next largest.** The main agent's other sessions max out at 9.9 MB; this one is 444 MB.

## Root cause: how did it get to 444 MB?

The session was created 2026-04-09 00:47 (during my DenchClaw install/cleanup drama, see #59820 thread for context). It was last appended to on 2026-04-11 11:42 (about an hour before I noticed it).

Tail of the file shows the bloat pattern:
```jsonl
{"type":"message",...,"role":"assistant","content":[],"stopReason":"aborted","errorMessage":"Request was aborted"}
{"type":"message",...,"role":"assistant","content":[{"type":"text","text":"Hey, bin wieder da! 🫡 Sorry für die Stille — der Codex OAuth Token war expired und hat alle GPT-5.4 Sub-Agent Calls blockiert..."}]}
```

(The second entry's text is German. In English: "Hey, I'm back! 🫡 Sorry for the silence: the Codex OAuth token had expired and blocked all GPT-5.4 sub-agent calls...")

The pattern:
- Aborted assistant turns (model call timeouts, OAuth expirations, gateway restarts mid-stream), each appended as a separate jsonl line
- Subagent retries that didn't terminate cleanly, each appending its artifacts to the same session file
- Over 2 days this accumulated to 157,879 lines / 444 MB

As far as I can tell there is no rotation, no size limit, and no truncation policy. The session file just grows indefinitely.
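
To put a number on how much of a session file is aborted turns, the lines can be counted directly. This is a minimal sketch: `aborted_ratio` is a hypothetical helper, and the match string is copied from the tail shown above, so it assumes every aborted turn is serialized with exactly that key/value spelling.

```bash
# Prints "<aborted_lines> <total_lines>" for one session jsonl file.
aborted_ratio() {
  local file="$1" aborted total
  # grep -c exits non-zero when the count is 0, hence the `|| true`.
  aborted=$(grep -c '"stopReason":"aborted"' "$file" || true)
  total=$(wc -l < "$file" | tr -d ' ')
  echo "$aborted $total"
}
```

Usage: `aborted_ratio ~/.openclaw/agents/main/sessions/<id>.jsonl`. Line counts say nothing about bytes, so a high ratio hints at the retry pattern but does not by itself show where the bulk of the file size sits.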

## The fix (user-side workaround)

```bash
# 1. Stop the gateway (SIGKILL needed because event loop is blocked, signals can't run)
kill -9 <pid>

# 2. Archive (NOT delete) the monster session — preserve for forensics
mkdir -p ~/.openclaw/archive/sessions-archived-$(date +%Y-%m-%d)
mv ~/.openclaw/agents/main/sessions/06082bfe-*.jsonl* ~/.openclaw/archive/sessions-archived-$(date +%Y-%m-%d)/

# 3. Restart the LaunchAgent
launchctl bootstrap gui/$UID ~/Library/LaunchAgents/ai.openclaw.gateway.plist

# 4. Clean up orphaned entries in sessions.json
openclaw sessions cleanup --store ~/.openclaw/agents/main/sessions/sessions.json --enforce --fix-missing
```

After this, my Discord bot started responding within 1 second (`pong. Ich bin da.`, German for "pong. I'm here."), gateway `STAT` transitioned from `R` (busy) to `S` (sleeping/idle), and `openclaw gateway health` returned in 1108ms with all channels green. **Stable for hours after the fix.**

## Preventive monitoring (works today, no code changes needed)

While waiting for an upstream fix, users can detect the bug *before* it kills the gateway with a simple cron job. This is what I set up on my own machine after recovering from the incident:

**`~/.openclaw/scripts/check-bloated-sessions.sh`**

```bash
#!/bin/bash
# Detects openclaw session jsonl files that have grown beyond a safe size.
# Bloated sessions can hang the gateway via String.prototype.replace on huge strings.

set -uo pipefail

THRESHOLD_MB=50
LOGFILE="$HOME/.openclaw/logs/bloated-sessions.log"
SCAN_ROOT="$HOME/.openclaw/agents"

mkdir -p "$(dirname "$LOGFILE")"
ts=$(date +"%Y-%m-%dT%H:%M:%S%z")

results=$(find "$SCAN_ROOT" -type f -name "*.jsonl" -size +${THRESHOLD_MB}M -exec ls -lh {} \; 2>/dev/null)

if [ -n "$results" ]; then
  count=$(printf '%s\n' "$results" | wc -l | tr -d ' ')
  {
    echo "[$ts] WARNING: found $count session jsonl(s) over ${THRESHOLD_MB}MB (gateway hang risk)"
    printf '%s\n' "$results"
    echo ""
  } >> "$LOGFILE"

  # macOS native notification
  osascript -e "display notification \"Found $count bloated openclaw session(s) over ${THRESHOLD_MB}MB. Check ~/.openclaw/logs/bloated-sessions.log\" with title \"OpenClaw: bloated session detected\"" 2>/dev/null || true

  exit 1
else
  echo "[$ts] OK: no session jsonl over ${THRESHOLD_MB}MB" >> "$LOGFILE"
  exit 0
fi
```

**Crontab entry (daily at 9am):**

```
# OpenClaw bloated session monitor
0 9 * * * $HOME/.openclaw/scripts/check-bloated-sessions.sh
```

The cron job runs **independently of openclaw itself**, so even if the gateway is hung, the check still runs and notifies you. On macOS the `osascript` line produces a native notification banner; on Linux you'd swap that for `notify-send` or whatever your DE uses.

**Why 50 MB is a reasonable threshold:** I checked all my session files after the incident. The other main-agent sessions, all healthy, max out at ~10 MB. Anything over 50 MB is suspicious; anything over 100 MB is almost certainly broken. `find -size +50M` catches the issue while it's still recoverable (smaller files are faster to archive and cause less GC pressure when read).

This is a 30-second setup that would have saved me ~6 hours of debugging if I'd known about the bug class. Recommending all openclaw users do this until upstream ships a real fix.

## Suggested upstream fixes

1. **Hard size limit on session jsonl files** (e.g. 50 MB) with auto-rotation to numbered files:
   ```
   <id>.jsonl       (current, capped)
   <id>.jsonl.001   (rotated)
   <id>.jsonl.002
   ...
   ```
   Sessions could optionally be loaded across all rotation files when needed, but the current "load entire session" code path would only ever see a small file.

2. **`openclaw doctor` should warn on bloated session files**, e.g. anything over 50 MB. Currently doctor checks plist/auth/cron/locks but never validates session file sizes. A simple `find ~/.openclaw/agents -name "*.jsonl" -size +50M` check would catch this in seconds.

3. **The agent processing pipeline should never `String.prototype.replace` on a multi-MB string in a single call**. Either stream/chunk the regex processing, or restrict the regex to only the most-recent N turns.

4. **Failed/aborted retries should not append to the active session file** — they should go to a quarantine file or be discarded. The retry path should not contribute to session bloat.

5. **`openclaw gateway health` should distinguish between "process running and serving" and "process running but event loop blocked"**. The current health check times out and reports `Discord: failed (operation aborted)` which is misleading — Discord is fine, the event loop is blocked. A direct event-loop liveness probe (e.g. `setImmediate` round-trip with a tight timeout) would catch this.

6. **Document `discovery.mdns.mode` and other escape-hatch config keys** publicly. The mDNS kill switch (`mode: "off"`) isn't in any user-facing docs, yet it is the workaround given for issue #64484. The same likely applies to other internal config keys.
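
The rotation scheme from suggestion 1 can be sketched in shell. This only illustrates the size check and the numbered naming, and it assumes nothing is writing to the file at the time (for example, with the gateway stopped); a real fix would belong in the gateway's own session writer. `rotate_if_large` is a hypothetical helper name.

```bash
# If FILE is at least MAX_BYTES, move it to the next free FILE.NNN slot,
# start a fresh empty FILE, and print the rotated path. Otherwise do nothing.
rotate_if_large() {
  local file="$1" max_bytes="$2" size n=1 target
  [ -f "$file" ] || return 0
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -ge "$max_bytes" ] || return 0
  # Find the first unused suffix: .001, .002, ...
  while :; do
    target=$(printf '%s.%03d' "$file" "$n")
    [ -e "$target" ] || break
    n=$((n + 1))
  done
  mv "$file" "$target"
  : > "$file"          # new empty file, created with the current umask
  echo "$target"
}
```

Usage: `rotate_if_large ~/.openclaw/agents/main/sessions/<id>.jsonl 52428800`. Whether openclaw would accept an emptied session file, or needs the rotated parts for context, is not something this sketch answers.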

## Why this matters

This bug is silent. A user installs openclaw, uses it for weeks, and then the gateway "just hangs" for no apparent reason. Restarting helps for a few minutes (until the agent processes another message that triggers the regex). The gateway looks healthy to launchd. Logs go silent at the moment of the hang. Signal handlers stop working.

If you don't know to run `sample` and `lsof` (and most users won't), you'll spend hours chasing wrong rabbit holes — exactly like I did. The presence of the issues #64484 (Bonjour cascade) and #59820 (missing bundled deps) actively misled me because both have the same surface symptoms (gateway hangs, signals ignored, slow health checks, Discord aborts).

The 444 MB session was a *separate, undiagnosed problem* that mimicked the symptoms of those other bugs, and I only found it through stack-trace analysis after fixing the other (real) issues didn't help.

## Related issues (cross-references)

- **#64484** — Bonjour mDNS cascade — different bug, similar symptoms, can co-occur
- **#59820** — Discord provider hangs / missing postinstall bundled deps — different bug, similar symptoms (signal handlers ignored), can co-occur
- **#64351** — QMD sidecar not auto-starting — different bug, can co-occur
- **#57075** — Discord gateway race condition — different bug, similar symptoms
- **#45102** — "Gateway requires frequent restart when processing long tasks" — **possibly the same bug as this one**, reported with less specific symptoms
- **#57534** — sessions.list slow due to inline skillsSnapshot bloat — related: session bloat slowing things down, but different code path
- **#41538** — update.run should flush open session JSONL transcripts before SIGTERM — related: session file lifecycle issues

## Acknowledgments

This investigation took ~6 hours of pair-debugging with an AI assistant (Claude Code via Claude Opus 4.6, separate from the openclaw gateway being debugged). The AI assistant ran the `sample`, `lsof`, and log-analysis steps and identified the regex hang signature. The diagnostic chain was novel — `sample` is not used elsewhere in the openclaw issue tracker as far as I can find.


[Bug] Bloated session jsonl (444 MB) hangs gateway via String.prototype.replace — diagnose with sample+lsof #64767
