[Bug] Bloated session jsonl (444 MB) hangs gateway via String.prototype.replace — diagnose with sample+lsof #64767

@ppronobis

[Bug] Bloated session JSONL (444 MB) blocks event loop via String.prototype.replace — gateway becomes unresponsive on agent message processing

Summary

A single session jsonl file in ~/.openclaw/agents/main/sessions/ grew unbounded to 444 MB / 157,879 lines over 2 days. Whenever the agent processed a new message (via Discord OR TUI), the gateway loaded this file's content into a JS string and ran String.prototype.replace(/regex/, ...) on it, blocking the main thread indefinitely. The gateway appeared "running" to launchd but was completely unresponsive — openclaw gateway health timed out at 10000ms, signals were ignored (event loop blocked), and the only recovery was kill -9 + archiving the bloated file.

This bug pattern is not currently filed as far as I can find, despite likely affecting many users silently. I am submitting it because the diagnostic technique I used (sample + lsof) appears to be novel for this codebase and the root-cause finding is valuable.

Environment

  • OpenClaw: 2026.4.10 (44e5b62)
  • OS: macOS 26.4 arm64 (Mac mini Apple Silicon, M2 Pro)
  • Node: 25.6.1 (homebrew)
  • Install: /opt/homebrew/lib/node_modules/openclaw via npm global
  • Service: macOS LaunchAgent (~/Library/LaunchAgents/ai.openclaw.gateway.plist), bind=loopback
  • Channels: Discord (2 bots), Telegram, WhatsApp (2 accounts)
  • Plugins: 7 (discord, llm-task, lossless-claw, luca-approval, memory-core, telegram, whatsapp)
  • Memory backend: was qmd, switched to builtin during debugging (separate issue, kept)
  • Active main agent session count: 68 (after cleanup)

Symptoms (the misleading ones that wasted hours of debugging)

The bloated session manifests as several apparently-unrelated symptoms that lead investigators down wrong paths:

  1. openclaw gateway health times out at 10000ms with Discord: failed (operation aborted) while Telegram is fast (50ms)
  2. Process state STAT=R, WCHAN=- — actively running on CPU, not blocked on I/O
  3. SIGTERM and SIGINT are ignored — kill returns 0 but the process never exits (because the event loop is blocked, the signal handler can never run). Looks identical to a "broken signal trap" bug
  4. Discord bots show "logged in" in earlier log lines but stop responding to new messages — looks like a Discord connection issue
  5. TUI sometimes works for a single message, then dies. Because the most visible failure is on Discord, this looks like a Discord-specific bug, but the TUI is affected too
  6. launchctl print shows the process as state = running with runs = 1 and last exit code = (never exited) — looks healthy to launchd
  7. No errors in gateway.err.log — the log just goes silent at the moment of the hang
  8. Health-monitor restart cycle every 5-7 minutes — looks like a connection issue
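
Symptom 2 can be checked from a shell before reaching for heavier tools. A minimal sketch, assuming a ps that accepts -o stat= (the stock ps on macOS and Linux both do). The function names are hypothetical and not part of openclaw, and a single R reading proves little on its own, so the sketch takes several snapshots:

```shell
# Print the STAT column for a pid (e.g. "R", "S", "Ss", "R+").
proc_stat() {
  ps -o stat= -p "$1" | tr -d ' '
}

# Succeeds when a STAT value starts with R (runnable / on CPU).
is_on_cpu() {
  case "$1" in
    R*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Take n snapshots one second apart. An idle gateway should mostly show S;
# a blocked event loop keeps showing R.
watch_stat() {
  pid=$1; n=${2:-5}; i=0
  while [ "$i" -lt "$n" ]; do
    s=$(proc_stat "$pid")
    if [ -z "$s" ]; then
      echo "pid $pid not found"
    elif is_on_cpu "$s"; then
      echo "$s on-cpu"
    else
      echo "$s idle-or-waiting"
    fi
    i=$((i + 1))
    sleep 1
  done
}
```

Usage: watch_stat <pid> 5. If every snapshot reports on-cpu while the gateway is supposedly idle, move on to the sample step.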

I spent ~6 hours chasing these symptoms down multiple wrong rabbit holes (mDNS cascade #64484, missing bundled deps #59820, QMD sidecar #64351, Discord race #57075, streaming partial races). Each of these is a real bug, but none was the actual cause of the hangs in this case.

The smoking gun: sample + lsof

I want to flag this diagnostic technique because nobody else in the openclaw issue tracker uses it and it would have saved me hours.

Step 1: get a stack trace of the busy-loop process

sample <pid> 3 -mayDie 2>&1 | head -120

For my hung gateway pid (35626), all 3 seconds of samples (2141 in total) showed the same call stack:

node::SpinEventLoopInternal
  → uv_run → uv__io_poll → uv__async_io → uv__work_done
    → node::MakeLibuvRequestCallback<uv_fs_s, ...>
      → node::fs::FileHandle::ClosePromise()::$_0::__invoke   ← file just closed
        → node::InternalCallbackScope::Close
          → v8::internal::MicrotaskQueue::PerformCheckpointInternal
            → v8::internal::Execution::TryRunMicrotasks
              → Builtins_RunMicrotasks
                → Builtins_PromiseFulfillReactionJob
                  → Builtins_AsyncFunctionAwaitResolveClosure   ← async fn resumed
                    → Builtins_InterpreterEntryTrampoline       ← JS function
                      → Builtins_StringPrototypeReplace            (1725 / 2141 samples = 80%)
                        → Builtins_RegExpReplace                   (1498 samples)
                          → v8::internal::Runtime_FlattenString    (227 samples)
                            → String::WriteToFlat2 → memmove        (214 samples)
                            → Heap::CollectGarbage → Heap::Scavenge (13 samples — GC under pressure)
                      → Builtins_StringPrototypeLastIndexOf        (387 samples)
                        → String::LastIndexOf → StringMatchBackwards

Translation: 80% of the samples (1725 of 2141) are inside String.prototype.replace running a regex on a huge string, and another 18% (387 samples) are in String.prototype.lastIndexOf. Before the regex can run, the cons-string has to be flattened into contiguous memory (214 samples in memmove), and V8's GC fires under memory pressure during the operation. All of this happens immediately after a file-close callback resolves an awaited promise.
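
To check a saved report for the same signature, something like the following can help. It is a sketch only: the helper names are hypothetical, and it just greps for the two builtin symbol names from the trace above rather than parsing the sample report format.

```shell
# Succeeds if a saved sample report contains the hang signature described
# above: String.prototype.replace delegating to the RegExp replace builtin.
has_regex_hang_signature() {
  grep -q 'Builtins_StringPrototypeReplace' "$1" &&
    grep -q 'Builtins_RegExpReplace' "$1"
}

# Integer percentage (rounded down) of samples attributed to one frame.
# Arguments: samples in the frame, total samples.
sample_share() {
  echo $(( $1 * 100 / $2 ))
}
```

Usage: save the report with sample <pid> 3 -mayDie > report.txt 2>&1, then run has_regex_hang_signature report.txt && echo "regex hang signature present". With the counts from this incident, sample_share 1725 2141 prints 80.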

Step 2: find what file was just read

lsof -p <pid> | grep -E "sessions|jsonl|memory|workspace"

For my pid:

node 35626 ... 46r REG 1,17 465681629 ... /Users/.../06082bfe-...jsonl
                                ^^^^^^^^^
                                465 MB read handle on a session jsonl file

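A session jsonl file held open for reading, 465,681,629 bytes. That is about 465 MB in decimal units, or 444 MiB, which is why ls -lh reports the same file as 444M. That's the input string for the regex.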
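
The two size figures in this report (465 MB from lsof, 444M from ls -lh) describe the same file. The conversion can be checked with shell arithmetic; the helper names below are hypothetical:

```shell
# lsof prints the size in bytes; ls -lh uses binary units (1M = 1048576 bytes).
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

# Decimal megabytes (1 MB = 1000000 bytes), as used in the lsof annotation.
bytes_to_mb() {
  echo $(( $1 / 1000000 ))
}
```

bytes_to_mib 465681629 prints 444 and bytes_to_mb 465681629 prints 465 (both rounded down).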

Step 3: confirm with file size scan

ls -lhS ~/.openclaw/agents/main/sessions/*.jsonl | head -10
-rw------- 444M Apr 11 11:42 06082bfe-16d8-490e-ad96-3092962fc7ab.jsonl   ← MONSTER
-rw------- 9.9M Mar 26       1b5f4f73-...
-rw------- 8.3M Apr  2       08df5e6f-...
-rw------- 7.8M Apr 11 12:24 c23c07af-...                                  ← current active main session (normal size)
-rw------- 5.3M Apr  9       8082ca14-...
-rw------- 5.1M Apr 10       aded36b2-...

One session file is roughly 45× larger than the next largest. The main agent's other sessions max out at 9.9 MB; this one is 444 MB.

Root cause: how did it get to 444 MB?

The session was created 2026-04-09 00:47 (during my DenchClaw install/cleanup drama, see #59820 thread for context). Last appended 2026-04-11 11:42 (about an hour before I noticed it).

Tail of the file shows the bloat pattern:

{"type":"message",...,"role":"assistant","content":[],"stopReason":"aborted","errorMessage":"Request was aborted"}
{"type":"message",...,"role":"assistant","content":[{"type":"text","text":"Hey, bin wieder da! 🫡 Sorry für die Stille — der Codex OAuth Token war expired und hat alle GPT-5.4 Sub-Agent Calls blockiert..."}]}

A pattern of:

  • Aborted assistant turns (model call timeouts, OAuth expirations, gateway restarts mid-stream)
  • Each abort is appended as a separate jsonl line
  • Subagent retries that didn't terminate cleanly
  • Each retry appends artifacts to the same session file
  • Over 2 days this accumulated to 157,879 lines / 444 MB

There's no rotation, no size limit, no truncation policy. The session file just grows indefinitely.
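
To see how much of a session file consists of aborted turns, a quick count along these lines should work. It is a sketch that assumes the compact serialization shown in the tail above (no spaces around the colon in "stopReason":"aborted"); the helper names are hypothetical.

```shell
# Count lines that record an aborted assistant turn.
# grep -c exits non-zero on zero matches, so "|| true" keeps the status clean.
count_aborted() {
  grep -c '"stopReason":"aborted"' "$1" || true
}

# Integer percentage (rounded down) of lines that are aborted turns.
aborted_share() {
  total=$(( $(wc -l < "$1") ))
  if [ "$total" -eq 0 ]; then
    echo 0
    return 0
  fi
  aborted=$(count_aborted "$1")
  echo $(( aborted * 100 / total ))
}
```

Both helpers stream the file through grep and wc, so they stay usable on a file that is too large to load comfortably into an editor.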

The fix (user-side workaround)

# 1. Stop the gateway (SIGKILL needed: the event loop is blocked, so SIGTERM/SIGINT handlers never run)
kill -9 <pid>

# 2. Archive (NOT delete) the monster session — preserve for forensics
mkdir -p ~/.openclaw/archive/sessions-archived-$(date +%Y-%m-%d)
mv ~/.openclaw/agents/main/sessions/06082bfe-*.jsonl* ~/.openclaw/archive/sessions-archived-$(date +%Y-%m-%d)/

# 3. Restart the LaunchAgent
launchctl bootstrap gui/$UID ~/Library/LaunchAgents/ai.openclaw.gateway.plist

# 4. Clean up orphaned entries in sessions.json
openclaw sessions cleanup --store ~/.openclaw/agents/main/sessions/sessions.json --enforce --fix-missing

After this, my Discord bot started responding within 1 second ("pong. Ich bin da.", German for "pong. I'm here."), gateway STAT transitioned from R (busy) to S (sleeping/idle), and openclaw gateway health returned in 1108ms with all channels green. Stable for hours after the fix.

Preventive monitoring (works today, no code changes needed)

While waiting for an upstream fix, users can detect the bug before it kills the gateway with a simple cron job. This is what I set up on my own machine after recovering from the incident:

~/.openclaw/scripts/check-bloated-sessions.sh

#!/bin/bash
# Detects openclaw session jsonl files that have grown beyond a safe size.
# Bloated sessions can hang the gateway via String.prototype.replace on huge strings.

set -uo pipefail

THRESHOLD_MB=50
LOGFILE="$HOME/.openclaw/logs/bloated-sessions.log"
SCAN_ROOT="$HOME/.openclaw/agents"

mkdir -p "$(dirname "$LOGFILE")"
ts=$(date +"%Y-%m-%dT%H:%M:%S%z")

results=$(find "$SCAN_ROOT" -type f -name "*.jsonl" -size +${THRESHOLD_MB}M -exec ls -lh {} \; 2>/dev/null)

if [ -n "$results" ]; then
  count=$(printf '%s\n' "$results" | wc -l | tr -d ' ')
  {
    echo "[$ts] WARNING: found $count session jsonl(s) over ${THRESHOLD_MB}MB (gateway hang risk)"
    printf '%s\n' "$results"
    echo ""
  } >> "$LOGFILE"

  # macOS native notification
  osascript -e "display notification \"Found $count bloated openclaw session(s) over ${THRESHOLD_MB}MB. Check ~/.openclaw/logs/bloated-sessions.log\" with title \"OpenClaw: bloated session detected\"" 2>/dev/null || true

  exit 1
else
  echo "[$ts] OK: no session jsonl over ${THRESHOLD_MB}MB" >> "$LOGFILE"
  exit 0
fi

Crontab entry (daily at 9am):

# OpenClaw bloated session monitor
0 9 * * * $HOME/.openclaw/scripts/check-bloated-sessions.sh

The cron runs independently of openclaw itself, so even if the gateway is hung, the check still runs and notifies you. On macOS the osascript line produces a native notification banner; on Linux you'd swap that for notify-send or whatever your DE uses.
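
For a script that should run unchanged on both macOS and Linux, the notification step could be wrapped in a small fallback function. This is a sketch, not part of the script above: the notify name is hypothetical, and it assumes osascript or notify-send is on PATH when available.

```shell
# Try the macOS notifier, then the freedesktop one, then fall back to stderr.
# Note: title and body must not contain double quotes, or the AppleScript
# string built for osascript will be malformed.
notify() {
  title=$1; body=$2
  if command -v osascript >/dev/null 2>&1; then
    osascript -e "display notification \"$body\" with title \"$title\"" \
      2>/dev/null && return 0
  fi
  if command -v notify-send >/dev/null 2>&1; then
    notify-send "$title" "$body" 2>/dev/null && return 0
  fi
  # Last resort: write to stderr, which cron typically mails to the owner
  # when local mail delivery is configured.
  printf '%s: %s\n' "$title" "$body" >&2
}
```

In the monitoring script, the osascript line would then become notify "OpenClaw: bloated session detected" "Found $count bloated openclaw session(s) over ${THRESHOLD_MB}MB".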

Why 50 MB as the threshold: I checked all my session files after the incident. Other healthy main-agent sessions max out at ~10 MB, so on my machine anything over 50 MB is suspicious and anything over 100 MB is almost certainly broken. find -size +50M catches the issue while it's still recoverable: smaller files are faster to archive and put less GC pressure on the gateway if they do get read.
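
The two tiers above can be expressed as a small classifier. This is a sketch with hypothetical helper names; the 50 and 100 MB cut-offs are the ones from my own machine and may need adjusting for other workloads.

```shell
# Size of a file in whole MiB (rounded down).
file_size_mb() {
  echo $(( $(wc -c < "$1") / 1048576 ))
}

# Map a size in MB to the tiers described above.
classify_size_mb() {
  if [ "$1" -gt 100 ]; then
    echo broken
  elif [ "$1" -gt 50 ]; then
    echo suspicious
  else
    echo ok
  fi
}
```

For example, classify_size_mb "$(file_size_mb path/to/session.jsonl)" prints one of ok, suspicious, or broken.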

This is a 30-second setup that would have saved me ~6 hours of debugging if I'd known about the bug class. Recommending all openclaw users do this until upstream ships a real fix.

Suggested upstream fixes

  1. Hard size limit on session jsonl files (e.g. 50 MB) with auto-rotation to numbered files:

    <id>.jsonl       (current, capped)
    <id>.jsonl.001   (rotated)
    <id>.jsonl.002
    ...
    

    Sessions could optionally be loaded across all rotation files when needed, but the current "load entire session" code path would only ever see a small file.

  2. openclaw doctor should warn on bloated session files, e.g. anything over 50 MB. Currently doctor checks plist/auth/cron/locks but never validates session file sizes. A simple find ~/.openclaw/agents -name "*.jsonl" -size +50M check would catch this in seconds.

  3. The agent processing pipeline should never String.prototype.replace on a multi-MB string in a single call. Either stream/chunk the regex processing, or restrict the regex to only the most-recent N turns.

  4. Failed/aborted retries should not append to the active session file — they should go to a quarantine file or be discarded. The retry path should not contribute to session bloat.

  5. openclaw gateway health should distinguish between "process running and serving" and "process running but event loop blocked". The current health check times out and reports Discord: failed (operation aborted) which is misleading — Discord is fine, the event loop is blocked. A direct event-loop liveness probe (e.g. setImmediate round-trip with a tight timeout) would catch this.

  6. Document discovery.mdns.mode and other escape-hatch config keys publicly. The mDNS kill switch (mode: "off") isn't in any user-facing docs, yet it is the workaround given in #64484 (Bonjour/mDNS Advertiser Stuck in Flapping Loop). The same likely applies to other internal config keys.
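
The numbered-file scheme from fix 1 can be illustrated in shell. This only demonstrates the naming and the size cap: the function names are hypothetical, nothing here reflects how openclaw writes sessions today, and rotating the file of a running gateway from outside is probably unsafe. A real fix belongs in the gateway's own session writer.

```shell
# Print the next free rotation name for a file: <file>.001, <file>.002, ...
next_rotation_name() {
  n=1
  while :; do
    candidate=$(printf '%s.%03d' "$1" "$n")
    if [ ! -e "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
    n=$((n + 1))
  done
}

# If the file is larger than cap_bytes, move it to the next rotation name
# and leave an empty file in its place. Otherwise do nothing.
rotate_if_over() {
  file=$1; cap_bytes=$2
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -le "$cap_bytes" ]; then
    return 0
  fi
  mv "$file" "$(next_rotation_name "$file")" && : > "$file"
}
```

With a 50 MB cap, the call would be rotate_if_over <id>.jsonl $((50 * 1048576)), run only while the gateway is stopped.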

Why this matters

This bug is silent. An openclaw user installs it, uses it for weeks, suddenly the gateway "just hangs" and they have no idea why. Restarting helps for a few minutes (until the agent processes another message that triggers the regex). The gateway looks healthy to every monitoring tool. Logs go silent at the moment of the hang. Signal handlers stop working.

If you don't know to run sample and lsof (and most users won't), you'll spend hours chasing wrong rabbit holes — exactly like I did. The presence of the issues #64484 (Bonjour cascade) and #59820 (missing bundled deps) actively misled me because both have the same surface symptoms (gateway hangs, signals ignored, slow health checks, Discord aborts).

The 444 MB session was a separate, undiagnosed problem that mimicked the symptoms of those other bugs, and I only found it through stack-trace analysis after fixing the other (real) issues didn't help.

Related issues (cross-references)

Acknowledgments

This investigation took ~6 hours of pair-debugging with an AI assistant (Claude Code via Claude Opus 4.6, separate from the openclaw gateway being debugged). The AI assistant ran the sample, lsof, and log-analysis steps and identified the regex hang signature. The diagnostic chain was novel — sample is not used elsewhere in the openclaw issue tracker as far as I can find.
