[Bug] Bloated session JSONL (444 MB) blocks event loop via String.prototype.replace — gateway becomes unresponsive on agent message processing
Summary
A single session jsonl file in ~/.openclaw/agents/main/sessions/ grew unbounded to 444 MB / 157,879 lines over 2 days. Whenever the agent processed a new message (via Discord OR TUI), the gateway loaded this file's content into a JS string and ran String.prototype.replace(/regex/, ...) on it, blocking the main thread indefinitely. The gateway appeared "running" to launchd but was completely unresponsive — openclaw gateway health timed out at 10000ms, signals were ignored (event loop blocked), and the only recovery was kill -9 + archiving the bloated file.
This bug pattern is not currently filed as far as I can find, despite likely affecting many users silently. I am submitting it because the diagnostic technique I used (sample + lsof) appears to be novel for this codebase and the root-cause finding is valuable.
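Environment
/opt/homebrew/lib/node_modules/openclaw via npm global
LaunchAgent (~/Library/LaunchAgents/ai.openclaw.gateway.plist), bind=loopback
Memory backend: was qmd, switched to builtin during debugging (separate issue, kept)
Active main agent session count: 68 (after cleanup)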
Symptoms (the misleading ones that wasted hours of debugging)
The bloated session manifests as several apparently-unrelated symptoms that lead investigators down wrong paths:
openclaw gateway health times out at 10000ms with Discord: failed (operation aborted) while Telegram is fast (50ms)
Process state STAT=R, WCHAN=- — actively running on CPU, not blocked on I/O
SIGTERM and SIGINT are ignored — kill returns 0 but the process never exits (because the event loop is blocked, the signal handler can never run). Looks identical to a "broken signal trap" bug
Discord bots show "logged in" in earlier log lines but stop responding to new messages — looks like a Discord connection issue
TUI sometimes works for a single message then dies — looks like a Discord-specific bug because the visible failure is on Discord but TUI is also affected
launchctl print shows the process as state = running with runs = 1 and last exit code = (never exited) — looks healthy to launchd
No errors in gateway.err.log — the log just goes silent at the moment of the hang
Health-monitor restart cycle every 5-7 minutes — looks like a connection issue
I spent ~6 hours chasing these symptoms across multiple wrong rabbit holes (mDNS cascade #64484, missing bundled deps #59820, QMD sidecar #64351, Discord race #57075, streaming partial races) — each of which is a real bug but none of which were the actual cause of the hangs in this case.
The smoking gun: sample + lsof
I want to flag this diagnostic technique because nobody else in the openclaw issue tracker uses it and it would have saved me hours.
Step 1: get a stack trace of the busy-loop process
sample <pid> 3 -mayDie 2>&1 | head -120
For my hung gateway pid (35626), the entire 3 seconds of samples (2141 frames) showed the same call stack:
Translation: 80% of CPU time is in String.prototype.replace with a regex on a huge string. The cons-string has to be flattened (via memmove, 214 samples' worth), and V8 GC fires under memory pressure during the operation. This happens immediately after a file close callback resolves an awaited promise.
Step 2: find what file was just read
For my pid, lsof showed a session jsonl file open as a read handle, 465 MB. That's the input string for the regex.
Step 3: confirm with file size scan
One session file is 50× larger than every other. The main agent's other sessions max out at 9.9 MB. This one is 444 MB.
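A size scan along these lines is enough to spot the outlier. This is a minimal sketch, not the exact command from the investigation: the function name largest_sessions is hypothetical, and it assumes session files end in .jsonl somewhere under the directory you pass in. Sizes are printed in bytes, largest first.

```shell
#!/bin/bash
# Hypothetical helper: list the N largest *.jsonl files under a directory,
# largest first, as "<bytes><TAB><path>". Uses wc -c so the number is the
# apparent file size, not disk usage.
largest_sessions() {
  local root="$1" n="${2:-5}"
  find "$root" -type f -name '*.jsonl' -print0 2>/dev/null \
    | while IFS= read -r -d '' f; do
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
      done \
    | sort -rn \
    | head -n "$n"
}

# Example: largest_sessions "$HOME/.openclaw/agents" 5
```

If the first row is one or two orders of magnitude above the rest, that file is the likely input to the blocking regex.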
Root cause: how did it get to 444 MB?
The session was created 2026-04-09 00:47 (during my DenchClaw install/cleanup drama, see #59820 thread for context). Last appended 2026-04-11 11:42 (about an hour before I noticed it).
Tail of the file shows the bloat pattern:
{"type":"message",...,"role":"assistant","content":[],"stopReason":"aborted","errorMessage":"Request was aborted"}
{"type":"message",...,"role":"assistant","content":[{"type":"text","text":"Hey, bin wieder da! 🫡 Sorry für die Stille — der Codex OAuth Token war expired und hat alle GPT-5.4 Sub-Agent Calls blockiert..."}]}
The pattern: each retry appends artifacts to the same session file, and over 2 days this accumulated to 157,879 lines / 444 MB.
There's no rotation, no size limit, no truncation policy. The session file just grows indefinitely.
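To check whether a suspect file shows the same retry pattern, one option is to count aborted turns against the total line count. This is a rough sketch: aborted_ratio is a hypothetical name, and the grep pattern assumes the JSON is serialized without spaces, exactly as in the tail shown above.

```shell
#!/bin/bash
# Hypothetical helper: print how many lines of a session jsonl are aborted
# assistant turns, next to the total number of lines in the file.
aborted_ratio() {
  local file="$1" total aborted
  total=$(wc -l < "$file" | tr -d ' ')
  # grep -c prints 0 but exits non-zero when nothing matches; keep the 0.
  aborted=$(grep -c '"stopReason":"aborted"' "$file" || true)
  echo "$aborted aborted of $total lines"
}

# Example: aborted_ratio ~/.openclaw/agents/main/sessions/<session>.jsonl
```

A high share of aborted lines would support the retry-loop explanation; a low share suggests the bloat comes from somewhere else.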
The fix (user-side workaround)
# 1. Stop the gateway (SIGKILL needed because event loop is blocked, signals can't run)
kill -9 <pid>
# 2. Archive (NOT delete) the monster session; preserve for forensics
mkdir -p ~/.openclaw/archive/sessions-archived-$(date +%Y-%m-%d)
mv ~/.openclaw/agents/main/sessions/06082bfe-*.jsonl* ~/.openclaw/archive/sessions-archived-$(date +%Y-%m-%d)/
# 3. Restart the LaunchAgent
launchctl bootstrap gui/$UID ~/Library/LaunchAgents/ai.openclaw.gateway.plist
# 4. Clean up orphaned entries in sessions.json
openclaw sessions cleanup --store ~/.openclaw/agents/main/sessions/sessions.json --enforce --fix-missing
After this, my Discord bot started responding within 1 second (pong. Ich bin da.), gateway STAT transitioned from R (busy) to S (sleeping/idle), and openclaw gateway health returned in 1108ms with all channels green. Stable for hours after the fix.
Preventive monitoring (works today, no code changes needed)
While waiting for an upstream fix, users can detect the bug before it kills the gateway with a simple cron job. This is what I set up on my own machine after recovering from the incident:
~/.openclaw/scripts/check-bloated-sessions.sh
#!/bin/bash
# Detects openclaw session jsonl files that have grown beyond a safe size.
# Bloated sessions can hang the gateway via String.prototype.replace on huge strings.
set -uo pipefail

THRESHOLD_MB=50
LOGFILE="$HOME/.openclaw/logs/bloated-sessions.log"
SCAN_ROOT="$HOME/.openclaw/agents"

mkdir -p "$(dirname "$LOGFILE")"
ts=$(date +"%Y-%m-%dT%H:%M:%S%z")

# List every session jsonl above the threshold (empty string if none).
results=$(find "$SCAN_ROOT" -type f -name "*.jsonl" -size +${THRESHOLD_MB}M -exec ls -lh {} \; 2>/dev/null)

if [ -n "$results" ]; then
  count=$(printf '%s\n' "$results" | wc -l | tr -d ' ')
  {
    echo "[$ts] WARNING: found $count session jsonl(s) over ${THRESHOLD_MB}MB (gateway hang risk)"
    printf '%s\n' "$results"
    echo ""
  } >> "$LOGFILE"
  # macOS native notification
  osascript -e "display notification \"Found $count bloated openclaw session(s) over ${THRESHOLD_MB}MB. Check ~/.openclaw/logs/bloated-sessions.log\" with title \"OpenClaw: bloated session detected\"" 2>/dev/null || true
  exit 1
else
  echo "[$ts] OK: no session jsonl over ${THRESHOLD_MB}MB" >> "$LOGFILE"
  exit 0
fi
The crontab entry runs this script daily at 9am.
The cron runs independently of openclaw itself, so even if the gateway is hung, the check still runs and notifies you. On macOS the osascript line produces a native notification banner; on Linux you'd swap that for notify-send or whatever your DE uses.
Why 50 MB is the right threshold: I checked all my session files after the incident. Other healthy main-agent sessions max out at ~10 MB. Anything over 50 MB is suspicious; anything over 100 MB is almost certainly broken. find -size +50M catches the issue while it's still recoverable (smaller files = faster archive + less GC pressure during the read).
This is a 30-second setup that would have saved me ~6 hours of debugging if I'd known about the bug class. Recommending all openclaw users do this until upstream ships a real fix.
Suggested upstream fixes
Hard size limit on session jsonl files (e.g. 50 MB) with auto-rotation to numbered files.
Sessions could optionally be loaded across all rotation files when needed, but the current "load entire session" code path would only ever see a small file.
openclaw doctor should warn on bloated session files, e.g. anything over 50 MB. Currently doctor checks plist/auth/cron/locks but never validates session file sizes. A simple find ~/.openclaw/agents -name "*.jsonl" -size +50M check would catch this in seconds.
The agent processing pipeline should never String.prototype.replace on a multi-MB string in a single call. Either stream/chunk the regex processing, or restrict the regex to only the most-recent N turns.
Failed/aborted retries should not append to the active session file — they should go to a quarantine file or be discarded. The retry path should not contribute to session bloat.
openclaw gateway health should distinguish between "process running and serving" and "process running but event loop blocked". The current health check times out and reports Discord: failed (operation aborted) which is misleading — Discord is fine, the event loop is blocked. A direct event-loop liveness probe (e.g. setImmediate round-trip with a tight timeout) would catch this.
Document discovery.mdns.mode and other escape-hatch config keys publicly. The mdns kill switch (mode: "off") isn't in any user-facing docs but is the documented workaround for #64484 (Bonjour/mDNS Advertiser Stuck in Flapping Loop). The same likely applies to other internal config keys.
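The size limit with rotation to numbered files, from the first suggestion above, could behave roughly like the sketch below. Everything here is an assumption for illustration: rotate_if_large, the .1/.2 suffix scheme and the keep count are hypothetical, and a real fix would live in the gateway's own write path rather than in a shell script run next to a live writer.

```shell
#!/bin/bash
# Hypothetical sketch: once FILE exceeds LIMIT_BYTES, move it to FILE.1 and
# shift older copies to FILE.2, FILE.3, ... keeping at most KEEP of them.
# This is not openclaw's behaviour, only an illustration of the idea.
rotate_if_large() {
  local file="$1" limit_bytes="$2" keep="${3:-5}" size i
  [ -f "$file" ] || return 0
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -gt "$limit_bytes" ] || return 0
  # Shift existing copies up by one, oldest first; FILE.KEEP is overwritten.
  i=$((keep - 1))
  while [ "$i" -ge 1 ]; do
    [ -f "$file.$i" ] && mv -f "$file.$i" "$file.$((i + 1))"
    i=$((i - 1))
  done
  mv -f "$file" "$file.1"
  : > "$file"   # start a fresh, empty active file
}

# Example: rotate_if_large "$session_file" $((50 * 1024 * 1024)) 5
```

With this shape, the "load entire session" path only ever reads the active file, which stays under the limit, while older turns remain available in the numbered copies.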
Why this matters
This bug is silent. An openclaw user installs it, uses it for weeks, suddenly the gateway "just hangs" and they have no idea why. Restarting helps for a few minutes (until the agent processes another message that triggers the regex). The gateway looks healthy to every monitoring tool. Logs go silent at the moment of the hang. Signal handlers stop working.
If you don't know to run sample and lsof (and most users won't), you'll spend hours chasing wrong rabbit holes — exactly like I did. The presence of the issues #64484 (Bonjour cascade) and #59820 (missing bundled deps) actively misled me because both have the same surface symptoms (gateway hangs, signals ignored, slow health checks, Discord aborts).
The 444 MB session was a separate, undiagnosed problem that mimicked the symptoms of those other bugs, and I only found it through stack-trace analysis after fixing the other (real) issues didn't help.
Acknowledgments
This investigation took ~6 hours of pair-debugging with an AI assistant (Claude Code via Claude Opus 4.6, separate from the openclaw gateway being debugged). The AI assistant ran the sample, lsof, and log-analysis steps and identified the regex hang signature. The diagnostic chain was novel — sample is not used elsewhere in the openclaw issue tracker as far as I can find.