Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
Gateway CPU 100-130% idle — root causes identified & workarounds (v2026.4.29)
Related to #75688 — same version, same symptoms (100% CPU from startup, ~724MB RAM, node.list 20s+ latency). This issue provides identified root causes and working mitigations.
Environment
- OS: Ubuntu Linux (systemd user service)
- Node: v22.22.1
- OpenClaw: v2026.4.29 (gateway mode)
- Agent: groq/qwen3-32b (free tier), fallbacks: deepseek-v4-flash, gemini-2.5-flash
- Channels: WhatsApp (Baileys)
- Hardware: dedicated Linux server (24GB RAM, 8 cores)
Symptom
Gateway process sits at 100-130% CPU permanently, even with zero inbound messages. The gateway becomes unresponsive or responds with 60s+ delays. Killing and restarting reproduces the issue within minutes.
Root Causes Identified
After extensive debugging, we found multiple independent issues compounding into permanent CPU saturation:
1. Zombie sessions re-launching on every boot (main culprit)
Persisted session files in `~/.openclaw/agents/*/sessions/*.jsonl` re-launch "embedded runs" on every gateway start. Even after the user runs `/new` on WhatsApp, the old session file remains on disk and triggers a new agent run at boot.
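Until sessions get a TTL (see Suggestions), a boot-time prune along these lines would avoid re-launching stale sessions. A minimal sketch, assuming a flat directory of .jsonl session files; the TTL value and directory layout are assumptions, not OpenClaw internals:

```ts
// Hypothetical boot-time prune: delete session files older than a TTL
// instead of re-launching them as embedded runs. Standard node:fs APIs.
import { readdirSync, statSync, unlinkSync } from "node:fs";
import { join } from "node:path";

const SESSION_TTL_MS = 24 * 60 * 60 * 1000; // assumed 24h TTL

function pruneStaleSessions(sessionsDir: string): void {
  const now = Date.now();
  for (const name of readdirSync(sessionsDir)) {
    if (!name.endsWith(".jsonl")) continue;
    const file = join(sessionsDir, name);
    if (now - statSync(file).mtimeMs > SESSION_TTL_MS) {
      unlinkSync(file); // stale session: clean up instead of re-launching
    }
  }
}
```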
2. Compaction safeguard re-trigger loop
When a session has an empty or already-compacted context, the safeguard fires repeatedly:
```
[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; writing compaction boundary to suppress re-trigger loop.
```
Despite the log saying "suppress re-trigger loop", it does not actually stop — it triggers another embedded run toward the LLM on the next cycle.
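The fix suggested below is an early return. All names in this sketch are illustrative stand-ins for OpenClaw's internals; the point is that an empty context should short-circuit before anything gets scheduled:

```ts
// Illustrative no-op guard: with nothing to summarize, return without
// writing a boundary marker or scheduling another embedded LLM run.
interface Session {
  messages: unknown[]; // real conversation messages only
}

function runCompactionSafeguard(
  session: Session,
  launchEmbeddedRun: (s: Session) => void,
): void {
  if (session.messages.length === 0) {
    return; // no-op: the boundary-marker approach demonstrably re-triggers
  }
  launchEmbeddedRun(session);
}
```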
3. Groq free tier 6000 TPM → fallback cascade with full re-tokenization
Accumulated context (~50k tokens) exceeds Groq's 6000 TPM limit → 413 rejection → fallback to DeepSeek → timeout → fallback to Gemini. Each fallback re-tokenizes the entire context on the Node.js main thread (CPU-bound).
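Reusing the count from the first attempt (see Suggestions) could look roughly like this; `Provider` and `countTokens` are hypothetical stand-ins, not OpenClaw or provider-SDK APIs:

```ts
// Sketch: tokenize the context once, then walk the fallback chain with
// the cached count instead of re-tokenizing ~50k tokens per provider.
type Provider = { name: string; send: (ctx: string) => Promise<string> };

async function sendWithFallback(
  ctx: string,
  providers: Provider[],
  countTokens: (text: string) => number,
): Promise<string> {
  const tokenCount = countTokens(ctx); // paid once, on the main thread
  for (const p of providers) {
    try {
      console.log(`trying ${p.name} with ${tokenCount} tokens (cached count)`);
      return await p.send(ctx);
    } catch {
      continue; // 413 / timeout: fall through without re-tokenizing
    }
  }
  throw new Error("all providers failed");
}
```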
4. Discord slash command deploy retry loop (even when disabled)
With `channels.discord.enabled: false`, the plugin still attempts to deploy slash commands at boot → gets rate-limited by Discord (429) → retries indefinitely in a tight loop.
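The clean fix is to skip deployment when the channel is disabled; failing that, retries should be bounded and honor Retry-After instead of spinning. A sketch with assumed names (`deploy` stands in for whatever performs the Discord API call):

```ts
// Bounded deploy retry: guard on the enabled flag first, then back off
// per Discord's retry-after header rather than looping tightly.
async function deploySlashCommands(
  enabled: boolean,
  deploy: () => Promise<Response>,
  maxAttempts = 5,
): Promise<void> {
  if (!enabled) return; // disabled channel: never touch the Discord API
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await deploy();
    if (res.status !== 429) return;
    const waitS = Number(res.headers.get("retry-after") ?? 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, waitS * 1000));
  }
  console.warn("slash-command deploy still rate-limited; giving up");
}
```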
5. `plugins.entries.X.enabled: false` does not prevent loading
Setting lossless-claw to `enabled: false` in `plugins.entries` does not prevent it from loading. The only workaround is to use `plugins.allow` as a whitelist to explicitly block it.
6. V8 GC thrashing — unbounded heap (ref: #13758)
Without `--max-old-space-size`, the heap grows unbounded with large conversation contexts, causing constant GC thrashing. Related to #13758 / #6413.
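For anyone verifying that the flag took effect, Node's built-in `node:v8` module reports the effective ceiling (standard API, nothing OpenClaw-specific):

```ts
// Confirm the heap limit from inside the process; values are bytes.
import { getHeapStatistics } from "node:v8";

const { heap_size_limit, used_heap_size } = getHeapStatistics();
console.log(
  `heap limit ${(heap_size_limit / 1048576).toFixed(0)} MB, ` +
    `used ${(used_heap_size / 1048576).toFixed(0)} MB`,
);
// With --max-old-space-size=1536 the limit should report roughly 1.5 GB
// (plus a small reserve); without it, 64-bit builds default to a few GB.
```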
7. Plugin runtime staging on every inbound message
31 NPM dependencies are re-resolved on every single inbound message (even if already installed). Takes 1-16 seconds + CPU each time.
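The spec-hash caching suggested below could be as simple as memoizing the staging promise; `stageDependencies` is a hypothetical stand-in for the current per-message resolution step:

```ts
// Memoize staging by a hash of the sorted dependency specs, so the
// 1-16s resolution cost is paid once per unique spec set, not per message.
import { createHash } from "node:crypto";

const stagedBySpecHash = new Map<string, Promise<void>>();

function stageOnce(
  specs: string[],
  stageDependencies: (s: string[]) => Promise<void>,
): Promise<void> {
  const hash = createHash("sha256")
    .update(specs.slice().sort().join("\n"))
    .digest("hex");
  let staged = stagedBySpecHash.get(hash);
  if (!staged) {
    staged = stageDependencies(specs);
    stagedBySpecHash.set(hash, staged);
  }
  return staged; // later messages await the cached (possibly settled) promise
}
```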
Workarounds Applied
| Workaround | CPU Impact |
|---|---|
| Delete zombie sessions (`rm ~/.openclaw/agents/*/sessions/*.jsonl`) | 100%+ → 30% |
| `NODE_OPTIONS=--max-old-space-size=1536` in systemd env | Reduces GC thrashing |
| Disable Discord channel entirely | Eliminates 429 retry loop |
| Use `plugins.allow` whitelist to block unwanted plugins | Prevents parasitic loading |
| Disable `hooks.internal.entries.session-memory` | Reduces unnecessary disk writes |
| Set `contextTokens: 128000` (was 32000) | Stops compaction safeguard loop |
| Purge entire `~/.openclaw/agents/` directory | Clean session reset |
After all workarounds: ~15% idle CPU (acceptable), with temporary spikes during message processing (tokenization + model resolution + streaming).
Suggestions
- Compaction safeguard should not trigger an embedded run when there's nothing to compact — it should just no-op
- Sessions should have a TTL or auto-clean when the user starts a new session
- Plugin runtime staging should cache by spec hash instead of re-resolving on every message
- `plugins.entries.X.enabled: false` should be sufficient to prevent loading without needing a `plugins.allow` whitelist
- Disabled channels (`enabled: false`) should not load any connection logic or attempt external API calls at boot
- Model fallback should not re-tokenize the full context from scratch — the token count from the first attempt should be reusable
Diagnostic breadcrumbs
```
[diagnostic] liveness warning: reasons=event_loop_delay interval=36s eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=10351.5 eventLoopUtilization=0.662
```
Event loop blocked for 10+ seconds during idle — confirms main-thread CPU spin, not I/O wait.
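The same metrics can be sampled in any Node process with the standard perf_hooks histogram (whether OpenClaw's liveness check uses this internally is an assumption, but the field names line up):

```ts
// Sample event-loop delay like the liveness log does; histogram values
// are nanoseconds, so divide by 1e6 for milliseconds.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
setInterval(() => {
  console.log(
    `eventLoopDelayP99Ms=${(histogram.percentile(99) / 1e6).toFixed(1)} ` +
      `eventLoopDelayMaxMs=${(histogram.max / 1e6).toFixed(1)}`,
  );
  histogram.reset();
}, 36_000);
```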
Correlation with #75688
The reporter in #75688 observes the same pattern on macOS ARM64:
- 100% CPU from startup, never drops
- `node.list` latency 21-35s (we also see 9-11s)
- ~724MB RSS (we see 745MB before fixes)
- Plugin bundled runtime deps (30-31 specs) staging overhead
- Web UI polling exacerbates but is not causative
Their CPU profile shows all samples in uv_run → uv__io_poll → uv__stream_io, which is consistent with our finding: synchronous tokenization and plugin resolution work saturate libuv's event loop on the Node.js main thread.
The difference: we isolated the causes by disabling components one by one and identified that zombie sessions + compaction safeguard loop are the primary drivers, with plugin staging and disabled-but-still-active channels as amplifiers.
Steps to reproduce
- Configure gateway mode with groq/qwen3-32b (free tier) + fallbacks
- Enable WhatsApp (Baileys) channel, disable Discord (`enabled: false`)
- Let a few sessions accumulate in `~/.openclaw/agents/*/sessions/`
- Restart the gateway
- Observe CPU immediately climbing to 100%+ with no inbound messages
Expected behavior
Gateway should be near-idle (~1-5% CPU) when no messages are being processed. Fallback cascades
should not trigger CPU-bound re-tokenization. Disabled channels/plugins should not run any logic.
Actual behavior
Gateway sits at 100-130% CPU permanently with zero inbound messages. Responses take 60s+, `node.list` latency 20s+. Reproduces within minutes of restart.
OpenClaw version
v2026.4.29
Operating system
Ubuntu Linux
Install method
No response
Model
groq/qwen3-32b
Provider / routing chain
groq/qwen3-32b → deepseek-v4-flash → gemini-2.5-flash
Additional provider/model setup details
No response
Logs, screenshots, and evidence
Impact and severity
No response
Additional information
No response