[Bug]: Gateway CPU pinned at 100%: root causes & workarounds (complements #75688) #75707

@AnathemaOfficial

Description

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Gateway CPU at 100-130% while idle: root causes identified and workarounds (v2026.4.29)

Related to #75688 — same version, same symptoms (100% CPU from startup, ~724MB RAM, node.list 20s+ latency). This issue provides identified root causes and working mitigations.

Environment

  • OS: Ubuntu Linux (systemd user service)
  • Node: v22.22.1
  • OpenClaw: v2026.4.29 (gateway mode)
  • Agent: groq/qwen3-32b (free tier), fallbacks: deepseek-v4-flash, gemini-2.5-flash
  • Channels: WhatsApp (Baileys)
  • Hardware: dedicated Linux server (24GB RAM, 8 cores)

Symptom

Gateway process sits at 100-130% CPU permanently, even with zero inbound messages. The gateway becomes unresponsive or responds with 60s+ delays. Killing and restarting reproduces the issue within minutes.

Root Causes Identified

After extensive debugging, we found multiple independent issues compounding into permanent CPU saturation:

1. Zombie sessions re-launching on every boot (main culprit)

Persisted session files in ~/.openclaw/agents/*/sessions/*.jsonl re-launch "embedded runs" on every gateway start. Even after the user runs /new on WhatsApp, the old session file remains on disk and triggers a new agent run at boot.
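
To confirm this on an affected host, list the persisted session files before a restart (paths as described above; deleting them discards session history):

    # Session files that re-launch embedded runs on the next gateway boot
    ls -lh ~/.openclaw/agents/*/sessions/*.jsonl

    # Remove them before restarting (this is the workaround from the table below)
    rm ~/.openclaw/agents/*/sessions/*.jsonl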

2. Compaction safeguard re-trigger loop

When a session has an empty or already-compacted context, the safeguard fires repeatedly:

[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; writing compaction boundary to suppress re-trigger loop.

Despite the log saying "suppress re-trigger loop", it does not actually stop — it triggers another embedded run toward the LLM on the next cycle.
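
What we expected the safeguard to do instead, as a minimal sketch (all names here are hypothetical stand-ins for OpenClaw internals, not its actual code):

    type Session = { messages: { role: string; synthetic?: boolean }[] };
    declare function writeCompactionBoundary(s: Session): void;
    declare function scheduleEmbeddedRun(s: Session): void;

    function runCompactionSafeguard(session: Session): void {
      const real = session.messages.filter((m) => !m.synthetic);
      if (real.length === 0) {
        writeCompactionBoundary(session); // marks "nothing to summarize"
        return;                           // no-op, as the log message implies
      }
      scheduleEmbeddedRun(session);       // only when real content exists
    }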

3. Groq free tier 6000 TPM → fallback cascade with full re-tokenization

Accumulated context (~50k tokens) exceeds Groq's 6000 TPM limit → 413 rejection → fallback to DeepSeek → timeout → fallback to Gemini. Each fallback re-tokenizes the entire context on the Node.js main thread (CPU-bound).
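
Suggestion 6 below follows from this. A sketch of the caching we mean (expensiveTokenize is a hypothetical stand-in; exact counts differ per provider tokenizer, but even a cached approximation is enough to pre-reject over-limit requests without burning CPU):

    declare function expensiveTokenize(context: string): number[];

    const tokenCountCache = new Map<string, number>(); // context hash -> count

    function cachedTokenCount(contextHash: string, context: string): number {
      let count = tokenCountCache.get(contextHash);
      if (count === undefined) {
        count = expensiveTokenize(context).length; // CPU-bound, runs once per context
        tokenCountCache.set(contextHash, count);
      }
      return count;
    }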

4. Discord slash command deploy retry loop (even when disabled)

With channels.discord.enabled: false, the plugin still attempts to deploy slash commands at boot → gets rate-limited by Discord (429) → retries indefinitely in a tight loop.
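
Suggestion 5 below is the fix we'd expect, roughly this shape (hypothetical stand-ins for the plugin's boot steps):

    type ChannelConfig = { enabled: boolean };
    declare function deploySlashCommands(cfg: ChannelConfig): Promise<void>;
    declare function connectChannel(cfg: ChannelConfig): Promise<void>;

    async function bootDiscordChannel(cfg: ChannelConfig): Promise<void> {
      if (!cfg.enabled) return;       // no API calls, so no 429s and no retry loop
      await deploySlashCommands(cfg);
      await connectChannel(cfg);
    }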

5. plugins.entries.X.enabled: false does not prevent loading

Setting lossless-claw to enabled: false in plugins.entries does not prevent it from loading. The only workaround we found is a plugins.allow whitelist that simply omits it, as shown below.
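
What worked for us, as a config sketch (shown as YAML; the exact file format and any plugin names other than lossless-claw are assumptions, adjust to your install):

    plugins:
      allow:                   # whitelist: only listed plugins load at all
        - some-needed-plugin   # placeholder name
      entries:
        lossless-claw:
          enabled: false       # observed: ignored at load time in v2026.4.29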

6. V8 GC thrashing — unbounded heap (ref: #13758)

Without --max-old-space-size, the heap grows unbounded with large conversation contexts, causing constant GC thrashing. Related to #13758 / #6413.
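
For a systemd user service, the heap cap can live in a drop-in (the unit name openclaw.service is an assumption; substitute your actual unit):

    # ~/.config/systemd/user/openclaw.service.d/override.conf
    [Service]
    Environment=NODE_OPTIONS=--max-old-space-size=1536

    # Apply with:
    #   systemctl --user daemon-reload
    #   systemctl --user restart openclaw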

7. Plugin runtime staging on every inbound message

31 NPM dependencies are re-resolved on every single inbound message (even if already installed). Takes 1-16 seconds + CPU each time.
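
Suggestion 3 below addresses this. A sketch of staging keyed by a spec hash (resolveAndInstall is a hypothetical stand-in for the 1-16 second step):

    import { createHash } from "node:crypto";

    declare function resolveAndInstall(specs: string[]): string;

    const stagedDirBySpecHash = new Map<string, string>(); // hash -> staged dir

    function stageRuntimeDeps(specs: string[]): string {
      const hash = createHash("sha256")
        .update(JSON.stringify([...specs].sort())) // order-independent key
        .digest("hex");
      let dir = stagedDirBySpecHash.get(hash);
      if (dir === undefined) {
        dir = resolveAndInstall(specs); // now runs once per unique spec set
        stagedDirBySpecHash.set(hash, dir);
      }
      return dir;
    }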

Workarounds Applied

Workaround                                                         CPU impact
-----------------------------------------------------------------  -------------------------------
Delete zombie sessions (rm ~/.openclaw/agents/*/sessions/*.jsonl)  100%+ → 30%
NODE_OPTIONS=--max-old-space-size=1536 in systemd env              Reduces GC thrashing
Disable Discord channel entirely                                   Eliminates 429 retry loop
Use plugins.allow whitelist to block unwanted plugins              Prevents parasitic loading
Disable hooks.internal.entries.session-memory                      Reduces unnecessary disk writes
Set contextTokens: 128000 (was 32000)                              Stops compaction safeguard loop
Purge entire ~/.openclaw/agents/ directory                         Clean session reset

After all workarounds: ~15% idle CPU (acceptable), with temporary spikes during message processing (tokenization + model resolution + streaming).

Suggestions

  1. Compaction safeguard should not trigger an embedded run when there's nothing to compact — it should just no-op
  2. Sessions should have a TTL or be auto-cleaned when the user starts a new session (see the sketch after this list)
  3. Plugin runtime staging should cache by spec hash instead of re-resolving on every message
  4. plugins.entries.X.enabled: false should be sufficient to prevent loading without needing a plugins.allow whitelist
  5. Disabled channels (enabled: false) should not load any connection logic or attempt external API calls at boot
  6. Model fallback should not re-tokenize the full context from scratch — the token count from the first attempt should be reusable
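
For suggestion 2, a minimal sketch of TTL-based pruning at boot (standalone Node code, not OpenClaw's internals):

    import { readdir, stat, unlink } from "node:fs/promises";
    import { join } from "node:path";

    // Delete session files older than ttlMs so they are never re-launched.
    async function pruneStaleSessions(dir: string, ttlMs: number): Promise<void> {
      const now = Date.now();
      for (const name of await readdir(dir)) {
        if (!name.endsWith(".jsonl")) continue;
        const file = join(dir, name);
        const { mtimeMs } = await stat(file);
        if (now - mtimeMs > ttlMs) await unlink(file); // stale: skip relaunch
      }
    }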

Diagnostic breadcrumbs

[diagnostic] liveness warning: reasons=event_loop_delay interval=36s eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=10351.5 eventLoopUtilization=0.662

Event loop blocked for 10+ seconds during idle — confirms main-thread CPU spin, not I/O wait.
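
The same three metrics can be reproduced outside the gateway with Node's built-in perf_hooks, for anyone who wants to verify independently:

    import { monitorEventLoopDelay, performance } from "node:perf_hooks";

    // Histogram values are in nanoseconds; convert to ms to match the log line.
    const h = monitorEventLoopDelay({ resolution: 20 });
    h.enable();

    setInterval(() => {
      const elu = performance.eventLoopUtilization(); // cumulative since start
      console.log(
        `eventLoopDelayP99Ms=${(h.percentile(99) / 1e6).toFixed(1)}`,
        `eventLoopDelayMaxMs=${(h.max / 1e6).toFixed(1)}`,
        `eventLoopUtilization=${elu.utilization.toFixed(3)}`,
      );
      h.reset();
    }, 36_000); // same 36s interval as the gateway's liveness check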

Correlation with #75688

The reporter in #75688 observes the same pattern on macOS ARM64:

  • 100% CPU from startup, never drops
  • node.list latency 21-35s (we also see 9-11s)
  • ~724MB RSS (we see 745MB before fixes)
  • Plugin bundled runtime deps (30-31 specs) staging overhead
  • Web UI polling exacerbates but is not causative

Their CPU profile shows all samples in uv_run → uv__io_poll → uv__stream_io, i.e. on the main thread that drives the libuv event loop. That is consistent with our finding that synchronous tokenization and plugin-resolution work on the main thread saturates the event loop.

The difference: we isolated the causes by disabling components one by one and identified that zombie sessions + compaction safeguard loop are the primary drivers, with plugin staging and disabled-but-still-active channels as amplifiers.

Steps to reproduce

  1. Configure gateway mode with groq/qwen3-32b (free tier) + fallbacks
  2. Enable WhatsApp (Baileys) channel, disable Discord (enabled: false)
  3. Let a few sessions accumulate in ~/.openclaw/agents/*/sessions/
  4. Restart the gateway
  5. Observe CPU immediately climbing to 100%+ with no inbound messages

Expected behavior

Gateway should be near-idle (~1-5% CPU) when no messages are being processed. Fallback cascades
should not trigger CPU-bound re-tokenization. Disabled channels/plugins should not run any logic.

Actual behavior

Gateway sits at 100-130% CPU permanently with zero inbound messages. Responses take 60s+, node.list
latency is 20s+. Reproduces within minutes of a restart.

OpenClaw version

v2026.4.29

Operating system

Ubuntu Linux

Install method

No response

Model

groq/qwen3-32b

Provider / routing chain

groq/qwen3-32b → deepseek-v4-flash → gemini-2.5-flash

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response
