[Bug]: Gateway CPU pinned at 100%: root causes & workarounds (complements #75688) #75707

@AnathemaOfficial

Description

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Gateway CPU at 100-130% while idle: root causes identified and workarounds (v2026.4.29)

Related to #75688 — same version, same symptoms (100% CPU from startup, ~724MB RAM, node.list 20s+ latency). This issue provides identified root causes and working mitigations.

Environment

  • OS: Ubuntu Linux (systemd user service)
  • Node: v22.22.1
  • OpenClaw: v2026.4.29 (gateway mode)
  • Agent: groq/qwen3-32b (free tier), fallbacks: deepseek-v4-flash, gemini-2.5-flash
  • Channels: WhatsApp (Baileys)
  • Hardware: dedicated Linux server (24GB RAM, 8 cores)

Symptom

Gateway process sits at 100-130% CPU permanently, even with zero inbound messages. The gateway becomes unresponsive or responds with 60s+ delays. Killing and restarting reproduces the issue within minutes.

Root Causes Identified

After extensive debugging, we found multiple independent issues compounding into permanent CPU saturation:

1. Zombie sessions re-launching on every boot (main culprit)

Persisted session files in ~/.openclaw/agents/*/sessions/*.jsonl re-launch "embedded runs" on every gateway start. Even after the user runs /new on WhatsApp, the old session file remains on disk and triggers a new agent run at boot.
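
To confirm this on an affected host, list the persisted session files before a restart (paths as described above; deleting them discards session history):

    # Session files that re-launch embedded runs on the next gateway boot
    ls -lh ~/.openclaw/agents/*/sessions/*.jsonl

    # Remove them before restarting (this is the workaround from the table below)
    rm ~/.openclaw/agents/*/sessions/*.jsonl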

2. Compaction safeguard re-trigger loop

When a session has an empty or already-compacted context, the safeguard fires repeatedly:

[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; writing compaction boundary to suppress re-trigger loop.

Despite the log saying "suppress re-trigger loop", it does not actually stop — it triggers another embedded run toward the LLM on the next cycle.
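
What we expected the safeguard to do instead, as a minimal sketch (all names here are hypothetical stand-ins for OpenClaw internals, not its actual code):

    type Session = { messages: { role: string; synthetic?: boolean }[] };
    declare function writeCompactionBoundary(s: Session): void;
    declare function scheduleEmbeddedRun(s: Session): void;

    function runCompactionSafeguard(session: Session): void {
      const real = session.messages.filter((m) => !m.synthetic);
      if (real.length === 0) {
        writeCompactionBoundary(session); // marks "nothing to summarize"
        return;                           // no-op, as the log message implies
      }
      scheduleEmbeddedRun(session);       // only when real content exists
    }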

3. Groq free tier 6000 TPM → fallback cascade with full re-tokenization

Accumulated context (~50k tokens) exceeds Groq's 6000 TPM limit → 413 rejection → fallback to DeepSeek → timeout → fallback to Gemini. Each fallback re-tokenizes the entire context on the Node.js main thread (CPU-bound).
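
Suggestion 6 below follows from this. A sketch of the caching we mean (expensiveTokenize is a hypothetical stand-in; exact counts differ per provider tokenizer, but even a cached approximation is enough to pre-reject over-limit requests without burning CPU):

    declare function expensiveTokenize(context: string): number[];

    const tokenCountCache = new Map<string, number>(); // context hash -> count

    function cachedTokenCount(contextHash: string, context: string): number {
      let count = tokenCountCache.get(contextHash);
      if (count === undefined) {
        count = expensiveTokenize(context).length; // CPU-bound, runs once per context
        tokenCountCache.set(contextHash, count);
      }
      return count;
    }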

4. Discord slash command deploy retry loop (even when disabled)

With channels.discord.enabled: false, the plugin still attempts to deploy slash commands at boot → gets rate-limited by Discord (429) → retries indefinitely in a tight loop.
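
Suggestion 5 below is the fix we'd expect, roughly this shape (hypothetical stand-ins for the plugin's boot steps):

    type ChannelConfig = { enabled: boolean };
    declare function deploySlashCommands(cfg: ChannelConfig): Promise<void>;
    declare function connectChannel(cfg: ChannelConfig): Promise<void>;

    async function bootDiscordChannel(cfg: ChannelConfig): Promise<void> {
      if (!cfg.enabled) return;       // no API calls, so no 429s and no retry loop
      await deploySlashCommands(cfg);
      await connectChannel(cfg);
    }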

5. plugins.entries.X.enabled: false does not prevent loading

Setting lossless-claw to enabled: false in plugins.entries does not prevent it from loading. The only workaround we found is a plugins.allow whitelist that simply omits it, as shown below.
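
What worked for us, as a config sketch (shown as YAML; the exact file format and any plugin names other than lossless-claw are assumptions, adjust to your install):

    plugins:
      allow:                   # whitelist: only listed plugins load at all
        - some-needed-plugin   # placeholder name
      entries:
        lossless-claw:
          enabled: false       # observed: ignored at load time in v2026.4.29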

6. V8 GC thrashing — unbounded heap (ref: #13758)

Without --max-old-space-size, the heap grows unbounded with large conversation contexts, causing constant GC thrashing. Related to #13758 / #6413.
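
For a systemd user service, the heap cap can live in a drop-in (the unit name openclaw.service is an assumption; substitute your actual unit):

    # ~/.config/systemd/user/openclaw.service.d/override.conf
    [Service]
    Environment=NODE_OPTIONS=--max-old-space-size=1536

    # Apply with:
    #   systemctl --user daemon-reload
    #   systemctl --user restart openclaw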

7. Plugin runtime staging on every inbound message

31 NPM dependencies are re-resolved on every single inbound message (even if already installed). Takes 1-16 seconds + CPU each time.
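
Suggestion 3 below addresses this. A sketch of staging keyed by a spec hash (resolveAndInstall is a hypothetical stand-in for the 1-16 second step):

    import { createHash } from "node:crypto";

    declare function resolveAndInstall(specs: string[]): string;

    const stagedDirBySpecHash = new Map<string, string>(); // hash -> staged dir

    function stageRuntimeDeps(specs: string[]): string {
      const hash = createHash("sha256")
        .update(JSON.stringify([...specs].sort())) // order-independent key
        .digest("hex");
      let dir = stagedDirBySpecHash.get(hash);
      if (dir === undefined) {
        dir = resolveAndInstall(specs); // now runs once per unique spec set
        stagedDirBySpecHash.set(hash, dir);
      }
      return dir;
    }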

Workarounds Applied

Workaround                                                         CPU impact
-----------------------------------------------------------------  -------------------------------
Delete zombie sessions (rm ~/.openclaw/agents/*/sessions/*.jsonl)  100%+ → 30%
NODE_OPTIONS=--max-old-space-size=1536 in systemd env              Reduces GC thrashing
Disable Discord channel entirely                                   Eliminates 429 retry loop
Use plugins.allow whitelist to block unwanted plugins              Prevents parasitic loading
Disable hooks.internal.entries.session-memory                      Reduces unnecessary disk writes
Set contextTokens: 128000 (was 32000)                              Stops compaction safeguard loop
Purge entire ~/.openclaw/agents/ directory                         Clean session reset

After all workarounds: ~15% idle CPU (acceptable), with temporary spikes during message processing (tokenization + model resolution + streaming).

Suggestions

  1. Compaction safeguard should not trigger an embedded run when there's nothing to compact — it should just no-op
  2. Sessions should have a TTL or be auto-cleaned when the user starts a new session (see the sketch after this list)
  3. Plugin runtime staging should cache by spec hash instead of re-resolving on every message
  4. plugins.entries.X.enabled: false should be sufficient to prevent loading without needing a plugins.allow whitelist
  5. Disabled channels (enabled: false) should not load any connection logic or attempt external API calls at boot
  6. Model fallback should not re-tokenize the full context from scratch — the token count from the first attempt should be reusable
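
For suggestion 2, a minimal sketch of TTL-based pruning at boot (standalone Node code, not OpenClaw's internals):

    import { readdir, stat, unlink } from "node:fs/promises";
    import { join } from "node:path";

    // Delete session files older than ttlMs so they are never re-launched.
    async function pruneStaleSessions(dir: string, ttlMs: number): Promise<void> {
      const now = Date.now();
      for (const name of await readdir(dir)) {
        if (!name.endsWith(".jsonl")) continue;
        const file = join(dir, name);
        const { mtimeMs } = await stat(file);
        if (now - mtimeMs > ttlMs) await unlink(file); // stale: skip relaunch
      }
    }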

Diagnostic breadcrumbs

[diagnostic] liveness warning: reasons=event_loop_delay interval=36s eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=10351.5 eventLoopUtilization=0.662

Event loop blocked for 10+ seconds during idle — confirms main-thread CPU spin, not I/O wait.
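
The same three metrics can be reproduced outside the gateway with Node's built-in perf_hooks, for anyone who wants to verify independently:

    import { monitorEventLoopDelay, performance } from "node:perf_hooks";

    // Histogram values are in nanoseconds; convert to ms to match the log line.
    const h = monitorEventLoopDelay({ resolution: 20 });
    h.enable();

    setInterval(() => {
      const elu = performance.eventLoopUtilization(); // cumulative since start
      console.log(
        `eventLoopDelayP99Ms=${(h.percentile(99) / 1e6).toFixed(1)}`,
        `eventLoopDelayMaxMs=${(h.max / 1e6).toFixed(1)}`,
        `eventLoopUtilization=${elu.utilization.toFixed(3)}`,
      );
      h.reset();
    }, 36_000); // same 36s interval as the gateway's liveness check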

Correlation with #75688

The reporter in #75688 observes the same pattern on macOS ARM64:

  • 100% CPU from startup, never drops
  • node.list latency 21-35s (we also see 9-11s)
  • ~724MB RSS (we see 745MB before fixes)
  • Plugin bundled runtime deps (30-31 specs) staging overhead
  • Web UI polling exacerbates but is not causative

Their CPU profile shows all samples in uv_run → uv__io_poll → uv__stream_io, i.e. on the main thread that drives the libuv event loop. That is consistent with our finding that synchronous tokenization and plugin-resolution work on the main thread saturates the event loop.

The difference: we isolated the causes by disabling components one by one and identified that zombie sessions + compaction safeguard loop are the primary drivers, with plugin staging and disabled-but-still-active channels as amplifiers.

Steps to reproduce

  1. Configure gateway mode with groq/qwen3-32b (free tier) + fallbacks
  2. Enable WhatsApp (Baileys) channel, disable Discord (enabled: false)
  3. Let a few sessions accumulate in ~/.openclaw/agents/*/sessions/
  4. Restart the gateway
  5. Observe CPU immediately climbing to 100%+ with no inbound messages

Expected behavior

Gateway should be near-idle (~1-5% CPU) when no messages are being processed. Fallback cascades
should not trigger CPU-bound re-tokenization. Disabled channels/plugins should not run any logic.

Actual behavior

Gateway sits at 100-130% CPU permanently with zero inbound messages. Responses take 60s+, node.list
latency is 20s+. Reproduces within minutes of a restart.

OpenClaw version

v2026.4.29

Operating system

Ubuntu Linux

Install method

No response

Model

groq/qwen3-32b

Provider / routing chain

groq/qwen3-32b → deepseek-v4-flash → gemini-2.5-flash

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response
