High CPU, extreme control-plane RPC latency, and unstable polling after upgrade from 2026.4.24 to 2026.4.29/2026.5.2 #76562

@Nsch11

Summary

After upgrading OpenClaw from 2026.4.24 to newer releases (first 2026.4.29, later 2026.5.2), the gateway shows severe performance regressions that affect both control-plane responsiveness and overall system stability.

Symptoms include:

  • CPU pinned near 100% (Node process)
  • control-plane RPC calls becoming extremely slow or timing out
  • UI/WebSocket polling effectively unusable
  • intermittent fetch timeouts to external APIs (despite system-level connectivity being fine)
  • significant improvement after reverting both binary version and config state

This strongly suggests a regression introduced after 2026.4.24, potentially involving interaction with config/state written by newer versions.


Environment

  • Host: Linux x64
  • OpenClaw versions tested:
    • 2026.4.24 (stable)
    • 2026.4.26
    • 2026.4.29
    • 2026.5.2
  • Node versions:
    • v22.22.2
    • v24.15.0
  • Gateway mode:
    • systemd user service
    • LAN binding
  • Main model route:
    • KiloCode (free)
  • Features:
    • Telegram enabled
    • Bonjour disabled (in most tests)
    • memory search disabled (in later tests)

Symptoms on newer versions (2026.4.29+)

  • gateway process consumes ~95–100% CPU
  • control-plane RPC latency becomes extreme:
    • node.list: up to ~85–109s
    • agents.list: ~67s
    • sessions.list: ~58s
    • cron.list / cron.status: up to ~144s
    • models.list: several seconds, sometimes ~18–29s
  • /health and root HTTP endpoints may time out while WS/RPC still partially function
  • Telegram provider operations timing out:
    • getMe
    • setMyCommands
    • deleteWebhook
  • internal fetch timeouts, while system-level curl works

Representative logs:

  • liveness warning: eventLoopUtilization=1
  • eventLoopDelayMaxMs in tens of seconds
  • fetch timeout after ... operation=fetchWithTimeout
  • CommandLaneTaskTimeoutError
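
For anyone trying to reproduce the liveness readings above outside the gateway, here is a minimal sketch using Node's built-in `perf_hooks`. How OpenClaw itself computes `eventLoopUtilization` and `eventLoopDelayMaxMs` is not stated in this report, so `sampleLoop`, `isSaturated`, and both thresholds are hypothetical names and values chosen for illustration.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// One observation window of event-loop health.
interface LoopSample {
  utilization: number; // 0..1; 1 means the loop never went idle
  delayMaxMs: number;  // worst scheduling delay seen in the window
}

// Hypothetical thresholds, for illustration only.
const SATURATED_UTILIZATION = 0.95;
const SATURATED_DELAY_MS = 1_000;

// True when a sample looks like the readings quoted in the logs above.
function isSaturated(s: LoopSample): boolean {
  return (
    s.utilization >= SATURATED_UTILIZATION ||
    s.delayMaxMs >= SATURATED_DELAY_MS
  );
}

// Observe the loop for `windowMs` and resolve with one sample.
function sampleLoop(windowMs: number): Promise<LoopSample> {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  const before = performance.eventLoopUtilization();
  return new Promise((resolve) => {
    setTimeout(() => {
      histogram.disable();
      // Passing the earlier reading yields utilization for this window only.
      const delta = performance.eventLoopUtilization(before);
      resolve({
        utilization: delta.utilization,
        delayMaxMs: histogram.max / 1e6, // histogram values are nanoseconds
      });
    }, windowMs);
  });
}
```

Calling `sampleLoop(5000).then((s) => console.log(s, isSaturated(s)))` inside the affected process would print one reading per call. On a saturated loop the timer itself fires late, so the window can run much longer than requested.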

Key Observations

1. System networking is healthy

  • DNS resolution: OK
  • TLS handshake: OK
  • Direct curl to Telegram API: OK

This indicates the issue is internal (event loop / runtime), not network-level.

2. Config/state from newer versions degrades older versions

  • After running newer releases, even older binaries perform worse
  • Warning observed when loading config:
    • "Config was last written by a newer OpenClaw..."

This suggests possible config/state migration side effects.
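
To check whether a given config backup predates the upgrade, the version recorded in it can be compared against the running binary. The sketch below assumes the `YYYY.M.D` version format used in this report. Where OpenClaw stores the last-writer version inside its config is not shown here, so the function names and inputs are hypothetical.

```typescript
// Parse "2026.4.24" into [2026, 4, 24]; returns null for anything else.
function parseVersion(v: string): number[] | null {
  const parts = v.trim().split(".");
  if (parts.length !== 3 || !parts.every((p) => /^\d+$/.test(p))) return null;
  return parts.map(Number);
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  if (!pa || !pb) throw new Error(`unparseable version: ${!pa ? a : b}`);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// True when the config was written by a release newer than the running binary.
function writtenByNewer(configVersion: string, runningVersion: string): boolean {
  return compareVersions(configVersion, runningVersion) > 0;
}
```

The comparison is numeric per component because a plain string comparison would rank `2026.4.9` above `2026.4.24`.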

3. Startup shows heavy control-plane work

Notable time spent in:

  • plugins.bootstrap
  • http.bound
  • post-attach

This implies overhead beyond normal channel initialization.

4. Polling pattern is normal, but handler cost is abnormal

UI calls appear standard (node.list, cron.status, etc.), but execution cost becomes extremely high in newer versions.
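
One way to separate call frequency from per-call cost is to wrap each handler in a timer and aggregate by method. This is a sketch only: `LatencyLog` and `timed` are hypothetical helpers, not part of OpenClaw, and the handler signature is assumed.

```typescript
interface MethodStats {
  method: string;
  count: number;
  maxMs: number;
  meanMs: number;
}

// Per-method latency stats, keyed by RPC method name such as "node.list".
class LatencyLog {
  private stats = new Map<string, { count: number; maxMs: number; totalMs: number }>();

  record(method: string, ms: number): void {
    const s = this.stats.get(method) ?? { count: 0, maxMs: 0, totalMs: 0 };
    s.count += 1;
    s.totalMs += ms;
    s.maxMs = Math.max(s.maxMs, ms);
    this.stats.set(method, s);
  }

  // Methods sorted by worst-case latency, slowest first.
  slowest(): MethodStats[] {
    return Array.from(this.stats.entries())
      .map(([method, s]) => ({
        method,
        count: s.count,
        maxMs: s.maxMs,
        meanMs: s.totalMs / s.count,
      }))
      .sort((a, b) => b.maxMs - a.maxMs);
  }
}

// Wrap an async handler so every call is timed, including calls that throw.
function timed<A extends unknown[], R>(
  log: LatencyLog,
  method: string,
  handler: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = performance.now();
    try {
      return await handler(...args);
    } finally {
      log.record(method, performance.now() - start);
    }
  };
}
```

The wall-clock time recorded here includes time a handler spends waiting for a busy event loop. A normal call count combined with a very high `maxMs` across unrelated methods would therefore be consistent with loop saturation rather than one expensive handler.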


Background / Cron Observations

  • cron.status / cron.list sometimes normal, sometimes extremely slow
  • memory/dreaming job historically takes tens of seconds
  • heartbeat activity present in logs

However:

  • disabling heartbeat did not fully fix the issue
  • disabling memory dreaming did not fully fix the issue

Conclusion: heartbeat and memory dreaming are contributing factors, but not the root cause.


What was tried

  • disable bonjour
  • disable telegram
  • disable memory-wiki
  • disable kilocode
  • disable acpx
  • disable browser
  • disable memory search
  • Node 22 vs Node 24
  • clean reinstall
  • config cleanup
  • archive session transcripts
  • trim MEMORY.md
  • disable heartbeat
  • disable memory-core dreaming

Result: partial improvements, but the regression persists on newer versions.


What resolved the issue

  • downgrade to 2026.4.24
  • restore older-compatible config backup
  • reapply minimal config:
    • Telegram enabled
    • Bonjour disabled
    • memory search disabled
    • memory-core dreaming disabled

Result:

  • CPU dropped significantly
  • control-plane stabilized
  • polling overhead reduced

Suspected regression areas

Likely candidates:

  • control-plane RPC handlers
  • node.list / presence pipeline
  • cron collectors (cron.status, cron.list)
  • model/provider listing path
  • internal fetch (undici) under event-loop pressure
  • config/state migration logic

Request

Please investigate regressions introduced after 2026.4.24, especially in:

  • control-plane polling handlers
  • cron/status collectors
  • node presence (node.list)
  • model/provider listing
  • config migration / effective config behavior
  • fetch timeouts under high event-loop utilization

Additional sanitized logs can be provided if needed.
