[Bug]: Gateway WS self-contention still unresolved — cron tool timeouts from active sessions (#5703/#6508 circular-duped) #40237

@spectra-the-bot

Summary

Gateway WS self-contention when calling the cron tool from within an active LLM session is still unresolved. The original issues (#5703 and #6508) were closed as duplicates of each other, so neither remained open and no actual fix landed.

The bug persists as of v2026.3.x. We hit it daily when triggering cron jobs from active sessions.

Reproduction (still works)

  1. From an active LLM session (e.g., Discord or Telegram), call the cron tool (run/list/add)
  2. The tool opens a second WS connection to the same gateway
  3. Gateway's single-threaded event loop is busy processing the current LLM turn
  4. Second WS request sits in queue, never gets processed → timeout after 10s
  5. The job actually runs successfully; only the ack times out

Resulting error output:

Error: gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789

Also reproducible via CLI: openclaw cron run <jobId> from within an active session.

Root cause (unchanged from #6508)

The cron tool routes through a new WS connection to the gateway instead of using the existing session's IPC/WS channel. The gateway's Node.js event loop is occupied by the current LLM turn, so it can't respond to the second connection within the timeout window.

This is not a resource issue — CPU/memory are fine. It's purely single-threaded event loop contention.
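
For anyone who wants to see the mechanism in isolation, here is a minimal sketch of event loop contention in a single Node.js process. It is not OpenClaw code: blockFor() is a hypothetical stand-in for whatever holds the loop during an LLM turn, and it assumes the turn blocks synchronously, which this issue asserts but the sketch does not verify.

```typescript
// Minimal demonstration of event-loop contention in one Node.js process.
// blockFor() is a hypothetical stand-in for work that holds the loop.

// Busy-wait synchronously for the given number of milliseconds.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: no timers or socket callbacks can run while this loop executes
  }
}

// Schedule a callback for "as soon as possible", then block the loop.
// Resolves with how late the callback actually fired, in milliseconds.
function measureDelay(blockMs: number): Promise<number> {
  return new Promise<number>((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);
    blockFor(blockMs); // the timer cannot fire until this returns
  });
}

measureDelay(200).then((lateBy) => {
  console.log(`callback fired ${lateBy}ms after it was scheduled`);
});
```

The callback is requested with a 0ms delay but cannot run until the blocking work returns. A second WS request queued behind a long turn would be starved the same way, regardless of CPU or memory headroom.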

Evidence from #6508 discussion

  • Internal IPC path (server-bridge-methods) works instantly from active sessions
  • External WS connections work fine when no session is active (17ms response)
  • The timeout only occurs when the same gateway is already processing an LLM turn
  • handshake=connected in logs confirms it's not an auth issue — the connection establishes, the gateway just never responds

Preferred fix (from #6508 community discussion)

Option B: In-process function calls for gateway-native tools

  • The internal cron tool already has an IPC path that works perfectly
  • Route embedded tool calls (cron, gateway config, etc.) through in-process IPC instead of WS
  • No second connection or WS round trip, so the call does not wait behind the busy event loop
  • This is how some tools (e.g., gateway.config.get) already work internally
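
A rough sketch of what Option B could look like. Every name here (ToolHandler, nativeHandlers, callTool, callOverWs) is hypothetical and only illustrates the routing idea; the real gateway internals may be organized quite differently.

```typescript
// Hypothetical sketch of Option B: dispatch gateway-native tools in-process.
// None of these names come from the OpenClaw codebase.

type ToolHandler = (params: Record<string, unknown>) => Promise<unknown>;

// Registry of handlers that live inside the gateway process itself.
const nativeHandlers = new Map<string, ToolHandler>();

// Example registration with placeholder data.
nativeHandlers.set("cron.list", async () => [{ id: "job-1", enabled: true }]);

// Stand-in for the current path that opens a new WS connection.
async function callOverWs(
  method: string,
  _params: Record<string, unknown>,
): Promise<unknown> {
  throw new Error(`gateway timeout calling ${method}`);
}

// Prefer the in-process handler; fall back to WS only for non-native methods.
async function callTool(
  method: string,
  params: Record<string, unknown> = {},
): Promise<unknown> {
  const handler = nativeHandlers.get(method);
  if (handler) {
    return handler(params); // direct function call, no socket involved
  }
  return callOverWs(method, params);
}
```

With this shape, callTool("cron.list") resolves from the registry without touching a socket, while anything not registered still goes over WS as it does today.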

Alternative: Option A — Multiplex tool calls on the existing session WS channel instead of opening a new connection. More complex but also viable.
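
For comparison, Option A might be sketched as request/response correlation by id over the channel the session already holds. The frame shapes and the Multiplexer class are assumptions for illustration, not the gateway's actual wire protocol.

```typescript
// Hypothetical sketch of Option A: multiplex tool calls over one open channel.
// Frame shapes and names are illustrative, not the gateway's real protocol.

interface RequestFrame { id: number; method: string; params?: unknown }
interface ResponseFrame { id: number; result?: unknown; error?: string }

interface Pending {
  resolve: (value: unknown) => void;
  reject: (error: Error) => void;
}

class Multiplexer {
  private nextId = 1;
  private pending = new Map<number, Pending>();
  private send: (frame: RequestFrame) => void;

  // send is whatever writes a frame to the already-open session channel.
  constructor(send: (frame: RequestFrame) => void) {
    this.send = send;
  }

  // Issue a request and remember how to settle it when the reply arrives.
  request(method: string, params?: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.send({ id, method, params });
    });
  }

  // Call this for every response frame read from the channel.
  handleResponse(frame: ResponseFrame): void {
    const entry = this.pending.get(frame.id);
    if (!entry) return; // unknown or already-settled id
    this.pending.delete(frame.id);
    if (frame.error !== undefined) entry.reject(new Error(frame.error));
    else entry.resolve(frame.result);
  }
}
```

Because replies are matched by id, several tool calls can be in flight on the one connection and may complete out of order. The extra bookkeeping is the main reason this reads as the more complex option.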

Current workaround

# Use CLI via exec tool instead of native cron tool
openclaw cron run <jobId> --timeout 3000 2>&1 || true

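This spawns a separate process with its own event loop. The CLI call can still report the gateway timeout (see Reproduction), but the error is cosmetic: the job always runs, and the trailing || true discards the non-zero exit status. It is still noisy and confusing for LLM agents, which may interpret the timeout as a failure and retry.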
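
Until a fix lands, an agent-side wrapper could at least avoid retrying on a missed ack. The sketch below assumes, as reported in this issue, that the job runs even when the ack times out. The regex matches the error text quoted above; the type and function names are hypothetical.

```typescript
// Hypothetical helper: decide whether a cron tool error is only a missed ack.
// Assumes the behavior reported in this issue (job runs, ack times out).

type CronCallOutcome = "ok" | "ack-timeout" | "failed";

// Matches messages such as "gateway timeout after 10000ms".
const ACK_TIMEOUT = /gateway timeout after \d+ms/i;

// Pass null when the call returned without an error message.
function classifyCronResult(errorMessage: string | null): CronCallOutcome {
  if (errorMessage === null) return "ok";
  return ACK_TIMEOUT.test(errorMessage) ? "ack-timeout" : "failed";
}

// Only a real failure should trigger a retry; after a missed ack the job
// has most likely started already, so retrying risks a duplicate run.
function shouldRetry(outcome: CronCallOutcome): boolean {
  return outcome === "failed";
}
```

This only reduces the duplicate-job risk on the caller side. It does nothing about the contention itself.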

Impact

  • Every cron tool call from an active session hits this
  • Risk of duplicate jobs if LLM retries on false timeout
  • Affects cron.run, cron.list, cron.add, and potentially other gateway-native tool calls under load
  • Users/agents must use CLI workaround, which adds latency and error noise
