
Agent runtime per-turn startup overhead is ~17s on CPU-only systems, dominating end-to-end latency for fast cloud models #63357

@tmote

Description


Summary

On a CPU-only host, every agent turn pays a ~17 second per-turn initialization cost between the channel adapter receiving the inbound message and the user message being written to the session JSONL. This happens BEFORE the LLM is called. With a fast cloud model like Claude Haiku 4.5 (whose own gateway-side LLM call measures ~1-2s), this overhead dominates end-to-end latency to the point where Teams users see ~12s minimum response times for trivial messages.

Environment

  • OpenClaw 2026.4.5 (3e72c03)
  • Ubuntu 24.04.4 LTS, Node v24.14.1 via nvm
  • Hardware: 8 vCPU Intel Xeon Gold 6130 @ 2.10 GHz (2017 Skylake, AVX-512), no GPU, 16 GB RAM, no swap
  • Gateway runs as a systemd system unit (User=pleresadmin, LoadCredentialEncrypted= for secrets)
  • LiteLLM proxy on 127.0.0.1:4000 fronting Anthropic + other providers
  • Cloudflare Tunnel inbound for msteams

The router agent (where this manifests)

A router agent configured to triage inbound messages and delegate via sessions_spawn:

{
  "id": "router",
  "name": "router",
  "model": "litellm/claude-haiku-4-5",
  "memorySearch": { "enabled": false },
  "skills": [],
  "tools": {
    "profile": "minimal",
    "alsoAllow": ["sessions_spawn"],
    "deny": ["read","edit","write","exec","process","canvas","nodes","cron","message","tts","gateway","agents_list","sessions_list","sessions_history","sessions_send","sessions_yield","subagents","session_status","web_search","web_fetch","image","pdf","browser"]
  },
  "subagents": {
    "allowAgents": ["pleres","family","code","research"],
    "requireAgentId": true
  }
}

Steps to reproduce

  1. Configure an agent with the minimal config above (or any agent on a CPU-only host).
  2. Send a Teams "Hi" message via the configured msteams binding.
  3. Observe in /tmp/openclaw/openclaw-<date>.log:
    2026-04-08T19:49:04  [msteams]  received message
    2026-04-08T19:49:04  [msteams]  dispatching to agent
    ...                  (~17 seconds of silence)
    
  4. Observe in ~/.openclaw/agents/router/sessions/<session>.jsonl:
    "timestamp": "2026-04-08T19:49:21.522Z"   # user message written
    "timestamp": "2026-04-08T19:49:23.215Z"   # assistant reply written (LLM took 1.7s)
    
  5. Observe [msteams] dispatch complete at 19:49:24.

Total dispatch time: ~20 seconds. LLM call: ~2 seconds. Per-turn startup overhead: ~17 seconds.
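For reference, the breakdown above can be recomputed directly from the timestamps quoted in the repro steps (small Python sketch; the values are the ones from the trace, with the JSONL's UTC offset dropped since all four stamps share it):

```python
from datetime import datetime

# Timestamps taken from the gateway log and session JSONL above.
dispatch_start = datetime.fromisoformat("2026-04-08T19:49:04")      # [msteams] dispatching to agent
user_written   = datetime.fromisoformat("2026-04-08T19:49:21.522")  # user message in JSONL
reply_written  = datetime.fromisoformat("2026-04-08T19:49:23.215")  # assistant reply in JSONL
dispatch_done  = datetime.fromisoformat("2026-04-08T19:49:24")      # [msteams] dispatch complete

startup = (user_written - dispatch_start).total_seconds()  # pre-LLM overhead
llm     = (reply_written - user_written).total_seconds()   # LLM round trip
total   = (dispatch_done - dispatch_start).total_seconds()

print(f"startup={startup:.1f}s llm={llm:.1f}s total={total:.1f}s")
# → startup=17.5s llm=1.7s total=20.0s
```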

What we ruled out

We isolated this carefully across many iterations:

  • Not memorySearch: disabling memorySearch.enabled on the agent saved ~10s (22s → 12s), but a ~10s residual remained; even with memorySearch fully off, the trace above still shows ~17s of per-turn overhead before the LLM call.
  • Not MCP servers: removed all MCP servers (microsoft-azure, microsoft-mssql, microsoft-office365) — they were each adding ~30s of bundle-mcp timeout per call. After removing them, the residual ~17s overhead remained.
  • Not tool schemas: tools.profile=minimal + explicit deny of 23 tools (only sessions_spawn allowed) — verified via tcpdump that the structured tools list passed to the LLM is small. Per-turn overhead unchanged.
  • Not bootstrap files: trimmed the agent's workspace bootstrap from 5668 chars (~1417 tokens) to 1447 chars (~362 tokens). Per-turn overhead unchanged.
  • Not humanDelay / blockStreamingDefault / typingMode: set humanDelay.mode=off, blockStreamingDefault=off, typingMode=instant. Per-turn overhead unchanged.
  • Not the LLM call itself: Direct LiteLLM curl to claude-haiku-4-5 returns in 0.6-2s. Direct curl to ollama for the local Qwen wrapper returns in ~2s warm. The LLM is fast.
  • Not BFS/cloudflared inbound transit: tcpdump on loopback :3978 shows the request reaching the gateway within ~1s of the user hitting send in Teams. The gap is INSIDE the gateway, after received message → dispatching to agent and before any LLM call.
  • Not CLI cold start: tested via the gateway path (no --local, no CLI invocation per call). The per-turn overhead is observed via real Teams traffic going through cloudflared → msteams provider → gateway.
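The "LLM is fast" measurements above came from timing direct calls to the proxy. A minimal sketch of that kind of harness, assuming LiteLLM's OpenAI-compatible /chat/completions endpoint on 127.0.0.1:4000 as described in the environment section (the request shape is an assumption about the proxy, not taken from OpenClaw):

```python
import json
import time
import urllib.request

def time_call(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0

def ask_litellm(prompt):
    # Assumes LiteLLM's OpenAI-compatible chat endpoint on the local proxy.
    req = urllib.request.Request(
        "http://127.0.0.1:4000/chat/completions",
        data=json.dumps({
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the proxy running):
#   reply, elapsed = time_call(lambda: ask_litellm("Hi"))
#   print(f"LLM round trip: {elapsed:.2f}s")   # ~0.6-2s in our tests
```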

What seems to be happening in the 17s

We don't have visibility past the dispatching to agent log line: at the default log level, nothing is emitted between dispatch start and dispatch completion. Suspected sources (in priority order):

  1. System prompt / agent context build per turn — re-reading workspace bootstrap files, computing tool schemas, building the structured system message
  2. Subagent capability registration for sessions_spawn
  3. Provider client warmup / token refresh
  4. Some other synchronous initialization step

A debug-level log of the agent run pipeline (or a [agent.run] initialized in <ms>ms line bracketing the per-turn setup vs the LLM call) would make this trivial to diagnose for users.
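To illustrate the shape of the bracketing we mean — OpenClaw itself is a Node app, so this Python context manager is only a sketch of the suggested log format, not proposed implementation code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_phase(label):
    """Log '[agent.run] <label> in <ms>ms' around one pipeline phase."""
    t0 = time.perf_counter()
    yield
    ms = (time.perf_counter() - t0) * 1000
    print(f"[agent.run] {label} in {ms:.0f}ms")

# Example bracketing of the per-turn pipeline (phase names hypothetical):
# with timed_phase("context built"):
#     build_agent_context()
# with timed_phase("tool schemas"):
#     build_tool_schemas()
```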

Severity

Medium-high for CPU-only deployments. The 17s per-turn overhead puts a ~20s floor under end-to-end latency for any message, regardless of how fast the LLM is. This makes OpenClaw effectively unusable for interactive personal-assistant use cases on CPU-only hardware, even with cloud models. It also makes the value of fast/cheap cloud models like Haiku 4.5 invisible to operators on CPU hosts.

Workarounds attempted (none worked)

  • All the tunables listed in "What we ruled out" above
  • Restarting the gateway between calls — the overhead is per-turn, not first-call only
  • Using a streaming-disabled model entry (streaming: false) — no effect on the pre-LLM overhead

Suggested next actions

  1. Add structured timing logs at info level for the per-turn agent run pipeline:
    • [agent.run] context built in <ms>ms
    • [agent.run] tool schemas in <ms>ms
    • [agent.run] llm call started
    • [agent.run] llm call completed in <ms>ms
      This would let users diagnose without filing a vague issue like this one.
  2. Profile a single agent turn on a CPU-only host to find the actual hot spot
  3. Cache per-turn invariants (system prompt template, tool schemas) so they're built once at agent registration time, not on every turn
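Suggestion 3 amounts to memoizing per-agent invariants. A minimal sketch of the idea, with hypothetical function names that are not OpenClaw's actual internals:

```python
from functools import lru_cache

# Illustrative only: the point is that per-turn invariants can be built
# once per agent id and rebuilt only when the agent's config changes.

def _expensive_build_schemas(agent_id):
    # Stand-in for walking the tool registry, applying profile/allow/deny,
    # and serializing JSON schemas on every turn.
    return {"agent": agent_id, "tools": ["sessions_spawn"]}

@lru_cache(maxsize=None)
def tool_schemas(agent_id):
    # First call per agent pays the cost; later turns hit the cache.
    return _expensive_build_schemas(agent_id)

def invalidate():
    # Drop cached invariants when an agent's config or workspace changes.
    tool_schemas.cache_clear()
```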

Hardware note

This is a 2017-era Skylake CPU with no GPU. Faster CPUs (Sapphire Rapids, Genoa) would amortize the cost differently, but the underlying problem remains: the per-turn work scales with CPU speed instead of being effectively constant. For CPU-bound operators, 17s/turn is a floor that no model selection can fix.

Where we landed

We pivoted our router from local Qwen 2.5 7B (which was even slower on this CPU) to cloud Claude Haiku 4.5 expecting sub-3s end-to-end latency. We achieved sub-2s LLM time but the OpenClaw runtime adds 17s on top, giving us ~12s end-to-end perceived latency. This is acceptable but well below what's possible if the per-turn overhead were addressed.
