Bug: Agent runtime per-turn startup overhead is ~17s on CPU-only systems, dominating end-to-end latency for fast cloud models
Summary
On a CPU-only host, every agent turn pays a ~17-second initialization cost between the channel adapter receiving the inbound message and the user message being written to the session JSONL. This happens BEFORE the LLM is called. With a fast cloud model like Claude Haiku 4.5 (whose own gateway-side LLM call measures ~1-2s), this overhead dominates end-to-end latency to the point where Teams users see ~12s minimum response times for trivial messages.
Environment
- OpenClaw 2026.4.5 (3e72c03)
- Ubuntu 24.04.4 LTS, Node v24.14.1 via nvm
- Hardware: 8 vCPU Intel Xeon Gold 6130 @ 2.10 GHz (2017 Skylake, AVX-512), no GPU, 16 GB RAM, no swap
- Gateway runs as a systemd system unit (User=pleresadmin, LoadCredentialEncrypted= for secrets)
- LiteLLM proxy on 127.0.0.1:4000 fronting Anthropic + other providers
- Cloudflare Tunnel inbound for msteams
The router agent (where this manifests)
A router agent configured to triage inbound messages and delegate via sessions_spawn:
{
  "id": "router",
  "name": "router",
  "model": "litellm/claude-haiku-4-5",
  "memorySearch": { "enabled": false },
  "skills": [],
  "tools": {
    "profile": "minimal",
    "alsoAllow": ["sessions_spawn"],
    "deny": ["read","edit","write","exec","process","canvas","nodes","cron","message","tts","gateway","agents_list","sessions_list","sessions_history","sessions_send","sessions_yield","subagents","session_status","web_search","web_fetch","image","pdf","browser"]
  },
  "subagents": {
    "allowAgents": ["pleres","family","code","research"],
    "requireAgentId": true
  }
}
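As a quick sanity check on this config, here is a standalone sketch (not an OpenClaw feature; it only parses the JSON file above, whose tools.deny / tools.alsoAllow field names it assumes) that confirms the deny and alsoAllow lists don't contradict each other:

// check-router-config.ts — standalone sketch: verify the deny list and
// alsoAllow list don't overlap, and print the effective extra tool surface.
// Not an OpenClaw command; it just reads the agent config JSON given as argv[2].
import { readFileSync } from "node:fs";

const path = process.argv[2];
if (!path) throw new Error("usage: check-router-config <agent-config.json>");

const cfg = JSON.parse(readFileSync(path, "utf8"));
const deny = new Set<string>(cfg.tools?.deny ?? []);
const also: string[] = cfg.tools?.alsoAllow ?? [];

// An entry that is both explicitly allowed and explicitly denied is a config bug.
const conflicts = also.filter((t) => deny.has(t));
if (conflicts.length > 0) {
  console.error(`alsoAllow entries also present in deny: ${conflicts.join(", ")}`);
  process.exit(1);
}
console.log(`profile=${cfg.tools?.profile}, denied=${deny.size}, extra allows: ${also.join(", ")}`);

For the config above this reports profile=minimal, denied=23, extra allows: sessions_spawn.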
Steps to reproduce
- Configure an agent with the minimal config above (or any agent on a CPU-only host).
- Send a Teams "Hi" message via the configured msteams binding.
- Observe in /tmp/openclaw/openclaw-<date>.log:
2026-04-08T19:49:04 [msteams] received message
2026-04-08T19:49:04 [msteams] dispatching to agent
... (~17 seconds of silence)
- Observe in ~/.openclaw/agents/router/sessions/<session>.jsonl:
"timestamp": "2026-04-08T19:49:21.522Z" # user message written
"timestamp": "2026-04-08T19:49:23.215Z" # assistant reply written (LLM took 1.7s)
- Observe [msteams] dispatch complete at 19:49:24.
Total dispatch time: ~20 seconds. LLM call: ~2 seconds. Per-turn startup overhead: ~17 seconds.
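The gap can be computed mechanically from the two artifacts above rather than eyeballed. A rough sketch (the log timestamp format, its UTC reading, and the JSONL role/timestamp field names are all inferred from the excerpts in this report, not from any documented OpenClaw format):

// measure-gap.ts — compute the pre-LLM gap: last "dispatching to agent"
// gateway log line vs. first user entry in the session JSONL.
import { readFileSync } from "node:fs";

const [logPath, sessionPath] = process.argv.slice(2);

// Assumes log lines start with an ISO-like timestamp such as
// "2026-04-08T19:49:04" and that gateway timestamps are UTC (appended "Z").
const dispatchLine = readFileSync(logPath, "utf8")
  .split("\n")
  .filter((l) => l.includes("dispatching to agent"))
  .at(-1);
if (!dispatchLine) throw new Error("no 'dispatching to agent' line found");
const dispatchedAt = Date.parse(dispatchLine.slice(0, 19) + "Z");

// Assumes each JSONL entry carries "role" and "timestamp" fields.
const userEntry = readFileSync(sessionPath, "utf8")
  .split("\n")
  .filter((l) => l.trim().length > 0)
  .map((l) => JSON.parse(l))
  .find((e) => e.role === "user");
if (!userEntry) throw new Error("no user entry found in session JSONL");
const writtenAt = Date.parse(userEntry.timestamp);

console.log(`pre-LLM gap: ${((writtenAt - dispatchedAt) / 1000).toFixed(1)}s`);

On the trace above this yields ~17.5s (19:49:04 → 19:49:21.522).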
What we ruled out
We isolated this carefully across many iterations:
- Not memorySearch: disabling memorySearch.enabled on the agent saved ~10s (22s → 12s), but a ~10s residual remained. Even with memorySearch fully off, the per-turn overhead is still ~17s before the LLM call.
- Not MCP servers: removed all MCP servers (microsoft-azure, microsoft-mssql, microsoft-office365) — each was adding ~30s of bundle-mcp timeout per call. After removing them, the residual ~17s overhead remained.
- Not tool schemas: tools.profile=minimal + explicit deny of 23 tools (only sessions_spawn allowed) — verified via tcpdump that the structured tools list passed to the LLM is small. Per-turn overhead unchanged.
- Not bootstrap files: trimmed the agent's workspace bootstrap from 5668 chars (~1417 tokens) to 1447 chars (~362 tokens). Per-turn overhead unchanged.
- Not humanDelay / blockStreamingDefault / typingMode: set humanDelay.mode=off, blockStreamingDefault=off, typingMode=instant. Per-turn overhead unchanged.
- Not the LLM call itself: direct LiteLLM curl to claude-haiku-4-5 returns in 0.6-2s, and direct curl to ollama for the local Qwen wrapper returns in ~2s warm (a scripted version of this probe is sketched after this list). The LLM is fast.
- Not BFS (Bot Framework Service)/cloudflared inbound transit: tcpdump on loopback :3978 shows the request reaching the gateway within ~1s of the user hitting send in Teams. The gap is INSIDE the gateway, after received message → dispatching to agent and before any LLM call.
- Not CLI cold start: tested via the gateway path (no --local, no per-call CLI invocation). The per-turn overhead is observed with real Teams traffic going through cloudflared → msteams provider → gateway.
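The "LLM is fast" check from the list above, scripted for repeatability. A minimal sketch, assuming the LiteLLM proxy exposes its standard OpenAI-compatible /v1/chat/completions route on 127.0.0.1:4000 and takes a bearer key from a LITELLM_API_KEY environment variable (adjust the auth to your proxy setup):

// llm-probe.ts — time one round trip through the LiteLLM hop only,
// bypassing OpenClaw entirely. Run as an ES module (uses top-level await).
const start = performance.now();
const res = await fetch("http://127.0.0.1:4000/v1/chat/completions", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    authorization: `Bearer ${process.env.LITELLM_API_KEY ?? ""}`,
  },
  body: JSON.stringify({
    model: "claude-haiku-4-5",
    messages: [{ role: "user", content: "Hi" }],
    max_tokens: 16,
  }),
});
await res.json(); // drain the body so the full round trip is timed
console.log(`LLM round trip: ${((performance.now() - start) / 1000).toFixed(2)}s (status ${res.status})`);

Consistently seeing 0.6-2s here while the end-to-end turn takes ~20s localizes the problem to the gateway's per-turn setup.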
What seems to be happening in the 17s
At the default log level we have no visibility past the dispatching to agent line; nothing at all is logged between dispatch start and dispatch completion. Suspected sources, in priority order:
- System prompt / agent context build per turn — re-reading workspace bootstrap files, computing tool schemas, building the structured system message
- Subagent capability registration for sessions_spawn
- Provider client warmup / token refresh
- Some other synchronous initialization step
A debug-level log of the agent run pipeline (or an [agent.run] initialized in <ms>ms line bracketing the per-turn setup separately from the LLM call) would make this trivial for users to diagnose.
Severity
Medium-high for CPU-only deployments. The 17s per-turn overhead puts a ~20s floor under end-to-end latency for any message, regardless of how fast the LLM is. This makes OpenClaw effectively unusable for interactive personal-assistant use cases on CPU-only hardware, even with cloud models. It also makes the value of fast/cheap cloud models like Haiku 4.5 invisible to operators on CPU hosts.
Workarounds attempted (none worked)
- All the tunables listed in "What we ruled out" above
- Restarting the gateway between calls — the overhead is per-turn, not first-call only
- Using a streaming-disabled model entry (streaming: false) — no effect on the pre-LLM overhead
Suggested next actions
- Add structured timing logs at info level for the per-turn agent run pipeline:
  [agent.run] context built in <ms>ms
  [agent.run] tool schemas in <ms>ms
  [agent.run] llm call started
  [agent.run] llm call completed in <ms>ms
  This would let users self-diagnose instead of filing a vague issue like this one.
- Profile a single agent turn on a CPU-only host to find the actual hot spot
- Cache per-turn invariants (system prompt template, tool schemas) so they're built once at agent registration time, not on every turn (a sketch of both the timing logs and the caching follows below)
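Since OpenClaw's internals aren't visible to us, here is a purely hypothetical sketch of what the bracketing plus invariant caching could look like. Every function name below is invented for illustration and none of them correspond to actual OpenClaw code:

// agent-run-timing.ts — hypothetical sketch only; these names are invented
// and do not correspond to real OpenClaw internals.
type ToolSchema = { name: string; description: string };

// Emit the structured timing line proposed above around any pipeline phase.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const t0 = performance.now();
  const result = await fn();
  console.info(`[agent.run] ${label} in ${Math.round(performance.now() - t0)}ms`);
  return result;
}

// Stand-in for whatever expensive per-turn schema work actually happens.
async function buildToolSchemas(agentId: string): Promise<ToolSchema[]> {
  return [{ name: "sessions_spawn", description: `spawn a subagent for ${agentId}` }];
}

// Per-agent cache: schemas are invariant across turns, so build them once
// (effectively at registration time) and reuse them on every later turn.
const schemaCache = new Map<string, ToolSchema[]>();

async function getToolSchemas(agentId: string): Promise<ToolSchema[]> {
  let schemas = schemaCache.get(agentId);
  if (!schemas) {
    schemas = await timed("tool schemas", () => buildToolSchemas(agentId));
    schemaCache.set(agentId, schemas);
  }
  return schemas;
}

// A turn then brackets each phase, making a 17s gap attributable at a glance.
async function runTurn(agentId: string, userMessage: string): Promise<string> {
  const schemas = await getToolSchemas(agentId); // hits the cache after turn 1
  const context = await timed("context built", async () => `system+${userMessage}`);
  return timed("llm call completed", async () => `reply to ${context} using ${schemas.length} tools`);
}

With lines like these emitted at info level, a report like this one could name the exact phase that eats the ~17s instead of reconstructing it from tcpdump and JSONL timestamps.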
Hardware note
This is a 2017-era Skylake CPU with no GPU. Faster CPUs (Sapphire Rapids, Genoa) would amortize this cost differently, but the underlying issue is that the per-turn work scales with CPU speed instead of being effectively constant. For CPU-bound operators, 17s/turn is a floor that no model selection can fix.
Where we landed
We pivoted our router from local Qwen 2.5 7B (which was even slower on this CPU) to cloud Claude Haiku 4.5, expecting sub-3s end-to-end latency. We achieved sub-2s LLM time, but the OpenClaw runtime adds ~17s on top, leaving us at ~12s perceived end-to-end latency. This is workable, but well short of what would be possible if the per-turn overhead were addressed.