
Agent runtime per-turn startup overhead is ~17s on CPU-only systems, dominating end-to-end latency for fast cloud models #63357

@tmote

Description


Summary

On a CPU-only host, every agent turn pays a ~17 second per-turn initialization cost between the channel adapter receiving the inbound message and the user message being written to the session JSONL. This happens BEFORE the LLM is called. With a fast cloud model like Claude Haiku 4.5 (whose own gateway-side LLM call measures ~1-2s), this overhead dominates end-to-end latency to the point where Teams users see ~12s minimum response times for trivial messages.

Environment

  • OpenClaw 2026.4.5 (3e72c03)
  • Ubuntu 24.04.4 LTS, Node v24.14.1 via nvm
  • Hardware: 8 vCPU Intel Xeon Gold 6130 @ 2.10 GHz (2017 Skylake, AVX-512), no GPU, 16 GB RAM, no swap
  • Gateway runs as a systemd system unit (User=pleresadmin, LoadCredentialEncrypted= for secrets)
  • LiteLLM proxy on 127.0.0.1:4000 fronting Anthropic + other providers
  • Cloudflare Tunnel inbound for msteams

The router agent (where this manifests)

A router agent configured to triage inbound messages and delegate via sessions_spawn:

{
  "id": "router",
  "name": "router",
  "model": "litellm/claude-haiku-4-5",
  "memorySearch": { "enabled": false },
  "skills": [],
  "tools": {
    "profile": "minimal",
    "alsoAllow": ["sessions_spawn"],
    "deny": ["read","edit","write","exec","process","canvas","nodes","cron","message","tts","gateway","agents_list","sessions_list","sessions_history","sessions_send","sessions_yield","subagents","session_status","web_search","web_fetch","image","pdf","browser"]
  },
  "subagents": {
    "allowAgents": ["pleres","family","code","research"],
    "requireAgentId": true
  }
}

Steps to reproduce

  1. Configure an agent with the minimal config above (or any agent on a CPU-only host).
  2. Send a Teams "Hi" message via the configured msteams binding.
  3. Observe in /tmp/openclaw/openclaw-<date>.log:
    2026-04-08T19:49:04  [msteams]  received message
    2026-04-08T19:49:04  [msteams]  dispatching to agent
    ...                  (~17 seconds of silence)
    
  4. Observe in ~/.openclaw/agents/router/sessions/<session>.jsonl:
    "timestamp": "2026-04-08T19:49:21.522Z"   # user message written
    "timestamp": "2026-04-08T19:49:23.215Z"   # assistant reply written (LLM took 1.7s)
    
  5. Observe [msteams] dispatch complete at 19:49:24.

Total dispatch time: ~20 seconds. LLM call: ~2 seconds. Per-turn startup overhead: ~17 seconds.
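For reference, the breakdown above can be recomputed directly from the timestamps quoted in the repro steps (small Python sketch; the values are the ones from the trace, with the JSONL's UTC offset dropped since all four stamps share it):

```python
from datetime import datetime

# Timestamps taken from the gateway log and session JSONL above.
dispatch_start = datetime.fromisoformat("2026-04-08T19:49:04")      # [msteams] dispatching to agent
user_written   = datetime.fromisoformat("2026-04-08T19:49:21.522")  # user message in JSONL
reply_written  = datetime.fromisoformat("2026-04-08T19:49:23.215")  # assistant reply in JSONL
dispatch_done  = datetime.fromisoformat("2026-04-08T19:49:24")      # [msteams] dispatch complete

startup = (user_written - dispatch_start).total_seconds()  # pre-LLM overhead
llm     = (reply_written - user_written).total_seconds()   # LLM round trip
total   = (dispatch_done - dispatch_start).total_seconds()

print(f"startup={startup:.1f}s llm={llm:.1f}s total={total:.1f}s")
# → startup=17.5s llm=1.7s total=20.0s
```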

What we ruled out

We isolated this carefully across many iterations:

  • Not memorySearch: disabling memorySearch.enabled on the agent saved ~10s (22s → 12s), but a ~10s residual remained; even with memorySearch fully off, the trace above still shows ~17s of per-turn overhead before the LLM call.
  • Not MCP servers: removed all MCP servers (microsoft-azure, microsoft-mssql, microsoft-office365) — they were each adding ~30s of bundle-mcp timeout per call. After removing them, the residual ~17s overhead remained.
  • Not tool schemas: tools.profile=minimal + explicit deny of 23 tools (only sessions_spawn allowed) — verified via tcpdump that the structured tools list passed to the LLM is small. Per-turn overhead unchanged.
  • Not bootstrap files: trimmed the agent's workspace bootstrap from 5668 chars (~1417 tokens) to 1447 chars (~362 tokens). Per-turn overhead unchanged.
  • Not humanDelay / blockStreamingDefault / typingMode: set humanDelay.mode=off, blockStreamingDefault=off, typingMode=instant. Per-turn overhead unchanged.
  • Not the LLM call itself: Direct LiteLLM curl to claude-haiku-4-5 returns in 0.6-2s. Direct curl to ollama for the local Qwen wrapper returns in ~2s warm. The LLM is fast.
  • Not BFS/cloudflared inbound transit: tcpdump on loopback :3978 shows the request reaching the gateway within ~1s of the user hitting send in Teams. The gap is INSIDE the gateway, after received message → dispatching to agent and before any LLM call.
  • Not CLI cold start: tested via the gateway path (no --local, no CLI invocation per call). The per-turn overhead is observed via real Teams traffic going through cloudflared → msteams provider → gateway.
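The "LLM is fast" measurements above came from timing direct calls to the proxy. A minimal sketch of that kind of harness, assuming LiteLLM's OpenAI-compatible /chat/completions endpoint on 127.0.0.1:4000 as described in the environment section (the request shape is an assumption about the proxy, not taken from OpenClaw):

```python
import json
import time
import urllib.request

def time_call(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0

def ask_litellm(prompt):
    # Assumes LiteLLM's OpenAI-compatible chat endpoint on the local proxy.
    req = urllib.request.Request(
        "http://127.0.0.1:4000/chat/completions",
        data=json.dumps({
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the proxy running):
#   reply, elapsed = time_call(lambda: ask_litellm("Hi"))
#   print(f"LLM round trip: {elapsed:.2f}s")   # ~0.6-2s in our tests
```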

What seems to be happening in the 17s

We don't have visibility past the dispatching to agent log line: at the default log level, nothing is emitted between dispatch start and dispatch completion. Suspected sources (in priority order):

  1. System prompt / agent context build per turn — re-reading workspace bootstrap files, computing tool schemas, building the structured system message
  2. Subagent capability registration for sessions_spawn
  3. Provider client warmup / token refresh
  4. Some other synchronous initialization step

A debug-level log of the agent run pipeline (or a [agent.run] initialized in <ms>ms line bracketing the per-turn setup vs the LLM call) would make this trivial to diagnose for users.
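To illustrate the shape of the bracketing we mean — OpenClaw itself is a Node app, so this Python context manager is only a sketch of the suggested log format, not proposed implementation code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_phase(label):
    """Log '[agent.run] <label> in <ms>ms' around one pipeline phase."""
    t0 = time.perf_counter()
    yield
    ms = (time.perf_counter() - t0) * 1000
    print(f"[agent.run] {label} in {ms:.0f}ms")

# Example bracketing of the per-turn pipeline (phase names hypothetical):
# with timed_phase("context built"):
#     build_agent_context()
# with timed_phase("tool schemas"):
#     build_tool_schemas()
```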

Severity

Medium-high for CPU-only deployments. The 17s per-turn overhead puts a ~20s floor under end-to-end latency for any message, regardless of how fast the LLM is. This makes OpenClaw effectively unusable for interactive personal-assistant use cases on CPU-only hardware, even with cloud models. It also makes the value of fast/cheap cloud models like Haiku 4.5 invisible to operators on CPU hosts.

Workarounds attempted (none worked)

  • All the tunables listed in "What we ruled out" above
  • Restarting the gateway between calls — the overhead is per-turn, not first-call only
  • Using a streaming-disabled model entry (streaming: false) — no effect on the pre-LLM overhead

Suggested next actions

  1. Add structured timing logs at info level for the per-turn agent run pipeline:
    • [agent.run] context built in <ms>ms
    • [agent.run] tool schemas in <ms>ms
    • [agent.run] llm call started
    • [agent.run] llm call completed in <ms>ms
      This would let users diagnose without filing a vague issue like this one.
  2. Profile a single agent turn on a CPU-only host to find the actual hot spot
  3. Cache per-turn invariants (system prompt template, tool schemas) so they're built once at agent registration time, not on every turn
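Suggestion 3 amounts to memoizing per-agent invariants. A minimal sketch of the idea, with hypothetical function names that are not OpenClaw's actual internals:

```python
from functools import lru_cache

# Illustrative only: the point is that per-turn invariants can be built
# once per agent id and rebuilt only when the agent's config changes.

def _expensive_build_schemas(agent_id):
    # Stand-in for walking the tool registry, applying profile/allow/deny,
    # and serializing JSON schemas on every turn.
    return {"agent": agent_id, "tools": ["sessions_spawn"]}

@lru_cache(maxsize=None)
def tool_schemas(agent_id):
    # First call per agent pays the cost; later turns hit the cache.
    return _expensive_build_schemas(agent_id)

def invalidate():
    # Drop cached invariants when an agent's config or workspace changes.
    tool_schemas.cache_clear()
```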

Hardware note

This is a 2017-era Skylake CPU with no GPU. Faster CPUs (Sapphire Rapids, Genoa) would amortize the cost differently, but the underlying problem remains: the per-turn work scales with CPU speed instead of being effectively constant. For CPU-bound operators, 17s/turn is a floor that no model selection can fix.

Where we landed

We pivoted our router from local Qwen 2.5 7B (which was even slower on this CPU) to cloud Claude Haiku 4.5 expecting sub-3s end-to-end latency. We achieved sub-2s LLM time but the OpenClaw runtime adds 17s on top, giving us ~12s end-to-end perceived latency. This is acceptable but well below what's possible if the per-turn overhead were addressed.
