[Bug]: First-load RPC fanout: tts.status monopolizes event loop ~1.5s and applyPluginAutoEnable recomputes 8× per fanout #81355

@xiaohuaxi

Bug type

Behavior bug

Summary

On cold start, dashboard / UI clients issue 9–10 RPCs concurrently against the gateway. Two independent issues cause this fanout to take 1.3–2.7 s instead of completing in parallel:

  • (A) "tts.status" is declared async but contains zero await expressions. It runs ~1.5 s of synchronous code (TTS config resolution, provider scanning, plus a synchronous readFileSync inside readPrefs) before returning, monopolizing the event loop and starving every sibling handler on the same connection.
  • (B) applyPluginAutoEnable(...) is invoked 8 times per fanout with the same config object reference and the same process.env — ~75 ms × 8 ≈ 600 ms of redundant pure-CPU work.

Together these account for ~2.1 s of avoidable main-thread occupancy on every cold-start fanout. They are logically independent and can be addressed in separate PRs.

Steps to reproduce

  1. Fresh-start the gateway: openclaw gateway restart.
  2. Load the gateway dashboard (or any UI/MCP client that issues the standard read-only fanout).
  3. The dashboard typically issues: sessions.list, status, models.list, usage.cost, tts.status, channels.status, tools.catalog × N agents — all in the same WebSocket frame batch.
  4. Measure per-RPC latency in the client (or instrument the handlers with hrtime probes).

Expected behavior

  • Independent read-only RPCs should complete concurrently; no single handler should block sibling RPCs sharing the same connection.
  • Pure, deterministic helpers like applyPluginAutoEnable should not recompute the same answer 8 times for the same input within a single fanout.

Actual behavior

Measured cold-start fanout on v2026.5.7, gateway freshly restarted, single dashboard load:

Handler             RESP time
tts.status          1566 ms
channels.status     646 ms (handler ENTER deferred ~1.5 s after WS frame arrival)
models.list         2177 ms
status              2296 ms
usage.cost          2592 ms
sessions.list       2662 ms
tools.catalog × 3   186 / 216 / 224 ms (serialized, back-to-back)

Total wall time is ~2.7 s. A main-thread heartbeat probe (a 5 ms setTimeout loop that alerts when the gap between ticks exceeds 80 ms) keeps reporting over-threshold gaps across the entire 2.7 s window: the event loop is effectively saturated for the whole fanout.
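
For readers who want to reproduce the heartbeat measurement, a probe of this kind can be sketched as follows. This is a minimal illustration, not the probe used for the numbers above; startHeartbeat and the Stall type are hypothetical names.

```typescript
// Hypothetical event-loop heartbeat; names and shape are illustrative only.
type Stall = { at: number; gapMs: number };

function startHeartbeat(
  intervalMs: number,
  thresholdMs: number,
  onStall: (stall: Stall) => void,
): () => void {
  let last = Date.now();
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = (): void => {
    if (stopped) return;
    const now = Date.now();
    const gapMs = now - last;
    // A gap far above the interval means something held the thread.
    if (gapMs > thresholdMs) onStall({ at: now, gapMs });
    last = now;
    timer = setTimeout(tick, intervalMs);
  };
  timer = setTimeout(tick, intervalMs);

  // The returned function stops the probe and clears the pending timer.
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

With the settings described above this would be started as startHeartbeat(5, 80, (s) => console.warn(`event loop stalled for ${s.gapMs} ms`)) and stopped by calling the returned function.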


Bug (A) — tts.status handler synchronously blocks the event loop ~1.5 s

Source: src/gateway/server-methods/tts.ts:29

"tts.status": async ({ respond, context }) => {
  try {
    const cfg = context.getRuntimeConfig();
    const config = resolveTtsConfig(cfg);                        // ~200 ms
    const prefsPath = resolveTtsPrefsPath(config);
    const provider = getTtsProvider(config, prefsPath);          // ~347 ms (readPrefs → readFileSync)
    const persona = getTtsPersona(config, prefsPath);
    const autoMode = resolveTtsAutoMode({ config, prefsPath });
    const fallbackProviders = resolveTtsProviderOrder(provider, cfg)
      .slice(1)
      .filter((c) => isTtsProviderConfigured(config, c, cfg));   // ~905 ms (15 providers × isConfigured)
    const providerStates = listSpeechProviders(cfg).map(/* isConfigured per provider */); // ~114 ms
    respond(true, { /* ... */ });
  } catch (err) { /* ... */ }
}

The handler is declared async, but its body contains no await expression. Every helper it invokes is synchronous; several call readFileSync (readPrefs in extensions/speech-core/runtime-api.ts) or enumerate providers synchronously via isConfigured. The handler therefore runs ~1.5 s of synchronous CPU work and blocking I/O on the event-loop thread before returning, and no other callback or promise continuation can run during this window.

Per-segment probe data (cold-start, gateway-restarted run):

HND tts.status ENTER             @0.0 ms
  TS after getRuntimeConfig      @0.1 ms
  TS after resolveTtsConfig      @198.8 ms    ← 199 ms
  TS after resolveTtsPrefsPath   @199.0 ms
  TS after getTtsProvider        @546.0 ms    ← 347 ms (readFileSync inside readPrefs)
  TS after getTtsPersona         @546.1 ms
  TS after resolveTtsAutoMode    @546.3 ms
  TS after fallbackProviders     @1451.3 ms   ← 905 ms (slowest segment)
  TS after providerStates        @1565.8 ms   ← 114 ms
HND tts.status RESP +1565.8 ms
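
A probe producing output in roughly this shape could look like the sketch below. It is an assumption about how such instrumentation might be written, not the code that generated the data above: createProbe is a hypothetical helper, and performance.now() stands in for the hrtime calls mentioned in this report. The clock is injectable so the formatting can be checked deterministically.

```typescript
// Hypothetical segment probe; `createProbe` and its output format are illustrative.
function createProbe(label: string, now: () => number = () => performance.now()) {
  const start = now();
  let previous = start;
  const lines: string[] = [`HND ${label} ENTER @0.0 ms`];
  return {
    // Record the time elapsed since ENTER and since the previous mark.
    mark(segment: string): void {
      const t = now();
      const total = (t - start).toFixed(1);
      const delta = (t - previous).toFixed(1);
      lines.push(`  TS after ${segment} @${total} ms (+${delta} ms)`);
      previous = t;
    },
    // Close the probe and return all recorded lines.
    finish(): string[] {
      lines.push(`HND ${label} RESP +${(now() - start).toFixed(1)} ms`);
      return lines;
    },
  };
}
```

Inside a handler this would be used as const probe = createProbe("tts.status"), followed by probe.mark("resolveTtsConfig") after each helper call and console.log(probe.finish().join("\n")) just before responding.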

Because tts.status enters its handler in the same tick as four sibling handlers (sessions.list, status, models.list, usage.cost) but never yields, all sibling handlers' awaits resolve only after tts.status returns. The dashboard's channels.status request, which arrived in the same WS frame batch, does not even enter its handler until 1.5 s after the others. This single handler accounts for the entire "front-block" segment of the cold-start fanout.
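
The starvation effect can be reproduced in isolation. The sketch below uses hypothetical stand-in handlers (blockingHandler, siblingHandler) rather than the real gateway code: a function declared async but containing no await runs to completion before any sibling continuation is resumed.

```typescript
const log: string[] = [];

function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin: stands in for synchronous config/provider work */ }
}

// Declared async, but with no await the whole body runs in one synchronous stretch.
async function blockingHandler(): Promise<void> {
  log.push("blocking:enter");
  busyWait(50);
  log.push("blocking:exit");
}

// A well-behaved sibling that suspends once and must be resumed by the event loop.
async function siblingHandler(): Promise<void> {
  log.push("sibling:enter");
  await Promise.resolve();
  log.push("sibling:resumed");
}

// Both handlers are started in the same tick, as with one WS frame batch.
function dispatchBatch(): Promise<void[]> {
  return Promise.all([siblingHandler(), blockingHandler()]);
}
```

Even though the sibling starts first and awaits an already-resolved promise, its continuation is logged only after blocking:exit, which mirrors the ordering seen in the measurements above.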

Suggested fixes (any subset would help, in roughly descending impact):

  1. Convert the synchronous I/O helpers to async (readPrefs → fs.promises.readFile) and await them, so the handler yields several times during execution.
  2. Run isConfigured across providers concurrently via Promise.all (each call is independent of the others). This only overlaps work once isConfigured is itself asynchronous; Promise.all over synchronous calls still runs them back-to-back. The current .filter(...isTtsProviderConfigured) is the single largest segment (~900 ms across 15 providers).
  3. Cache isConfigured(provider, cfg) for the lifetime of a single cfg reference. This is useful because both fallbackProviders and providerStates enumerate the same providers back-to-back.
  4. Even as a stopgap, yield between the heavy synchronous segments so sibling handlers can interleave. Note that await Promise.resolve() only drains the microtask queue; to let timers and I/O callbacks run as well, yield a full macrotask, e.g. await new Promise((resolve) => setImmediate(resolve)).
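
As a rough illustration of fixes 3 and 4 combined, the sketch below memoizes a per-provider check for the lifetime of a config object and yields a macrotask between providers. All names here (yieldToEventLoop, isConfiguredCached, filterConfigured) are hypothetical, the isConfigured callback is a stand-in for the real provider check, and setTimeout(0) is used for portability where Node.js code would more commonly use setImmediate.

```typescript
// Yield a full macrotask so timers and I/O callbacks can run, not just microtasks.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Per-config memo: cached results live only as long as the config object does.
const configuredCache = new WeakMap<object, Map<string, boolean>>();

function isConfiguredCached(
  cfg: object,
  providerId: string,
  isConfigured: (providerId: string) => boolean,
): boolean {
  let perCfg = configuredCache.get(cfg);
  if (!perCfg) {
    perCfg = new Map();
    configuredCache.set(cfg, perCfg);
  }
  let value = perCfg.get(providerId);
  if (value === undefined) {
    value = isConfigured(providerId);
    perCfg.set(providerId, value);
  }
  return value;
}

async function filterConfigured(
  cfg: object,
  providerIds: string[],
  isConfigured: (providerId: string) => boolean,
): Promise<string[]> {
  const kept: string[] = [];
  for (const id of providerIds) {
    if (isConfiguredCached(cfg, id, isConfigured)) kept.push(id);
    await yieldToEventLoop(); // let sibling handlers, timers, and I/O make progress
  }
  return kept;
}
```

Yielding after every provider adds a small fixed delay per iteration; yielding every few providers, or only between the large segments, would be a reasonable variation.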

Bug (B) — applyPluginAutoEnable recomputes the same result 8× per fanout

Source: src/config/plugin-auto-enable.apply.ts:34

export function applyPluginAutoEnable(params: {
  config?: OpenClawConfig;
  env?: NodeJS.ProcessEnv;
  manifestRegistry?: PluginManifestRegistry;
}): PluginAutoEnableResult {
  const candidates = detectPluginAutoEnableCandidates(params);
  return materializePluginAutoEnableCandidates({
    config: params.config,
    candidates,
    env: params.env,
    manifestRegistry: params.manifestRegistry,
  });
}

The function is pure on its inputs (config, env, manifestRegistry). During one dashboard fanout, it is invoked 8 times across the read-only RPC paths:

Calls  Caller
2      channels.status (entry + getRuntimeSnapshot inside the handler)
6      tools.catalog × 3 agents (each calls it twice via ensureStandalonePluginToolRegistryLoaded + resolvePluginTools)
8      Total per fanout

Identity check via WeakMap instrumentation on the inputs:

  • All 8 calls during a fanout receive the same config object reference: context.getRuntimeConfig() returns an identity-stable snapshot within a fanout window.
  • All 8 calls receive params.env === process.env (same identity).

So every call recomputes an answer that already exists. Single-call cost is ~75 ms (≈55 ms detect + ≈22 ms materialize), giving 8 × 75 ms ≈ 600 ms of redundant synchronous CPU per fanout.
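
An identity check of this kind can be done with a small counter keyed on object identity. The following is a sketch of the general approach only; recordIdentity and instrument are hypothetical names, and the actual instrumentation used for this report may differ.

```typescript
// Hypothetical instrumentation: counts how often the same object identity is seen.
const seen = new WeakMap<object, number>();

function recordIdentity(value: object): number {
  const count = (seen.get(value) ?? 0) + 1;
  seen.set(value, count);
  return count; // 1 on first sight, >1 when the same reference comes back
}

// Wrap a function so every call reports repeated config references.
function instrument<P extends { config: object }, R>(
  fn: (params: P) => R,
): (params: P) => R {
  return (params) => {
    const n = recordIdentity(params.config);
    if (n > 1) console.debug(`same config reference seen ${n} times`);
    return fn(params);
  };
}
```

Because the WeakMap compares keys by reference, two structurally equal config objects are counted separately, which is what distinguishes "same snapshot" from "equal contents".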

Suggested fix — two-level WeakMap keyed on object identity:

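// Outer key: config snapshot; inner key: env object. Both are held weakly.
const cache = new WeakMap<object, WeakMap<object, PluginAutoEnableResult>>();

// computeAutoEnable stands for the existing uncached body of applyPluginAutoEnable
// (detectPluginAutoEnableCandidates + materializePluginAutoEnableCandidates).
export function applyPluginAutoEnable(params: {
  config?: OpenClawConfig;
  env?: NodeJS.ProcessEnv;
  manifestRegistry?: PluginManifestRegistry;
}): PluginAutoEnableResult {
  const { config, env } = params;
  // Fall through to the uncached path when either key is missing.
  if (!config || !env) return computeAutoEnable(params);
  let inner = cache.get(config);
  if (!inner) {
    inner = new WeakMap();
    cache.set(config, inner);
  }
  const hit = inner.get(env);
  if (hit) return hit;
  const result = computeAutoEnable(params);
  inner.set(env, result);
  return result;
}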

Because both keys are WeakMap-able objects, entries are collected automatically when a new runtime config snapshot rotates in. manifestRegistry is identity-stable for the same config in our measurements, so the two-level key on (config, env) is sufficient; a single-level WeakMap<config, result> would also work in practice and is even simpler.

Measured hit rate on a real fanout: 7 of 8 calls become cache hits, saving ~525 ms.
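
The same idea can be written as a small generic wrapper, which also makes the 7-of-8 hit pattern easy to verify in isolation. This is a sketch under the assumption that the two keys fully determine the result; memoizeByIdentity is a hypothetical helper name, not an existing OpenClaw function.

```typescript
// Generic two-level identity memo; `memoizeByIdentity` is a hypothetical helper.
function memoizeByIdentity<A extends object, B extends object, R>(
  compute: (a: A, b: B) => R,
): (a: A, b: B) => R {
  const cache = new WeakMap<A, WeakMap<B, { value: R }>>();
  return (a, b) => {
    let inner = cache.get(a);
    if (!inner) {
      inner = new WeakMap();
      cache.set(a, inner);
    }
    const hit = inner.get(b);
    if (hit) return hit.value; // boxed so that falsy results still count as hits
    const value = compute(a, b);
    inner.set(b, { value });
    return value;
  };
}
```

One consequence of any such cache is that all callers receive the same result object, so the result should be treated as immutable (or frozen) to avoid one caller's mutation leaking into another's view.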

OpenClaw version

2026.5.7 (commit eeef486449)

Operating system

WSL2 (Ubuntu 24.04 on Windows 11), Node.js v22.21.1

Model

N/A

Provider / routing chain

N/A

Install method

npm install -g openclaw (running as a systemd user service)

Logs, screenshots, and evidence

All latency numbers above come from hrtime probes inserted at the handler call sites in a freshly restarted gateway during a single dashboard load. No sensitive paths or credentials are included.

Additional information

The two bugs compound: while tts.status holds the event loop for ~1.5 s, sibling handlers' lazy-import I/O (status → loadStatusSummaryRuntimeModule, models.list → loadModelsListCatalog, etc.) can resolve I/O in the background, but their resumed microtasks queue up behind tts.status. Once tts.status returns, the siblings all resolve nearly simultaneously and immediately encounter the redundant applyPluginAutoEnable work along the channels.status and tools.catalog paths.

Estimated impact of fixing both bugs (extrapolated from the probe data, not measured under a patched build):

  • Fix (A) alone: cold-start fanout total drops from ~2.7 s to ~1.2 s (siblings can finally overlap).
  • Fix (A) + (B): drops to ~500–700 ms.

These two issues are logically independent — they share only the surface symptom ("dashboard cold start feels slow"), not their root cause. We are happy to split them into separate issues if that better fits OpenClaw's triage workflow.


Reported by the CoClaw team.
This issue was discovered while developing @coclaw/openclaw-coclaw, a CoClaw channel plugin for OpenClaw.
