Summary
Provider-qualified default model resolution eagerly builds and normalizes the full configured-model alias index on the inbound reply hot path; with a 97-entry model catalog this blocked the gateway event loop for ~80-85s before the embedded agent startup logs appeared.
Steps to reproduce
1. Run OpenClaw from source at 87cd6b3e923fcb8a4869dc35e5b582103be85e51 / package version 2026.5.24-beta.1 on Linux with the gateway daemon.
2. Configure the default agent model as a provider-qualified primary model, for example:
   - agents.defaults.model.primary = "openai/gpt-5.5"
   - the routed agent inherits or uses the same provider-qualified default
3. Configure a large agents.defaults.models catalog. The observed case has 97 entries.
4. Send a normal inbound WhatsApp group message that does not contain a model directive, heartbeat override, or explicit model selection.
5. Observe that the gateway logs event-loop starvation before the embedded agent startup-stage trace appears.
6. Profile the same config with a read-only harness around the model-selection helpers. In the observed config:
   - resolveDefaultModelForAgent({ cfg, agentId: "main" }): 50154.7ms
   - buildModelAliasIndex({ cfg, defaultProvider: "openai" }): 38774.5ms
   - buildConfiguredModelCatalog({ cfg }): 0.7ms
   - parsing the first 20 alias keys with plugin normalization enabled: 29418.9ms
   - the same 20-key parse with plugin normalization disabled: 0.4ms
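For anyone repeating the profiling step, a read-only timing wrapper can be as small as the sketch below. `timeSync` is a hypothetical helper written for this report, not an OpenClaw API, and the summing loop is only a stand-in for the real helper call being measured.

```typescript
// Hypothetical read-only timing helper; not part of OpenClaw.
function timeSync<T>(label: string, fn: () => T): { label: string; ms: number; result: T } {
  const start = performance.now();
  const result = fn(); // runs synchronously, so the measured time is event-loop blocking time
  const ms = performance.now() - start;
  return { label, ms, result };
}

// Stand-in workload; swap in the model-selection helper under test.
const sample = timeSync("sum 1..1000", () => {
  let total = 0;
  for (let i = 1; i <= 1000; i++) total += i;
  return total;
});

console.log(`${sample.label}: ${sample.ms.toFixed(1)}ms`);
```

Because the wrapped call is synchronous, the reported duration is also a lower bound on how long the event loop was unavailable during that call.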
Expected behavior
A normal inbound reply using an already provider-qualified default model should not synchronously build and normalize the full configured-model alias index before starting the agent run.
In this path, OpenClaw should resolve openai/gpt-5.5 cheaply, only build alias data if an alias is actually needed, and avoid blocking the gateway event loop long enough to delay unrelated timers and channel health checks.
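A minimal sketch of the expected behavior, under stated assumptions: every name here (`parseQualifiedRef`, `lazy`, `buildAliasIndex`, `resolveDefault`) is hypothetical and does not mirror the real OpenClaw helpers, the catalog key `example/model-b` is made up, and the sketch assumes a provider-qualified value may be resolved before any alias lookup, which may not hold if alias matching must take precedence for compatibility.

```typescript
type ModelRef = { provider: string; model: string };
type ModelEntry = { alias?: string };
type AliasIndex = Map<string, ModelRef>;

// Split "provider/model" on the first slash only; bare names return null.
function parseQualifiedRef(raw: string): ModelRef | null {
  const trimmed = raw.trim();
  const slash = trimmed.indexOf("/");
  if (slash <= 0 || slash === trimmed.length - 1) return null;
  return { provider: trimmed.slice(0, slash), model: trimmed.slice(slash + 1) };
}

// Memoize a builder so it runs at most once, on first use.
function lazy<T>(build: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (!cached) cached = { value: build() };
    return cached.value;
  };
}

let aliasBuilds = 0; // counts how often the (potentially expensive) index is built

function buildAliasIndex(models: Record<string, ModelEntry>): AliasIndex {
  aliasBuilds += 1;
  const index: AliasIndex = new Map();
  for (const [key, entry] of Object.entries(models)) {
    const ref = parseQualifiedRef(key);
    if (entry.alias && ref) index.set(entry.alias.toLowerCase(), ref);
  }
  return index;
}

function resolveDefault(primary: string, models: Record<string, ModelEntry>) {
  const getAliasIndex = lazy(() => buildAliasIndex(models));
  // Assumption: a provider-qualified value never needs the alias index.
  const direct = parseQualifiedRef(primary);
  const ref = direct ?? getAliasIndex().get(primary.trim().toLowerCase()) ?? null;
  return { ref, getAliasIndex };
}

const models: Record<string, ModelEntry> = {
  "openai/gpt-5.5": {},
  "example/model-b": { alias: "fast" }, // hypothetical catalog entry
};

const qualified = resolveDefault("openai/gpt-5.5", models);
const buildsAfterQualified = aliasBuilds;
const viaAlias = resolveDefault("fast", models);
const buildsAfterAlias = aliasBuilds;
```

In this sketch the provider-qualified call leaves the alias index unbuilt, while the returned `getAliasIndex` getter remains available for a later directive or override path.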
Actual behavior
The inbound reply path calls resolveDefaultModel() before model directives are known to be needed. That helper resolves the default model and also eagerly builds the full alias index.
Observed source path at 87cd6b3e923fcb8a4869dc35e5b582103be85e51:
src/auto-reply/reply/get-reply.ts:252 calls resolveDefaultModel({ cfg, agentId }) for every inbound reply.
src/auto-reply/reply/directive-handling.defaults.ts:13-22 calls both resolveDefaultModelForAgent(...) and buildModelAliasIndex(...).
src/agents/model-selection-shared.ts:572-581 builds the alias index before checking whether the configured default model already contains a provider slash.
src/agents/model-selection-shared.ts:401-420 parses/normalizes each configured-model key before checking whether the entry actually has an alias.
This made the user-visible "pre-agent" delay much larger than the later embedded startup-stage trace suggested. The startup-stage trace accounted for about 18s, while the liveness/fetch-timeout logs showed the event loop had already been blocked for ~80-85s.
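The ordering problem in the alias-index loop can be illustrated with a small stand-alone sketch. Everything here is an assumption for illustration: `normalizeModelKey` stands in for the expensive plugin/manifest normalization, the catalog keys are invented, and the split of 2 aliased entries out of 97 is arbitrary (the report does not state how many entries define an alias).

```typescript
type CatalogEntry = { alias?: string };

// Stand-in for the expensive normalization step (hypothetical).
let normalizeCalls = 0;
function normalizeModelKey(key: string): string {
  normalizeCalls += 1;
  return key.trim().toLowerCase();
}

// Ordering described in the report: normalize every key, then check for an alias.
function buildIndexEager(models: Record<string, CatalogEntry>): Map<string, string> {
  const index = new Map<string, string>();
  for (const [key, entry] of Object.entries(models)) {
    const normalized = normalizeModelKey(key);
    const alias = entry.alias?.trim();
    if (alias) index.set(alias.toLowerCase(), normalized);
  }
  return index;
}

// Alias-first ordering: entries without an alias never reach normalization.
function buildIndexAliasFirst(models: Record<string, CatalogEntry>): Map<string, string> {
  const index = new Map<string, string>();
  for (const [key, entry] of Object.entries(models)) {
    const alias = entry.alias?.trim();
    if (!alias) continue;
    index.set(alias.toLowerCase(), normalizeModelKey(key));
  }
  return index;
}

// Hypothetical 97-entry catalog in which only the first two entries define an alias.
const catalog: Record<string, CatalogEntry> = {};
for (let i = 0; i < 97; i++) {
  catalog[`example/model-${i}`] = i < 2 ? { alias: `alias-${i}` } : {};
}

normalizeCalls = 0;
const eagerIndex = buildIndexEager(catalog);
const eagerCalls = normalizeCalls;

normalizeCalls = 0;
const aliasFirstIndex = buildIndexAliasFirst(catalog);
const aliasFirstCalls = normalizeCalls;

console.log({ eagerCalls, aliasFirstCalls });
```

Both functions produce the same index; only the number of normalization calls differs, which is where the observed cost appears to concentrate.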
OpenClaw version
2026.5.24-beta.1 from source checkout commit 87cd6b3e923fcb8a4869dc35e5b582103be85e51.
Operating system
Ubuntu Linux, kernel 6.17.0-14-generic, x86_64.
Install method
Source checkout built into a local gateway daemon/runtime.
Model
openai/gpt-5.5
Provider / routing chain
OpenClaw gateway -> OpenAI provider, using a provider-qualified configured model ID. No model-router/proxy behavior is required to reproduce the model-resolution overhead.
Additional provider/model setup details
The default model was already configured as openai/gpt-5.5, and the routed agent used the same effective default. The config also had 97 entries under agents.defaults.models.
The slow path appears tied to configured-model normalization and alias-index construction, not to the model provider call itself. The delay happens before useful agent execution starts.
Logs, screenshots, and evidence
# Sanitized gateway evidence from a real inbound WhatsApp group reply.
# Private channel IDs, hostnames, and token-bearing URLs are intentionally omitted.
[diagnostic] liveness warning:
reasons=event_loop_delay,event_loop_utilization,cpu
eventLoopDelayP99Ms=85765.1
eventLoopDelayMaxMs=85765.1
eventLoopUtilization=0.999
active=1
work=[active=agent:main:whatsapp:group:<redacted>(processing/embedded_run,q=1,age=115s last=embedded_run:started)]
[fetch-timeout] fetch timeout after 10000ms:
elapsed=45329ms
timer delayed=35329ms
[fetch-timeout] fetch timeout after 10000ms:
elapsed=85268ms
timer delayed=75268ms
[diagnostic] liveness warning:
eventLoopDelayP99Ms=85228.3
eventLoopUtilization=1
active=1
work=[active=agent:main:whatsapp:group:<redacted>(processing/embedded_run,q=1,age=114s last=embedded_run:started)]
[agent/embedded] [trace:embedded-run] startup stages:
totalMs=18092
stages=workspace:1ms,
runtime-plugins:1ms,
hooks:0ms,
model-resolution:1236ms,
auth:2033ms,
context-engine:1ms,
attempt-workspace:14817ms,
attempt-prompt:0ms,
attempt-runtime-plan:3ms,
attempt-dispatch:0ms
# Read-only local timings against the same config:
resolveDefaultModelForAgent({ cfg, agentId: "main" }): 50154.7ms
buildModelAliasIndex({ cfg, defaultProvider: "openai" }): 38774.5ms
buildConfiguredModelCatalog({ cfg }): 0.7ms
parse first 20 alias keys with plugin normalization: 29418.9ms
parse first 20 alias keys with plugin normalization disabled: 0.4ms
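The arithmetic behind reading these logs can be checked with a short sketch. The numbers are copied from the evidence above; the helper names `sumStagesMs` and `timerDelayMs` are hypothetical, and the assumption is that "timer delayed" is simply elapsed time minus the configured timeout.

```typescript
// Stage durations copied from the startup-stage trace in the evidence above.
const stages =
  "workspace:1ms,runtime-plugins:1ms,hooks:0ms,model-resolution:1236ms,auth:2033ms," +
  "context-engine:1ms,attempt-workspace:14817ms,attempt-prompt:0ms," +
  "attempt-runtime-plan:3ms,attempt-dispatch:0ms";

// Sum the "<name>:<n>ms" entries of a comma-separated stage list.
function sumStagesMs(raw: string): number {
  return raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .reduce((total, part) => {
      const ms = Number(part.slice(part.lastIndexOf(":") + 1).replace(/ms$/, ""));
      return total + ms;
    }, 0);
}

// How late a timer fired: observed elapsed time minus the configured timeout.
function timerDelayMs(elapsedMs: number, timeoutMs: number): number {
  return elapsedMs - timeoutMs;
}

const stageTotalMs = sumStagesMs(stages); // matches totalMs=18092 in the trace
const firstDelayMs = timerDelayMs(45329, 10000); // matches "timer delayed=35329ms"
const secondDelayMs = timerDelayMs(85268, 10000); // matches "timer delayed=75268ms"

console.log({ stageTotalMs, firstDelayMs, secondDelayMs });
```

The stage list sums to the reported 18092ms, so the startup trace is internally consistent and the ~80-85s event-loop delay has to come from work outside those stages.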
Impact and severity
Affected: gateway inbound reply handling, observed through WhatsApp group messages routed to an embedded agent.
Severity: high availability/performance issue. The gateway event loop was blocked long enough to delay timers, produce channel fetch timeouts, and make the agent appear silent before it had meaningfully started.
Frequency: observed on 2/2 inbound messages in this configuration while investigating the incident. The cost scales with configured-model catalog size and plugin/model normalization work.
Consequence: messages can sit for over a minute before visible agent progress, and other gateway/channel work can time out because the Node event loop is saturated.
Additional information
This does not look like a WhatsApp-specific bug. WhatsApp made the symptom visible, but the hot path is shared model/default resolution before agent startup.
Related upstream context:
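#86552 (perf(agents): reuse manifest metadata during model resolution) overlaps with repeated manifest metadata loading and should reduce part of the cost, but this report covers an additional eager-work issue: alias-index construction and per-entry parsing happen even when the default model is already provider-qualified.
A related change, "perf(gateway): propagate config context in model normalization to avoid stale policy warning", is model-normalization context work, but it does not by itself address eager alias-index construction on inbound replies.
DefaultResourceLoader.reload() / attempt-workspace blocking matches the later ~15s attempt-workspace slice in the startup trace, but it does not explain the earlier ~80s pre-agent event-loop delay.
Potential fix direction:
Split default-model resolution from alias-index construction on the inbound reply path. Return a lazy alias-index getter or only build aliases once a directive/override path actually needs aliases.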
In resolveConfiguredModelRef, avoid building a full alias index before handling provider-qualified values like openai/gpt-5.5. If exact alias matching must remain ahead of provider-qualified parsing for compatibility, scan alias strings cheaply and only parse the matched entry.
In buildModelAliasIndex, read entryRaw.alias before parsing the model key, and skip normalization entirely for entries without aliases.
Add a regression test with a large agents.defaults.models catalog where most entries do not define alias, asserting that provider-qualified default resolution does not call plugin/manifest normalization for every catalog entry.
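A regression test along these lines could look like the sketch below. It is self-contained and framework-free on purpose: `resolveDefaultRef` is a hypothetical stand-in for the resolver under test, `makeCountingNormalizer` is a hypothetical spy for plugin/manifest normalization, and the catalog keys and alias distribution are invented. A real test would wire the spy into the actual OpenClaw helpers instead.

```typescript
type CatalogModel = { alias?: string };
type Normalizer = (key: string) => string;

// Hypothetical resolver under test: handles provider-qualified values first,
// otherwise scans raw alias strings and normalizes only the matched entry.
function resolveDefaultRef(
  primary: string,
  models: Record<string, CatalogModel>,
  normalize: Normalizer,
): string | null {
  const wanted = primary.trim();
  if (wanted.includes("/")) return normalize(wanted);
  for (const [key, entry] of Object.entries(models)) {
    if (entry.alias?.trim().toLowerCase() === wanted.toLowerCase()) return normalize(key);
  }
  return null;
}

// Spy that counts how many keys reach the (normally expensive) normalization step.
function makeCountingNormalizer(): { normalize: Normalizer; getCalls: () => number } {
  let calls = 0;
  return {
    normalize: (key) => {
      calls += 1;
      return key.trim().toLowerCase();
    },
    getCalls: () => calls,
  };
}

// Hypothetical large catalog: 97 entries, most without an alias.
function makeLargeCatalog(): Record<string, CatalogModel> {
  const models: Record<string, CatalogModel> = {};
  for (let i = 0; i < 97; i++) {
    models[`example/model-${i}`] = i % 10 === 0 ? { alias: `a${i}` } : {};
  }
  return models;
}

function testProviderQualifiedDefaultSkipsCatalogNormalization(): void {
  const spy = makeCountingNormalizer();
  const ref = resolveDefaultRef("openai/gpt-5.5", makeLargeCatalog(), spy.normalize);
  if (ref !== "openai/gpt-5.5") throw new Error(`unexpected ref: ${ref}`);
  if (spy.getCalls() > 1) throw new Error(`normalized ${spy.getCalls()} keys, expected at most 1`);
}

function testAliasDefaultNormalizesOnlyMatchedEntry(): void {
  const spy = makeCountingNormalizer();
  const ref = resolveDefaultRef("a10", makeLargeCatalog(), spy.normalize);
  if (ref !== "example/model-10") throw new Error(`unexpected ref: ${ref}`);
  if (spy.getCalls() !== 1) throw new Error(`normalized ${spy.getCalls()} keys, expected 1`);
}

testProviderQualifiedDefaultSkipsCatalogNormalization();
testAliasDefaultNormalizesOnlyMatchedEntry();
```

Asserting on the call count rather than on wall-clock time keeps the test deterministic and independent of how slow normalization happens to be on a given machine.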
Last known good version: NOT_ENOUGH_INFO
First known bad version: NOT_ENOUGH_INFO
AI assistance disclosure: this report was prepared with AI assistance and manually checked against local source, sanitized logs, and read-only timing evidence.
Bug type
Crash (process/app exits or hangs)
Beta release blocker
No