[Bug]: Provider-qualified default model resolution eagerly builds alias index and can block gateway event loop ~80s #86635

## Bug type

Crash (process/app exits or hangs)

## Beta release blocker

No

## Summary

Provider-qualified default model resolution eagerly builds and normalizes the full configured-model alias index on the inbound reply hot path. With a 97-entry model catalog, this blocked the gateway event loop for ~80-85s before the embedded agent startup logs appeared.

## Steps to reproduce

1. Run OpenClaw from source at `87cd6b3e923fcb8a4869dc35e5b582103be85e51` / package version `2026.5.24-beta.1` on Linux with the gateway daemon.
2. Configure the default agent model as a provider-qualified primary model, for example:
   - `agents.defaults.model.primary = "openai/gpt-5.5"`
   - the routed agent inherits or uses the same provider-qualified default
3. Configure a large `agents.defaults.models` catalog. The observed case has 97 entries.
4. Send a normal inbound WhatsApp group message that does not contain a model directive, heartbeat override, or explicit model selection.
5. Observe that the gateway logs event-loop starvation before the embedded agent startup-stage trace appears.
6. Profile the same config with a read-only harness around the model-selection helpers. In the observed config:
   - `resolveDefaultModelForAgent({ cfg, agentId: "main" })`: `50154.7ms`
   - `buildModelAliasIndex({ cfg, defaultProvider: "openai" })`: `38774.5ms`
   - `buildConfiguredModelCatalog({ cfg })`: `0.7ms`
   - parsing the first 20 alias keys with plugin normalization enabled: `29418.9ms`
   - the same 20-key parse with plugin normalization disabled: `0.4ms`
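A read-only harness of the kind described in step 6 could look roughly like the sketch below. `timeSync` and `standInWork` are hypothetical names used only for illustration; in a real run the callback would wrap a helper such as `resolveDefaultModelForAgent({ cfg, agentId: "main" })`, whose import path is not shown here.

```typescript
// Hypothetical helper (not part of OpenClaw): times one synchronous call
// and returns the result together with the elapsed wall-clock milliseconds.
function timeSync<T>(label: string, fn: () => T): { label: string; result: T; ms: number } {
  const start = performance.now();
  const result = fn();
  const ms = performance.now() - start;
  console.log(`${label}: ${ms.toFixed(1)}ms`);
  return { label, result, ms };
}

// Stand-in workload so the sketch runs on its own. In a real harness this
// would be replaced by the model-selection helper under test.
function standInWork(iterations: number): number {
  let acc = 0;
  for (let i = 0; i < iterations; i++) acc += i % 7;
  return acc;
}

const sample = timeSync("standInWork", () => standInWork(1000000));
```

Because the helpers are synchronous, the measured time is also the time the event loop would be blocked if the same call ran inside the gateway.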

## Expected behavior

A normal inbound reply using an already provider-qualified default model should not synchronously build and normalize the full configured-model alias index before starting the agent run.

In this path, OpenClaw should resolve `openai/gpt-5.5` cheaply, only build alias data if an alias is actually needed, and avoid blocking the gateway event loop long enough to delay unrelated timers and channel health checks.
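As a minimal sketch of what "resolve cheaply" could mean, the hypothetical `parseProviderQualified` below splits a `provider/model` string on the first slash without touching alias data or plugin normalization. It is an illustration under assumptions, not OpenClaw's actual resolver: the real code may need to normalize provider IDs or check exact alias matches first.

```typescript
interface ModelRef {
  provider: string;
  model: string;
}

// Hypothetical fast path: returns a ref only when the value already has the
// shape "provider/model". No alias index or plugin normalization is involved.
function parseProviderQualified(raw: string): ModelRef | undefined {
  const trimmed = raw.trim();
  const slash = trimmed.indexOf("/");
  // Reject values with no slash, a leading slash, or a trailing slash.
  if (slash <= 0 || slash === trimmed.length - 1) return undefined;
  return { provider: trimmed.slice(0, slash), model: trimmed.slice(slash + 1) };
}

const qualified = parseProviderQualified("openai/gpt-5.5");
const unqualified = parseProviderQualified("gpt-5.5");
```

Only when this returns `undefined` would the caller need to fall back to alias lookup.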

## Actual behavior

The inbound reply path calls `resolveDefaultModel()` before it is known whether the message contains a model directive or override. That helper resolves the default model and also eagerly builds the full alias index.

Observed source path at `87cd6b3e923fcb8a4869dc35e5b582103be85e51`:

- `src/auto-reply/reply/get-reply.ts:252` calls `resolveDefaultModel({ cfg, agentId })` for every inbound reply.
- `src/auto-reply/reply/directive-handling.defaults.ts:13-22` calls both `resolveDefaultModelForAgent(...)` and `buildModelAliasIndex(...)`.
- `src/agents/model-selection-shared.ts:572-581` builds the alias index before checking whether the configured default model already contains a provider slash.
- `src/agents/model-selection-shared.ts:401-420` parses/normalizes each configured-model key before checking whether the entry actually has an `alias`.
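The ordering problem in the last bullet can be illustrated with a toy model. Everything in this sketch is a stand-in: `normalizeKey` replaces the expensive plugin-backed normalization with a counter, and the assumption that only 3 of the 97 entries define an `alias` is arbitrary, since the report does not state the real number.

```typescript
type ModelEntry = { alias?: string };
type Catalog = Record<string, ModelEntry>;

let normalizeCalls = 0;
// Stand-in for the expensive plugin-backed key normalization.
function normalizeKey(key: string): string {
  normalizeCalls++;
  return key.trim().toLowerCase();
}

// Order described above: normalize every key, then check for an alias.
function buildIndexEager(catalog: Catalog): Map<string, string> {
  const index = new Map<string, string>();
  for (const [key, entry] of Object.entries(catalog)) {
    const normalized = normalizeKey(key);
    const alias = entry.alias?.trim();
    if (!alias) continue;
    index.set(alias.toLowerCase(), normalized);
  }
  return index;
}

// Alternative order: check the alias first, normalize only aliased entries.
function buildIndexAliasFirst(catalog: Catalog): Map<string, string> {
  const index = new Map<string, string>();
  for (const [key, entry] of Object.entries(catalog)) {
    const alias = entry.alias?.trim();
    if (!alias) continue;
    index.set(alias.toLowerCase(), normalizeKey(key));
  }
  return index;
}

// 97 entries; only the first 3 define an alias (assumed ratio).
const catalog: Catalog = {};
for (let i = 0; i < 97; i++) {
  catalog[`provider-${i}/model-${i}`] = i < 3 ? { alias: `m${i}` } : {};
}

normalizeCalls = 0;
const eagerIndex = buildIndexEager(catalog);
const eagerCalls = normalizeCalls; // 97

normalizeCalls = 0;
const aliasFirstIndex = buildIndexAliasFirst(catalog);
const aliasFirstCalls = normalizeCalls; // 3
```

Both orderings produce the same index; only the number of normalization calls differs.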

This made the user-visible "pre-agent" delay much larger than the later embedded startup-stage trace suggested. The startup-stage trace accounted for about 18s, while the liveness/fetch-timeout logs showed the event loop had already been blocked for ~80-85s.

## OpenClaw version

`2026.5.24-beta.1` from source checkout commit `87cd6b3e923fcb8a4869dc35e5b582103be85e51`.

## Operating system

Ubuntu Linux, kernel `6.17.0-14-generic`, x86_64.

## Install method

Source checkout built into a local gateway daemon/runtime.

## Model

`openai/gpt-5.5`

## Provider / routing chain

OpenClaw gateway -> OpenAI provider, using a provider-qualified configured model ID. No model-router/proxy behavior is required to reproduce the model-resolution overhead.

## Additional provider/model setup details

The default model was already configured as `openai/gpt-5.5`, and the routed agent used the same effective default. The config also had 97 entries under `agents.defaults.models`.

The slow path appears tied to configured-model normalization and alias-index construction, not to the model provider call itself. The delay happens before useful agent execution starts.

## Logs, screenshots, and evidence

```shell
# Sanitized gateway evidence from a real inbound WhatsApp group reply.
# Private channel IDs, hostnames, and token-bearing URLs are intentionally omitted.

[diagnostic] liveness warning:
  reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayP99Ms=85765.1
  eventLoopDelayMaxMs=85765.1
  eventLoopUtilization=0.999
  active=1
  work=[active=agent:main:whatsapp:group:<redacted>(processing/embedded_run,q=1,age=115s last=embedded_run:started)]

[fetch-timeout] fetch timeout after 10000ms:
  elapsed=45329ms
  timer delayed=35329ms

[fetch-timeout] fetch timeout after 10000ms:
  elapsed=85268ms
  timer delayed=75268ms

[diagnostic] liveness warning:
  eventLoopDelayP99Ms=85228.3
  eventLoopUtilization=1
  active=1
  work=[active=agent:main:whatsapp:group:<redacted>(processing/embedded_run,q=1,age=114s last=embedded_run:started)]

[agent/embedded] [trace:embedded-run] startup stages:
  totalMs=18092
  stages=workspace:1ms,
    runtime-plugins:1ms,
    hooks:0ms,
    model-resolution:1236ms,
    auth:2033ms,
    context-engine:1ms,
    attempt-workspace:14817ms,
    attempt-prompt:0ms,
    attempt-runtime-plan:3ms,
    attempt-dispatch:0ms

# Read-only local timings against the same config:
resolveDefaultModelForAgent({ cfg, agentId: "main" }): 50154.7ms
buildModelAliasIndex({ cfg, defaultProvider: "openai" }): 38774.5ms
buildConfiguredModelCatalog({ cfg }): 0.7ms
parse first 20 alias keys with plugin normalization: 29418.9ms
parse first 20 alias keys with plugin normalization disabled: 0.4ms
```
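The `timer delayed` values in the `[fetch-timeout]` lines appear to be the observed elapsed time minus the configured timeout, which is how a blocked event loop shows up in timer-based code. The sketch below reproduces that arithmetic from the logged numbers; `timerDelayMs` is a hypothetical helper, and the formula is inferred from the log values rather than taken from the gateway source.

```typescript
// Hypothetical helper: how late a timeout fired, given its configured
// duration and the elapsed time actually observed when it ran.
function timerDelayMs(timeoutMs: number, elapsedMs: number): number {
  return Math.max(0, elapsedMs - timeoutMs);
}

// Values copied from the two fetch-timeout log entries above.
const observedTimeouts = [
  { timeoutMs: 10000, elapsedMs: 45329 },
  { timeoutMs: 10000, elapsedMs: 85268 },
];

const timerDelays = observedTimeouts.map((t) => timerDelayMs(t.timeoutMs, t.elapsedMs));
console.log(timerDelays); // [35329, 75268]
```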

## Impact and severity

Affected: gateway inbound reply handling, observed through WhatsApp group messages routed to an embedded agent.

Severity: high. This is an availability/performance issue: the gateway event loop was blocked long enough to delay timers, produce channel fetch timeouts, and make the agent appear silent before it had meaningfully started.

Frequency: observed on 2/2 inbound messages in this configuration while investigating the incident. The cost scales with configured-model catalog size and plugin/model normalization work.

Consequence: messages can sit for over a minute before visible agent progress, and other gateway/channel work can time out because the Node event loop is saturated.

## Additional information

This does not look like a WhatsApp-specific bug. WhatsApp made the symptom visible, but the hot path is shared model/default resolution before agent startup.

Related upstream context:

- #86552 (`perf(agents): reuse manifest metadata during model resolution`) overlaps with repeated manifest metadata loading and should reduce part of the cost, but this report covers an additional eager-work issue: alias-index construction and per-entry parsing happen even when the default model is already provider-qualified.
- #86372 (`perf(gateway): propagate config context in model normalization to avoid stale policy warning`) is related model-normalization context work, but it does not by itself address eager alias-index construction on inbound replies.
- #79899 covers `DefaultResourceLoader.reload()` / `attempt-workspace` blocking, which matches the later ~15s `attempt-workspace` slice in the startup trace, but it does not explain the earlier ~80s pre-agent event-loop delay.
- #86509 is a broader event-loop-starvation regression report; this issue is the narrower model-resolution/alias-index hot path with function-level timings.

Potential fix direction:

1. Split default-model resolution from alias-index construction on the inbound reply path. Return a lazy alias-index getter or only build aliases once a directive/override path actually needs aliases.
2. In `resolveConfiguredModelRef`, avoid building a full alias index before handling provider-qualified values like `openai/gpt-5.5`. If exact alias matching must remain ahead of provider-qualified parsing for compatibility, scan alias strings cheaply and only parse the matched entry.
3. In `buildModelAliasIndex`, read `entryRaw.alias` before parsing the model key, and skip normalization entirely for entries without aliases.
4. Reuse the manifest/plugin metadata context once per resolver call, consistent with the approach in #86552.
5. Add a regression test with a large `agents.defaults.models` catalog where most entries do not define `alias`, asserting that provider-qualified default resolution does not call plugin/manifest normalization for every catalog entry.
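Directions 1 and 5 can be sketched together: a memoized getter defers alias-index construction until something asks for it, and a build counter gives a regression check something to assert on. All names here (`lazy`, `buildAliasIndexStandIn`, `resolveDefaultModelLazy`, the `fast` alias) are hypothetical and the builder is a stand-in; this is one possible shape under those assumptions, not the existing API.

```typescript
// Hypothetical memoizing wrapper: runs the builder at most once, on first access.
function lazy<T>(build: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (!cached) cached = { value: build() };
    return cached.value;
  };
}

let indexBuilds = 0;
// Stand-in for the expensive alias-index construction.
function buildAliasIndexStandIn(): Map<string, string> {
  indexBuilds++;
  return new Map<string, string>([["fast", "openai/gpt-5.5"]]);
}

interface DefaultModelResult {
  provider: string;
  model: string;
  getAliasIndex: () => Map<string, string>;
}

function resolveDefaultModelLazy(primary: string, fallbackProvider: string): DefaultModelResult {
  const getAliasIndex = lazy(buildAliasIndexStandIn);
  // Unqualified values may be aliases, so only that branch consults the index.
  const ref = primary.includes("/")
    ? primary
    : (getAliasIndex().get(primary) ?? `${fallbackProvider}/${primary}`);
  const slash = ref.indexOf("/");
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1), getAliasIndex };
}

// Provider-qualified default: resolution alone triggers no index build.
const resolved = resolveDefaultModelLazy("openai/gpt-5.5", "openai");
const buildsAfterResolve = indexBuilds; // 0

// A directive/override path asks for aliases: one build, then cached.
resolved.getAliasIndex();
resolved.getAliasIndex();
const buildsAfterDirective = indexBuilds; // 1
```

A regression test along the lines of direction 5 could assert the equivalent of `buildsAfterResolve === 0` against a large catalog, with the real normalization hook replaced by a counting spy.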

Last known good version: `NOT_ENOUGH_INFO`

First known bad version: `NOT_ENOUGH_INFO`

AI assistance disclosure: this report was prepared with AI assistance and manually checked against local source, sanitized logs, and read-only timing evidence.

