[Bug]: Isolated agent runs stall in runtime-plugins phase before execution start — recurring across hourly crons on 2026.5.22, no named-plugin diagnostic

## Environment

- **OpenClaw**: 2026.5.22 (a374c3a)
- **Backend**: `cliBackends.claude-cli`
- **Host**: single-VPS Linux deployment, no container
- **Affected runs**: all use `sessionTarget: "isolated"` and `payload.kind: "agentTurn"`
- **Models seen across affected runs**: `claude-haiku-4-5`, `claude-sonnet-4-6`, `claude-opus-4-7` (via `anthropic/` and `claude-cli/` prefixes) — failure is not model-specific

## TL;DR

Cron-triggered isolated agent runs intermittently stall in the gateway's `runtime-plugins` phase, **before the model is invoked**. The failure terminates the run with a generic diagnostic that names only the phase, not the plugin that hung. Affected runs have no `model`, `provider`, or `usage` recorded — confirming the agent turn never started.

Pattern is recurring (not one-off) and post-2026.5.22-install: 4 distinct hourly crons have racked up 11–77 failed runs each over the 5 days since 2026-05-22. Other crons are hit 1–2 times. The same crons also have intermixed green runs in the same window, suggesting a race or external-dependency timeout rather than a deterministic config bug.

This is in the same family as #86239 (pre-turn dispatch failures) but a distinct error surface — that one throws `MissingAgentHarnessError`, this one names `runtime-plugins` as the stalled phase with no further detail.

## Symptom

Failed isolated-session runs end with this diagnostic in the cron run log:

```json
{
  "ts": <ts>,
  "action": "finished",
  "status": "error",
  "error": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
  "diagnostics": {
    "summary": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
    "entries": [
      {
        "ts": <ts>,
        "source": "cron-setup",
        "severity": "error",
        "message": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)"
      }
    ]
  },
  "runAtMs": <ts>,
  "durationMs": 78628,
  "nextRunAtMs": <ts>
}
```

Notable: **no `model`, `provider`, `usage`, or `sessionId` fields** on failed runs — these are present on every successful run. The agent turn was never reached.

`durationMs` on failures clusters around 78–80s, suggesting an internal phase timeout (single observed value to date — would appreciate confirmation of what that timeout is).

## Frequency (5-day window: 2026-05-22 → 2026-05-27)

Counted by grepping `"status":"error"` lines mentioning `runtime-plugins` / `stalled before execution` in each cron's run log under `<workspace>/cron/runs/`:

| Cron | Cadence | Failed runs | Most recent failure |
|---|---|---|---|
| Hourly cron A | hourly | 77 | 2026-05-27 12:23 UTC |
| Hourly cron B | hourly | 74 | 2026-05-27 14:15 UTC |
| Hourly cron C | hourly | 12 | recent |
| Hourly cron D | event-driven | 11 | recent |
| Misc daily/weekly crons | — | 1–2 each | various |

Crons A and B each see roughly one failure every other hour. Because the daily health-audit cron flags only "last run failed" (not "any run in the last 24h failed"), the operator-visible signal under-reports: the morning digest shows 5 failing crons, but the underlying run logs show ~180 total failed runs across the fleet in 5 days.

Affected crons span unrelated payload types, tool surfaces (some have `payload.toolsAllow` locks, some don't), and models — the only common factor is `sessionTarget: "isolated"`.

## Hypothesis (low-confidence, leave for upstream)

The `runtime-plugins` phase appears to load MCP plugins / runtime adapters for the isolated child session. Either:

1. A specific plugin's init is hanging (network call without timeout, file handle contention, mutex), and the phase has no per-plugin timeout — so the whole phase times out generically after ~78s with no indication of *which* plugin stalled.
2. There's a race in the plugin-loader's startup ordering when multiple isolated sessions spin up concurrently (hourly crons firing at `:00` of every hour would line up several at once).

Both are consistent with the observed mix of green and red runs on the same cron. I can't narrow further from the operator side because the diagnostic doesn't name a plugin.

## What's wrong

### 1. The diagnostic is unactionable

Knowing the phase failed isn't enough to fix anything. There are presumably N plugins/adapters loaded per isolated session — one of them is the culprit, but the operator (and downstream tooling) can't tell which. The minimum useful diagnostic would include:

- The list of plugins the phase was attempting to load.
- Which ones completed before the stall.
- The plugin that was in-flight when the phase timed out.

### 2. No per-plugin timeout (apparent)

If the phase aggregate timeout is ~78s and individual plugins have no timeouts of their own, a single slow plugin (e.g. an MCP server doing a network call without a timeout of its own) takes down the whole phase. A bounded per-plugin init budget would isolate the failure to the offending plugin and let the rest of the phase succeed.

### 3. No retry-on-phase-stall

A transient stall in plugin init kills the entire cron run — the operator's only recourse is to wait for the next scheduled run. A bounded retry on phase-stall (configurable, default off) would significantly reduce user-visible failures for transient races without papering over real bugs.

### 4. Under-reporting in cron status

The cron state machine records `lastRunStatus` only. A cron that fails 3 of 5 runs in an hour and succeeds on the 4th shows `lastRunStatus: ok` — and the daily health audit treats it as healthy. Adding a `recentRunFailureRate` (or similar) over a rolling window would surface chronic-but-self-recovering failures like this one without requiring operators to grep run logs by hand.

## Asks

1. **Name the failing plugin.** Add per-plugin diagnostics to the `runtime-plugins` phase so the in-flight plugin at stall time appears in the error. This is the single highest-value fix.
2. **Add per-plugin init timeouts.** Bound each plugin's init independently so a single slow plugin can't take down the whole phase.
3. **Consider retry-on-phase-stall.** Configurable, default off. Helps transient-race scenarios without masking real bugs.
4. **Expose a failure-rate metric on cron state.** `lastRunStatus` alone hides chronic flaky crons that intermittently self-recover.
5. **Confirm relationship to #86239.** Same family (failures before agent turn starts) but a distinct error surface. Likely separate root causes but worth a sanity check from someone with the gateway internals in head.

## Related

- **#86239** — `MissingAgentHarnessError` on inbound dispatch under event-loop starvation. Same family (pre-turn failure), different error string, different code path. Open.
- **#86227** — claude-cli harness lazy registration silent drop. Closed as fixed in 2026.5.22. This issue is on 2026.5.22 (a374c3a), so the harness fix is in — this is a different pre-turn failure mode.

---


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: Isolated agent runs stall in runtime-plugins phase before execution start — recurring across hourly crons on 2026.5.22, no named-plugin diagnostic #87327

Environment

TL;DR

Symptom

Frequency (5-day window: 2026-05-22 → 2026-05-27)

Hypothesis (low-confidence, leave for upstream)

What's wrong

1. The diagnostic is unactionable

2. No per-plugin timeout (apparent)

3. No retry-on-phase-stall

4. Under-reporting in cron status

Asks

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Cron	Cadence	Failed runs	Most recent failure
Hourly cron A	hourly	77	2026-05-27 12:23 UTC
Hourly cron B	hourly	74	2026-05-27 14:15 UTC
Hourly cron C	hourly	12	recent
Hourly cron D	event-driven	11	recent
Misc daily/weekly crons	—	1–2 each	various

Uh oh!

[Bug]: Isolated agent runs stall in runtime-plugins phase before execution start — recurring across hourly crons on 2026.5.22, no named-plugin diagnostic #87327

Description

Environment

TL;DR

Symptom

Frequency (5-day window: 2026-05-22 → 2026-05-27)

Hypothesis (low-confidence, leave for upstream)

What's wrong

1. The diagnostic is unactionable

2. No per-plugin timeout (apparent)

3. No retry-on-phase-stall

4. Under-reporting in cron status

Asks

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions