Skip to content

[Bug]: Isolated agent runs stall in runtime-plugins phase before execution start — recurring across hourly crons on 2026.5.22, no named-plugin diagnostic #87327

@BryceMurray

Description

@BryceMurray

Environment

  • OpenClaw: 2026.5.22 (a374c3a)
  • Backend: cliBackends.claude-cli
  • Host: single-VPS Linux deployment, no container
  • Affected runs: all use sessionTarget: "isolated" and payload.kind: "agentTurn"
  • Models seen across affected runs: claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-7 (via anthropic/ and claude-cli/ prefixes) — failure is not model-specific

TL;DR

Cron-triggered isolated agent runs intermittently stall in the gateway's runtime-plugins phase, before the model is invoked. The failure terminates the run with a generic diagnostic that names only the phase, not the plugin that hung. Affected runs have no model, provider, or usage recorded — confirming the agent turn never started.

Pattern is recurring (not one-off) and post-2026.5.22-install: 4 distinct hourly crons have racked up 11–77 failed runs each over the 5 days since 2026-05-22. Other crons are hit 1–2 times. The same crons also have intermixed green runs in the same window, suggesting a race or external-dependency timeout rather than a deterministic config bug.

This is in the same family as #86239 (pre-turn dispatch failures) but a distinct error surface — that one throws MissingAgentHarnessError, this one names runtime-plugins as the stalled phase with no further detail.

Symptom

Failed isolated-session runs end with this diagnostic in the cron run log:

{
  "ts": <ts>,
  "action": "finished",
  "status": "error",
  "error": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
  "diagnostics": {
    "summary": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)",
    "entries": [
      {
        "ts": <ts>,
        "source": "cron-setup",
        "severity": "error",
        "message": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)"
      }
    ]
  },
  "runAtMs": <ts>,
  "durationMs": 78628,
  "nextRunAtMs": <ts>
}

Notable: no model, provider, usage, or sessionId fields on failed runs — these are present on every successful run. The agent turn was never reached.

durationMs on failures clusters around 78–80s, suggesting an internal phase timeout (single observed value to date — would appreciate confirmation of what that timeout is).

Frequency (5-day window: 2026-05-22 → 2026-05-27)

Counted by grepping "status":"error" lines mentioning runtime-plugins / stalled before execution in each cron's run log under <workspace>/cron/runs/:

Cron Cadence Failed runs Most recent failure
Hourly cron A hourly 77 2026-05-27 12:23 UTC
Hourly cron B hourly 74 2026-05-27 14:15 UTC
Hourly cron C hourly 12 recent
Hourly cron D event-driven 11 recent
Misc daily/weekly crons 1–2 each various

Crons A and B each see roughly one failure every other hour. Because the daily health-audit cron flags only "last run failed" (not "any run in the last 24h failed"), the operator-visible signal under-reports: the morning digest shows 5 failing crons, but the underlying run logs show ~180 total failed runs across the fleet in 5 days.

Affected crons span unrelated payload types, tool surfaces (some have payload.toolsAllow locks, some don't), and models — the only common factor is sessionTarget: "isolated".

Hypothesis (low-confidence, leave for upstream)

The runtime-plugins phase appears to load MCP plugins / runtime adapters for the isolated child session. Either:

  1. A specific plugin's init is hanging (network call without timeout, file handle contention, mutex), and the phase has no per-plugin timeout — so the whole phase times out generically after ~78s with no indication of which plugin stalled.
  2. There's a race in the plugin-loader's startup ordering when multiple isolated sessions spin up concurrently (hourly crons firing at :00 of every hour would line up several at once).

Both are consistent with the observed mix of green and red runs on the same cron. I can't narrow further from the operator side because the diagnostic doesn't name a plugin.

What's wrong

1. The diagnostic is unactionable

Knowing the phase failed isn't enough to fix anything. There are presumably N plugins/adapters loaded per isolated session — one of them is the culprit, but the operator (and downstream tooling) can't tell which. The minimum useful diagnostic would include:

  • The list of plugins the phase was attempting to load.
  • Which ones completed before the stall.
  • The plugin that was in-flight when the phase timed out.

2. No per-plugin timeout (apparent)

If the phase aggregate timeout is ~78s and individual plugins have no timeouts of their own, a single slow plugin (e.g. an MCP server doing a network call without a timeout of its own) takes down the whole phase. A bounded per-plugin init budget would isolate the failure to the offending plugin and let the rest of the phase succeed.

3. No retry-on-phase-stall

A transient stall in plugin init kills the entire cron run — the operator's only recourse is to wait for the next scheduled run. A bounded retry on phase-stall (configurable, default off) would significantly reduce user-visible failures for transient races without papering over real bugs.

4. Under-reporting in cron status

The cron state machine records lastRunStatus only. A cron that fails 3 of 5 runs in an hour and succeeds on the 4th shows lastRunStatus: ok — and the daily health audit treats it as healthy. Adding a recentRunFailureRate (or similar) over a rolling window would surface chronic-but-self-recovering failures like this one without requiring operators to grep run logs by hand.

Asks

  1. Name the failing plugin. Add per-plugin diagnostics to the runtime-plugins phase so the in-flight plugin at stall time appears in the error. This is the single highest-value fix.
  2. Add per-plugin init timeouts. Bound each plugin's init independently so a single slow plugin can't take down the whole phase.
  3. Consider retry-on-phase-stall. Configurable, default off. Helps transient-race scenarios without masking real bugs.
  4. Expose a failure-rate metric on cron state. lastRunStatus alone hides chronic flaky crons that intermittently self-recover.
  5. Confirm relationship to [Bug]: MissingAgentHarnessError on inbound dispatch under event-loop starvation — harness IS registered (2026.5.22, distinct root cause from #86227) #86239. Same family (failures before agent turn starts) but a distinct error surface. Likely separate root causes but worth a sanity check from someone with the gateway internals in head.

Related


Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions