You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Bug]: Isolated agent runs stall in runtime-plugins phase before execution start — recurring across hourly crons on 2026.5.22, no named-plugin diagnostic #87327
Affected runs: all use sessionTarget: "isolated" and payload.kind: "agentTurn"
Models seen across affected runs: claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-7 (via anthropic/ and claude-cli/ prefixes) — failure is not model-specific
TL;DR
Cron-triggered isolated agent runs intermittently stall in the gateway's runtime-plugins phase, before the model is invoked. The failure terminates the run with a generic diagnostic that names only the phase, not the plugin that hung. Affected runs have no model, provider, or usage recorded — confirming the agent turn never started.
Pattern is recurring (not one-off) and post-2026.5.22-install: 4 distinct hourly crons have racked up 11–77 failed runs each over the 5 days since 2026-05-22. Other crons are hit 1–2 times. The same crons also have intermixed green runs in the same window, suggesting a race or external-dependency timeout rather than a deterministic config bug.
This is in the same family as #86239 (pre-turn dispatch failures) but a distinct error surface — that one throws MissingAgentHarnessError, this one names runtime-plugins as the stalled phase with no further detail.
Symptom
Failed isolated-session runs end with this diagnostic in the cron run log:
Notable: no model, provider, usage, or sessionId fields on failed runs — these are present on every successful run. The agent turn was never reached.
durationMs on failures clusters around 78–80s, suggesting an internal phase timeout (single observed value to date — would appreciate confirmation of what that timeout is).
Frequency (5-day window: 2026-05-22 → 2026-05-27)
Counted by grepping "status":"error" lines mentioning runtime-plugins / stalled before execution in each cron's run log under <workspace>/cron/runs/:
Cron
Cadence
Failed runs
Most recent failure
Hourly cron A
hourly
77
2026-05-27 12:23 UTC
Hourly cron B
hourly
74
2026-05-27 14:15 UTC
Hourly cron C
hourly
12
recent
Hourly cron D
event-driven
11
recent
Misc daily/weekly crons
—
1–2 each
various
Crons A and B each see roughly one failure every other hour. Because the daily health-audit cron flags only "last run failed" (not "any run in the last 24h failed"), the operator-visible signal under-reports: the morning digest shows 5 failing crons, but the underlying run logs show ~180 total failed runs across the fleet in 5 days.
Affected crons span unrelated payload types, tool surfaces (some have payload.toolsAllow locks, some don't), and models — the only common factor is sessionTarget: "isolated".
Hypothesis (low-confidence, leave for upstream)
The runtime-plugins phase appears to load MCP plugins / runtime adapters for the isolated child session. Either:
A specific plugin's init is hanging (network call without timeout, file handle contention, mutex), and the phase has no per-plugin timeout — so the whole phase times out generically after ~78s with no indication of which plugin stalled.
There's a race in the plugin-loader's startup ordering when multiple isolated sessions spin up concurrently (hourly crons firing at :00 of every hour would line up several at once).
Both are consistent with the observed mix of green and red runs on the same cron. I can't narrow further from the operator side because the diagnostic doesn't name a plugin.
What's wrong
1. The diagnostic is unactionable
Knowing the phase failed isn't enough to fix anything. There are presumably N plugins/adapters loaded per isolated session — one of them is the culprit, but the operator (and downstream tooling) can't tell which. The minimum useful diagnostic would include:
The list of plugins the phase was attempting to load.
Which ones completed before the stall.
The plugin that was in-flight when the phase timed out.
2. No per-plugin timeout (apparent)
If the phase aggregate timeout is ~78s and individual plugins have no timeouts of their own, a single slow plugin (e.g. an MCP server doing a network call without a timeout of its own) takes down the whole phase. A bounded per-plugin init budget would isolate the failure to the offending plugin and let the rest of the phase succeed.
3. No retry-on-phase-stall
A transient stall in plugin init kills the entire cron run — the operator's only recourse is to wait for the next scheduled run. A bounded retry on phase-stall (configurable, default off) would significantly reduce user-visible failures for transient races without papering over real bugs.
4. Under-reporting in cron status
The cron state machine records lastRunStatus only. A cron that fails 3 of 5 runs in an hour and succeeds on the 4th shows lastRunStatus: ok — and the daily health audit treats it as healthy. Adding a recentRunFailureRate (or similar) over a rolling window would surface chronic-but-self-recovering failures like this one without requiring operators to grep run logs by hand.
Asks
Name the failing plugin. Add per-plugin diagnostics to the runtime-plugins phase so the in-flight plugin at stall time appears in the error. This is the single highest-value fix.
Add per-plugin init timeouts. Bound each plugin's init independently so a single slow plugin can't take down the whole phase.
Consider retry-on-phase-stall. Configurable, default off. Helps transient-race scenarios without masking real bugs.
Expose a failure-rate metric on cron state.lastRunStatus alone hides chronic flaky crons that intermittently self-recover.
Environment
cliBackends.claude-clisessionTarget: "isolated"andpayload.kind: "agentTurn"claude-haiku-4-5,claude-sonnet-4-6,claude-opus-4-7(viaanthropic/andclaude-cli/prefixes) — failure is not model-specificTL;DR
Cron-triggered isolated agent runs intermittently stall in the gateway's
runtime-pluginsphase, before the model is invoked. The failure terminates the run with a generic diagnostic that names only the phase, not the plugin that hung. Affected runs have nomodel,provider, orusagerecorded — confirming the agent turn never started.Pattern is recurring (not one-off) and post-2026.5.22-install: 4 distinct hourly crons have racked up 11–77 failed runs each over the 5 days since 2026-05-22. Other crons are hit 1–2 times. The same crons also have intermixed green runs in the same window, suggesting a race or external-dependency timeout rather than a deterministic config bug.
This is in the same family as #86239 (pre-turn dispatch failures) but a distinct error surface — that one throws
MissingAgentHarnessError, this one namesruntime-pluginsas the stalled phase with no further detail.Symptom
Failed isolated-session runs end with this diagnostic in the cron run log:
{ "ts": <ts>, "action": "finished", "status": "error", "error": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)", "diagnostics": { "summary": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)", "entries": [ { "ts": <ts>, "source": "cron-setup", "severity": "error", "message": "cron: isolated agent run stalled before execution start (last phase: runtime-plugins)" } ] }, "runAtMs": <ts>, "durationMs": 78628, "nextRunAtMs": <ts> }Notable: no
model,provider,usage, orsessionIdfields on failed runs — these are present on every successful run. The agent turn was never reached.durationMson failures clusters around 78–80s, suggesting an internal phase timeout (single observed value to date — would appreciate confirmation of what that timeout is).Frequency (5-day window: 2026-05-22 → 2026-05-27)
Counted by grepping
"status":"error"lines mentioningruntime-plugins/stalled before executionin each cron's run log under<workspace>/cron/runs/:Crons A and B each see roughly one failure every other hour. Because the daily health-audit cron flags only "last run failed" (not "any run in the last 24h failed"), the operator-visible signal under-reports: the morning digest shows 5 failing crons, but the underlying run logs show ~180 total failed runs across the fleet in 5 days.
Affected crons span unrelated payload types, tool surfaces (some have
payload.toolsAllowlocks, some don't), and models — the only common factor issessionTarget: "isolated".Hypothesis (low-confidence, leave for upstream)
The
runtime-pluginsphase appears to load MCP plugins / runtime adapters for the isolated child session. Either::00of every hour would line up several at once).Both are consistent with the observed mix of green and red runs on the same cron. I can't narrow further from the operator side because the diagnostic doesn't name a plugin.
What's wrong
1. The diagnostic is unactionable
Knowing the phase failed isn't enough to fix anything. There are presumably N plugins/adapters loaded per isolated session — one of them is the culprit, but the operator (and downstream tooling) can't tell which. The minimum useful diagnostic would include:
2. No per-plugin timeout (apparent)
If the phase aggregate timeout is ~78s and individual plugins have no timeouts of their own, a single slow plugin (e.g. an MCP server doing a network call without a timeout of its own) takes down the whole phase. A bounded per-plugin init budget would isolate the failure to the offending plugin and let the rest of the phase succeed.
3. No retry-on-phase-stall
A transient stall in plugin init kills the entire cron run — the operator's only recourse is to wait for the next scheduled run. A bounded retry on phase-stall (configurable, default off) would significantly reduce user-visible failures for transient races without papering over real bugs.
4. Under-reporting in cron status
The cron state machine records
lastRunStatusonly. A cron that fails 3 of 5 runs in an hour and succeeds on the 4th showslastRunStatus: ok— and the daily health audit treats it as healthy. Adding arecentRunFailureRate(or similar) over a rolling window would surface chronic-but-self-recovering failures like this one without requiring operators to grep run logs by hand.Asks
runtime-pluginsphase so the in-flight plugin at stall time appears in the error. This is the single highest-value fix.lastRunStatusalone hides chronic flaky crons that intermittently self-recover.Related
MissingAgentHarnessErroron inbound dispatch under event-loop starvation. Same family (pre-turn failure), different error string, different code path. Open.