cron: allow retries for local model preflight by cthornsburg · Pull Request #82145 · openclaw/openclaw

cthornsburg · 2026-05-15T12:51:54Z

Summary

add configurable retries for isolated cron local-model provider preflight
expose cron.modelPreflight.timeoutMs, maxAttempts, and retryDelayMs
document the settings for sleeping/cold-starting Ollama, vLLM, SGLang, LM Studio, and other local providers
keep defaults unchanged: one 2500ms probe and no retry delay
enforce one shared 55s preflight deadline across the full fallback candidate chain

Why

Isolated cron jobs run a lightweight preflight before starting an agent turn for local/private model providers. Current upstream can advance from an unreachable local primary to configured fallback model candidates, but each local candidate still gets only one default 2.5s preflight probe.

That still leaves the cold-start/sleeping-provider case: a local endpoint can become healthy moments after the single probe fails. This PR keeps the fallback behavior intact while letting operators configure a short wake-up window before cron advances to a fallback or marks the run skipped.

The configured per-endpoint retry window is validated at or below 55s, and the runtime now shares one 55s deadline across the complete preflight candidate chain so multi-local fallback paths stay below cron's isolated-agent setup watchdog.

Review fix

After ClawSweeper noted that the original 55s cap applied per endpoint, this branch now:

computes one preflight deadline before walking candidates
passes that same deadline to every candidate preflight
clamps each local probe timeout and retry delay to the remaining shared budget
skips additional local endpoint probes once the shared budget is exhausted
still allows remote/cloud fallback candidates to be selected immediately

Supplemental verification

corepack pnpm install --frozen-lockfile
node scripts/run-vitest.mjs src/cron/isolated-agent/model-preflight.runtime.test.ts src/cron/isolated-agent.model-preflight.test.ts src/cron/isolated-agent/run-fallback-policy.test.ts src/config/config-misc.test.ts - 113 tests passed across focused config and cron shards
node scripts/run-vitest.mjs src/commands/agent-via-gateway.test.ts - 55 tests passed, covering the previously failing unrelated CI shard locally
pnpm lint --threads=8 - passed
pnpm exec oxfmt --check docs/cli/cron.md docs/gateway/configuration-reference.md docs/providers/ollama.md src/config/config-misc.test.ts src/config/schema.help.ts src/config/schema.labels.ts src/config/types.cron.ts src/config/zod-schema.ts src/cron/isolated-agent/model-preflight.runtime.test.ts src/cron/isolated-agent/model-preflight.runtime.ts src/cron/isolated-agent.model-preflight.test.ts src/cron/isolated-agent/run.ts - passed
git diff --check - passed
pnpm tsgo:core - passed

pnpm check currently passes all guards except the local npm-shrinkwrap.json is stale guard. The ready-for-review CI run on this branch previously passed the shrinkwrap check, so I did not include unrelated generated dependency metadata in this cron patch.

Real behavior proof

Behavior addressed: A sleeping or cold-starting local model provider can miss cron's default single 2.5s preflight probe even though it becomes reachable moments later. Configured retries should let that same local candidate wake before cron advances to a fallback or skips the run.

Real environment tested: macOS local OpenClaw source checkout rebased onto upstream main, Node v22.22.0, pnpm v11.2.2. A real loopback OpenAI-compatible HTTP endpoint was started after an 800ms delay. The actual preflightCronModelProvider runtime used its real network fetch and SSRF-guard path against http://127.0.0.1:<port>/v1/models.

Exact steps or command run after this patch: Ran a pnpm exec tsx -e live harness that started the local HTTP endpoint after 800ms and invoked preflightCronModelProvider twice: first with default settings, then with { timeoutMs: 500, maxAttempts: 2, retryDelayMs: 1000 }.

Evidence after fix: Copied live console output from the real delayed endpoint run:

{"label":"default-single-attempt","result":{"status":"unavailable","provider":"delayed","model":"delayed-local-model","baseUrl":"http://127.0.0.1:54323/v1","retryAfterMs":300000,"reason":"Agent cron job uses delayed/delayed-local-model but the local provider endpoint is not reachable at http://127.0.0.1:54323/v1. Skipping this cron run after 1 preflight attempt; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TypeError: fetch failed"},"elapsedMs":159,"serverHits":0}
{"label":"configured-retry-window","result":{"status":"available"},"elapsedMs":1019,"serverHits":1}

Observed result after fix: With defaults, the endpoint was reported unavailable before it woke, confirming the cold-start issue still exists. With the configured retry window, the first connection failed while the endpoint was offline, cron waited, the endpoint started, and the next probe succeeded with status: "available" and one observed /models request.

What was not tested: A separate real Ollama daemon was not stopped and restarted during this run. The proof used a real delayed local OpenAI-compatible endpoint to exercise the same production preflight runtime and network path deterministically.

clawsweeper · 2026-05-15T12:55:11Z

Codex review: needs maintainer review before merge. Reviewed June 8, 2026, 10:31 AM ET / 14:31 UTC.

Summary
The branch adds optional cron.modelPreflight timeout, attempt, and delay settings, applies them to isolated cron local-provider preflight with one shared candidate-chain deadline, and updates schema, docs, and tests.

PR surface: Source +132, Tests +238, Docs +16. Total +386 across 12 files.

Reproducibility: yes. at source level: current main and the latest release have a single 2500ms preflight probe and no retry-window config, so a provider that wakes shortly after that probe can still be missed. I did not run a live repro in this read-only review.

Review metrics: 1 noteworthy metric.

Cron config surface: 3 optional fields added. timeoutMs, maxAttempts, and retryDelayMs are operator-facing settings that affect setup latency and fallback timing before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

none.

Risk before merge

[P1] The branch adds three operator-facing cron config keys, so maintainers should explicitly accept the new config contract rather than treating this as a purely internal retry tweak.
[P2] Configured retries can intentionally keep isolated cron setup waiting longer before fallback or skip; the shared 55s chain budget bounds the availability risk, but it is still an operator-visible timing change.

Maintainer options:

Accept the bounded config surface (recommended)
Maintainers can accept the new optional cron preflight settings because defaults remain unchanged and the branch now enforces one shared 55s candidate-chain budget.
Hold for a no-knob design
If the project does not want another cron config surface, pause this PR and ask for a behavior-only design using existing scheduling or fallback policy.

Next step before merge

[P2] No narrow automated repair remains after the chain-budget fix; the remaining action is maintainer acceptance or rejection of the new cron config surface and normal merge gates.

Security
Cleared: The diff changes cron runtime, config schema/help/labels, docs, and tests only; it does not add dependency, secret, CI, install, or package execution surface.

Review details

Best possible solution:

Land the bounded retry window only if maintainers accept the new cron config surface, keeping defaults unchanged and preserving the shared-chain budget, schema validation, docs, and focused fallback coverage.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main and the latest release have a single 2500ms preflight probe and no retry-window config, so a provider that wakes shortly after that probe can still be missed. I did not run a live repro in this read-only review.

Is this the best way to solve the issue?

Yes, if maintainers accept the config surface: the PR preserves default behavior, validates the retry window, and now budgets retries across the whole fallback candidate chain instead of per endpoint.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 6d7eb9bb8483.

Label changes

Label changes:

add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

P2: This is a bounded cron/provider reliability improvement with config and availability implications but no emergency blast radius.
merge-risk: 🚨 compatibility: The PR adds new validated cron.modelPreflight keys that become part of the operator-facing config contract.
merge-risk: 🚨 availability: The retry window can deliberately keep isolated cron setup waiting longer before fallback or skip when local providers are unavailable.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.

Evidence reviewed

PR surface:

Source +132, Tests +238, Docs +16. Total +386 across 12 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	6	143	11	+132
Tests	3	239	1	+238
Docs	3	21	5	+16
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	12	403	17	+386

What I checked:

Repository policy read: Root and docs scoped review guidance were read; the config-surface compatibility guidance applies because the PR adds cron config keys and changes preflight timing. (AGENTS.md:1, 6d7eb9bb8483)
Actual PR diff scope: Using the PR merge base avoids counting newer unrelated main commits; the branch changes cron preflight/config/docs/tests only, with 12 files in the actual PR diff. (a02407e1d1fd)
Current main behavior: Current main has no cron.modelPreflight knobs and still performs a single local-provider endpoint probe before returning unavailable or cached unavailable. (src/cron/isolated-agent/model-preflight.runtime.ts:181, 6d7eb9bb8483)
Latest release behavior: The latest release tag v2026.6.1 also lacks the retry config and uses the single preflight probe path, so the requested behavior is not already shipped. (src/cron/isolated-agent/model-preflight.runtime.ts:181, 2e08f0f4221f)
PR runtime implementation: The PR head reads cron.modelPreflight, normalizes attempts/timeouts/delay, clamps each probe or sleep to the remaining deadline, and records attempts in unavailable diagnostics. (src/cron/isolated-agent/model-preflight.runtime.ts:234, a02407e1d1fd)
Shared chain budget: The current PR head computes one preflight deadline before walking candidates and passes that same deadline to each candidate preflight call. (src/cron/isolated-agent/run.ts:666, a02407e1d1fd)

Likely related people:

chen-zhang-cs-code: The merged fallback fix at fix(cron): preflight model fallbacks before skip #82887 introduced cron preflight candidate walking that this PR extends with retries. (role: introduced related fallback-chain behavior; confidence: high; commits: 7a381b807e23; files: src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-fallback-policy.ts, src/agents/model-fallback.ts)
osolmaz: The same fallback-chain commit lists Onur Solmaz as co-author, and the PR timeline shows assignment to osolmaz for this cron preflight area. (role: adjacent owner by co-authored fallback work and assignment signal; confidence: medium; commits: 7a381b807e23; files: src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-fallback-policy.ts)
Yzx: A recent current-main cron isolated-agent change touched src/cron/isolated-agent/run.ts, making this a useful routing candidate for adjacent cron setup behavior. (role: recent area contributor; confidence: medium; commits: 4780546c124d; files: src/cron/isolated-agent/run.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

cthornsburg · 2026-06-08T14:20:35Z

@clawsweeper re-review

Addressed the chain-budget finding in a02407e: the fallback walk now shares one 55s deadline, each probe/delay is clamped to the remaining budget, and multi-local fallback regression coverage was added.

clawsweeper · 2026-06-08T14:20:38Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27144221828
Updated: 2026-06-08T14:32:07.037Z

openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 15, 2026

clawsweeper Bot mentioned this pull request May 15, 2026

Cron model preflight skips entire run when local primary is unreachable, ignoring configured cloud fallbacks [AI] #79329

Closed

openclaw-barnacle Bot added size: M and removed size: S labels May 16, 2026

clawsweeper Bot mentioned this pull request May 16, 2026

Unify model routing, preflight, and fallback behavior across execution paths #82629

Closed

chen-zhang-cs-code mentioned this pull request May 17, 2026

fix(cron): preflight model fallbacks before skip #82887

Merged

osolmaz self-assigned this May 28, 2026

cthornsburg force-pushed the feat/cron-local-preflight-retry branch from b3d8d32 to 4aaae6b Compare June 8, 2026 13:11

openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 8, 2026

cthornsburg marked this pull request as ready for review June 8, 2026 13:21

Cleo Thornsburg added 3 commits June 8, 2026 09:10

cron: allow retries for local model preflight

54ba164

fix: bound cron model preflight retry window

c848bee

fix: share cron preflight budget across fallbacks

a02407e

cthornsburg force-pushed the feat/cron-local-preflight-retry branch from 4aaae6b to a02407e Compare June 8, 2026 14:19

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 8, 2026

clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cron: allow retries for local model preflight#82145

cron: allow retries for local model preflight#82145
cthornsburg wants to merge 3 commits into
openclaw:mainfrom
cthornsburg:feat/cron-local-preflight-retry

cthornsburg commented May 15, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 15, 2026 •

edited

Loading

Uh oh!

cthornsburg commented Jun 8, 2026

Uh oh!

clawsweeper Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

cthornsburg commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Review fix

Supplemental verification

Real behavior proof

Uh oh!

clawsweeper Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cthornsburg commented Jun 8, 2026

Uh oh!

clawsweeper Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cthornsburg commented May 15, 2026 •

edited

Loading

clawsweeper Bot commented May 15, 2026 •

edited

Loading

clawsweeper Bot commented Jun 8, 2026 •

edited

Loading