Skip to content

cron: allow retries for local model preflight#82145

Open
cthornsburg wants to merge 3 commits into
openclaw:mainfrom
cthornsburg:feat/cron-local-preflight-retry
Open

cron: allow retries for local model preflight#82145
cthornsburg wants to merge 3 commits into
openclaw:mainfrom
cthornsburg:feat/cron-local-preflight-retry

Conversation

@cthornsburg

@cthornsburg cthornsburg commented May 15, 2026

Copy link
Copy Markdown

Summary

  • add configurable retries for isolated cron local-model provider preflight
  • expose cron.modelPreflight.timeoutMs, maxAttempts, and retryDelayMs
  • document the settings for sleeping/cold-starting Ollama, vLLM, SGLang, LM Studio, and other local providers
  • keep defaults unchanged: one 2500ms probe and no retry delay
  • enforce one shared 55s preflight deadline across the full fallback candidate chain

Why

Isolated cron jobs run a lightweight preflight before starting an agent turn for local/private model providers. Current upstream can advance from an unreachable local primary to configured fallback model candidates, but each local candidate still gets only one default 2.5s preflight probe.

That still leaves the cold-start/sleeping-provider case: a local endpoint can become healthy moments after the single probe fails. This PR keeps the fallback behavior intact while letting operators configure a short wake-up window before cron advances to a fallback or marks the run skipped.

The configured per-endpoint retry window is validated at or below 55s, and the runtime now shares one 55s deadline across the complete preflight candidate chain so multi-local fallback paths stay below cron's isolated-agent setup watchdog.

Review fix

After ClawSweeper noted that the original 55s cap applied per endpoint, this branch now:

  • computes one preflight deadline before walking candidates
  • passes that same deadline to every candidate preflight
  • clamps each local probe timeout and retry delay to the remaining shared budget
  • skips additional local endpoint probes once the shared budget is exhausted
  • still allows remote/cloud fallback candidates to be selected immediately

Supplemental verification

  • corepack pnpm install --frozen-lockfile
  • node scripts/run-vitest.mjs src/cron/isolated-agent/model-preflight.runtime.test.ts src/cron/isolated-agent.model-preflight.test.ts src/cron/isolated-agent/run-fallback-policy.test.ts src/config/config-misc.test.ts - 113 tests passed across focused config and cron shards
  • node scripts/run-vitest.mjs src/commands/agent-via-gateway.test.ts - 55 tests passed, covering the previously failing unrelated CI shard locally
  • pnpm lint --threads=8 - passed
  • pnpm exec oxfmt --check docs/cli/cron.md docs/gateway/configuration-reference.md docs/providers/ollama.md src/config/config-misc.test.ts src/config/schema.help.ts src/config/schema.labels.ts src/config/types.cron.ts src/config/zod-schema.ts src/cron/isolated-agent/model-preflight.runtime.test.ts src/cron/isolated-agent/model-preflight.runtime.ts src/cron/isolated-agent.model-preflight.test.ts src/cron/isolated-agent/run.ts - passed
  • git diff --check - passed
  • pnpm tsgo:core - passed

pnpm check currently passes all guards except the local npm-shrinkwrap.json is stale guard. The ready-for-review CI run on this branch previously passed the shrinkwrap check, so I did not include unrelated generated dependency metadata in this cron patch.

Real behavior proof

Behavior addressed: A sleeping or cold-starting local model provider can miss cron's default single 2.5s preflight probe even though it becomes reachable moments later. Configured retries should let that same local candidate wake before cron advances to a fallback or skips the run.

Real environment tested: macOS local OpenClaw source checkout rebased onto upstream main, Node v22.22.0, pnpm v11.2.2. A real loopback OpenAI-compatible HTTP endpoint was started after an 800ms delay. The actual preflightCronModelProvider runtime used its real network fetch and SSRF-guard path against http://127.0.0.1:<port>/v1/models.

Exact steps or command run after this patch: Ran a pnpm exec tsx -e live harness that started the local HTTP endpoint after 800ms and invoked preflightCronModelProvider twice: first with default settings, then with { timeoutMs: 500, maxAttempts: 2, retryDelayMs: 1000 }.

Evidence after fix: Copied live console output from the real delayed endpoint run:

{"label":"default-single-attempt","result":{"status":"unavailable","provider":"delayed","model":"delayed-local-model","baseUrl":"http://127.0.0.1:54323/v1","retryAfterMs":300000,"reason":"Agent cron job uses delayed/delayed-local-model but the local provider endpoint is not reachable at http://127.0.0.1:54323/v1. Skipping this cron run after 1 preflight attempt; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TypeError: fetch failed"},"elapsedMs":159,"serverHits":0}
{"label":"configured-retry-window","result":{"status":"available"},"elapsedMs":1019,"serverHits":1}

Observed result after fix: With defaults, the endpoint was reported unavailable before it woke, confirming the cold-start issue still exists. With the configured retry window, the first connection failed while the endpoint was offline, cron waited, the endpoint started, and the next probe succeeded with status: "available" and one observed /models request.

What was not tested: A separate real Ollama daemon was not stopped and restarted during this run. The proof used a real delayed local OpenAI-compatible endpoint to exercise the same production preflight runtime and network path deterministically.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 15, 2026
@clawsweeper

clawsweeper Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 8, 2026, 10:31 AM ET / 14:31 UTC.

Summary
The branch adds optional cron.modelPreflight timeout, attempt, and delay settings, applies them to isolated cron local-provider preflight with one shared candidate-chain deadline, and updates schema, docs, and tests.

PR surface: Source +132, Tests +238, Docs +16. Total +386 across 12 files.

Reproducibility: yes. at source level: current main and the latest release have a single 2500ms preflight probe and no retry-window config, so a provider that wakes shortly after that probe can still be missed. I did not run a live repro in this read-only review.

Review metrics: 1 noteworthy metric.

  • Cron config surface: 3 optional fields added. timeoutMs, maxAttempts, and retryDelayMs are operator-facing settings that affect setup latency and fallback timing before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] The branch adds three operator-facing cron config keys, so maintainers should explicitly accept the new config contract rather than treating this as a purely internal retry tweak.
  • [P2] Configured retries can intentionally keep isolated cron setup waiting longer before fallback or skip; the shared 55s chain budget bounds the availability risk, but it is still an operator-visible timing change.

Maintainer options:

  1. Accept the bounded config surface (recommended)
    Maintainers can accept the new optional cron preflight settings because defaults remain unchanged and the branch now enforces one shared 55s candidate-chain budget.
  2. Hold for a no-knob design
    If the project does not want another cron config surface, pause this PR and ask for a behavior-only design using existing scheduling or fallback policy.

Next step before merge

  • [P2] No narrow automated repair remains after the chain-budget fix; the remaining action is maintainer acceptance or rejection of the new cron config surface and normal merge gates.

Security
Cleared: The diff changes cron runtime, config schema/help/labels, docs, and tests only; it does not add dependency, secret, CI, install, or package execution surface.

Review details

Best possible solution:

Land the bounded retry window only if maintainers accept the new cron config surface, keeping defaults unchanged and preserving the shared-chain budget, schema validation, docs, and focused fallback coverage.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main and the latest release have a single 2500ms preflight probe and no retry-window config, so a provider that wakes shortly after that probe can still be missed. I did not run a live repro in this read-only review.

Is this the best way to solve the issue?

Yes, if maintainers accept the config surface: the PR preserves default behavior, validates the retry window, and now budgets retries across the whole fallback candidate chain instead of per endpoint.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 6d7eb9bb8483.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: This is a bounded cron/provider reliability improvement with config and availability implications but no emergency blast radius.
  • merge-risk: 🚨 compatibility: The PR adds new validated cron.modelPreflight keys that become part of the operator-facing config contract.
  • merge-risk: 🚨 availability: The retry window can deliberately keep isolated cron setup waiting longer before fallback or skip when local providers are unavailable.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix live console output from a real delayed loopback endpoint using the production preflight runtime and SSRF-guard fetch path.
Evidence reviewed

PR surface:

Source +132, Tests +238, Docs +16. Total +386 across 12 files.

View PR surface stats
Area Files Added Removed Net
Source 6 143 11 +132
Tests 3 239 1 +238
Docs 3 21 5 +16
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 12 403 17 +386

What I checked:

  • Repository policy read: Root and docs scoped review guidance were read; the config-surface compatibility guidance applies because the PR adds cron config keys and changes preflight timing. (AGENTS.md:1, 6d7eb9bb8483)
  • Actual PR diff scope: Using the PR merge base avoids counting newer unrelated main commits; the branch changes cron preflight/config/docs/tests only, with 12 files in the actual PR diff. (a02407e1d1fd)
  • Current main behavior: Current main has no cron.modelPreflight knobs and still performs a single local-provider endpoint probe before returning unavailable or cached unavailable. (src/cron/isolated-agent/model-preflight.runtime.ts:181, 6d7eb9bb8483)
  • Latest release behavior: The latest release tag v2026.6.1 also lacks the retry config and uses the single preflight probe path, so the requested behavior is not already shipped. (src/cron/isolated-agent/model-preflight.runtime.ts:181, 2e08f0f4221f)
  • PR runtime implementation: The PR head reads cron.modelPreflight, normalizes attempts/timeouts/delay, clamps each probe or sleep to the remaining deadline, and records attempts in unavailable diagnostics. (src/cron/isolated-agent/model-preflight.runtime.ts:234, a02407e1d1fd)
  • Shared chain budget: The current PR head computes one preflight deadline before walking candidates and passes that same deadline to each candidate preflight call. (src/cron/isolated-agent/run.ts:666, a02407e1d1fd)

Likely related people:

  • chen-zhang-cs-code: The merged fallback fix at fix(cron): preflight model fallbacks before skip #82887 introduced cron preflight candidate walking that this PR extends with retries. (role: introduced related fallback-chain behavior; confidence: high; commits: 7a381b807e23; files: src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-fallback-policy.ts, src/agents/model-fallback.ts)
  • osolmaz: The same fallback-chain commit lists Onur Solmaz as co-author, and the PR timeline shows assignment to osolmaz for this cron preflight area. (role: adjacent owner by co-authored fallback work and assignment signal; confidence: medium; commits: 7a381b807e23; files: src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-fallback-policy.ts)
  • Yzx: A recent current-main cron isolated-agent change touched src/cron/isolated-agent/run.ts, making this a useful routing candidate for adjacent cron setup behavior. (role: recent area contributor; confidence: medium; commits: 4780546c124d; files: src/cron/isolated-agent/run.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@osolmaz osolmaz self-assigned this May 28, 2026
@cthornsburg cthornsburg force-pushed the feat/cron-local-preflight-retry branch from b3d8d32 to 4aaae6b Compare June 8, 2026 13:11
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 8, 2026
@cthornsburg cthornsburg marked this pull request as ready for review June 8, 2026 13:21
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 8, 2026
@cthornsburg cthornsburg force-pushed the feat/cron-local-preflight-retry branch from 4aaae6b to a02407e Compare June 8, 2026 14:19
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label Jun 8, 2026
@cthornsburg

Copy link
Copy Markdown
Author

@clawsweeper re-review

Addressed the chain-budget finding in a02407e: the fallback walk now shares one 55s deadline, each probe/delay is clamped to the remaining budget, and multi-local fallback regression coverage was added.

@clawsweeper

clawsweeper Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels Jun 8, 2026
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation gateway Gateway runtime merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants