cron: allow retries for local model preflight#82145
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 8, 2026, 10:31 AM ET / 14:31 UTC. Summary PR surface: Source +132, Tests +238, Docs +16. Total +386 across 12 files. Reproducibility: yes. at source level: current main and the latest release have a single 2500ms preflight probe and no retry-window config, so a provider that wakes shortly after that probe can still be missed. I did not run a live repro in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the bounded retry window only if maintainers accept the new cron config surface, keeping defaults unchanged and preserving the shared-chain budget, schema validation, docs, and focused fallback coverage. Do we have a high-confidence way to reproduce the issue? Yes, at source level: current main and the latest release have a single 2500ms preflight probe and no retry-window config, so a provider that wakes shortly after that probe can still be missed. I did not run a live repro in this read-only review. Is this the best way to solve the issue? Yes, if maintainers accept the config surface: the PR preserves default behavior, validates the retry window, and now budgets retries across the whole fallback candidate chain instead of per endpoint. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 6d7eb9bb8483. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +132, Tests +238, Docs +16. Total +386 across 12 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
b3d8d32 to
4aaae6b
Compare
4aaae6b to
a02407e
Compare
|
@clawsweeper re-review Addressed the chain-budget finding in a02407e: the fallback walk now shares one 55s deadline, each probe/delay is clamped to the remaining budget, and multi-local fallback regression coverage was added. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
Summary
cron.modelPreflight.timeoutMs,maxAttempts, andretryDelayMsWhy
Isolated cron jobs run a lightweight preflight before starting an agent turn for local/private model providers. Current upstream can advance from an unreachable local primary to configured fallback model candidates, but each local candidate still gets only one default 2.5s preflight probe.
That still leaves the cold-start/sleeping-provider case: a local endpoint can become healthy moments after the single probe fails. This PR keeps the fallback behavior intact while letting operators configure a short wake-up window before cron advances to a fallback or marks the run skipped.
The configured per-endpoint retry window is validated at or below 55s, and the runtime now shares one 55s deadline across the complete preflight candidate chain so multi-local fallback paths stay below cron's isolated-agent setup watchdog.
Review fix
After ClawSweeper noted that the original 55s cap applied per endpoint, this branch now:
Supplemental verification
corepack pnpm install --frozen-lockfilenode scripts/run-vitest.mjs src/cron/isolated-agent/model-preflight.runtime.test.ts src/cron/isolated-agent.model-preflight.test.ts src/cron/isolated-agent/run-fallback-policy.test.ts src/config/config-misc.test.ts- 113 tests passed across focused config and cron shardsnode scripts/run-vitest.mjs src/commands/agent-via-gateway.test.ts- 55 tests passed, covering the previously failing unrelated CI shard locallypnpm lint --threads=8- passedpnpm exec oxfmt --check docs/cli/cron.md docs/gateway/configuration-reference.md docs/providers/ollama.md src/config/config-misc.test.ts src/config/schema.help.ts src/config/schema.labels.ts src/config/types.cron.ts src/config/zod-schema.ts src/cron/isolated-agent/model-preflight.runtime.test.ts src/cron/isolated-agent/model-preflight.runtime.ts src/cron/isolated-agent.model-preflight.test.ts src/cron/isolated-agent/run.ts- passedgit diff --check- passedpnpm tsgo:core- passedpnpm checkcurrently passes all guards except the localnpm-shrinkwrap.json is staleguard. The ready-for-review CI run on this branch previously passed the shrinkwrap check, so I did not include unrelated generated dependency metadata in this cron patch.Real behavior proof
Behavior addressed: A sleeping or cold-starting local model provider can miss cron's default single 2.5s preflight probe even though it becomes reachable moments later. Configured retries should let that same local candidate wake before cron advances to a fallback or skips the run.
Real environment tested: macOS local OpenClaw source checkout rebased onto upstream
main, Node v22.22.0, pnpm v11.2.2. A real loopback OpenAI-compatible HTTP endpoint was started after an 800ms delay. The actualpreflightCronModelProviderruntime used its real network fetch and SSRF-guard path againsthttp://127.0.0.1:<port>/v1/models.Exact steps or command run after this patch: Ran a
pnpm exec tsx -elive harness that started the local HTTP endpoint after 800ms and invokedpreflightCronModelProvidertwice: first with default settings, then with{ timeoutMs: 500, maxAttempts: 2, retryDelayMs: 1000 }.Evidence after fix: Copied live console output from the real delayed endpoint run:
{"label":"default-single-attempt","result":{"status":"unavailable","provider":"delayed","model":"delayed-local-model","baseUrl":"http://127.0.0.1:54323/v1","retryAfterMs":300000,"reason":"Agent cron job uses delayed/delayed-local-model but the local provider endpoint is not reachable at http://127.0.0.1:54323/v1. Skipping this cron run after 1 preflight attempt; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TypeError: fetch failed"},"elapsedMs":159,"serverHits":0} {"label":"configured-retry-window","result":{"status":"available"},"elapsedMs":1019,"serverHits":1}Observed result after fix: With defaults, the endpoint was reported unavailable before it woke, confirming the cold-start issue still exists. With the configured retry window, the first connection failed while the endpoint was offline, cron waited, the endpoint started, and the next probe succeeded with
status: "available"and one observed/modelsrequest.What was not tested: A separate real Ollama daemon was not stopped and restarted during this run. The proof used a real delayed local OpenAI-compatible endpoint to exercise the same production preflight runtime and network path deterministically.