Commit a02407e

Cleo Thornsburg authored and committed
fix: share cron preflight budget across fallbacks
1 parent c848bee commit a02407e

10 files changed

Lines changed: 191 additions & 17 deletions

docs/cli/cron.md

Lines changed: 1 addition & 1 deletion
@@ -141,7 +141,7 @@ Recurring jobs use exponential retry backoff after consecutive errors: 30s, 1m,
 
 Skipped runs are tracked separately from execution errors. They do not affect retry backoff, but `openclaw cron edit <job-id> --failure-alert-include-skipped` can opt failure alerts into repeated skipped-run notifications.
 
-For isolated jobs that target a local configured model provider, cron runs a lightweight provider preflight before starting the agent turn. Loopback, private-network, and `.local` `api: "ollama"` providers are probed at `/api/tags`; local OpenAI-compatible providers such as vLLM, SGLang, and LM Studio are probed at `/models`. If an endpoint is unreachable after the configured attempts, cron advances to the next configured model fallback. The run is recorded as `skipped` and retried on a later schedule only when no candidate is reachable. Matching dead endpoints are cached for 5 minutes to avoid many jobs hammering the same local server. Tune `cron.modelPreflight.timeoutMs`, `cron.modelPreflight.maxAttempts`, and `cron.modelPreflight.retryDelayMs` when a sleeping local/LAN provider needs a short wake-up window before cron advances to a fallback or gives up. The worst-case preflight window is limited to 55s so it stays below cron's isolated-agent setup watchdog.
+For isolated jobs that target a local configured model provider, cron runs a lightweight provider preflight before starting the agent turn. Loopback, private-network, and `.local` `api: "ollama"` providers are probed at `/api/tags`; local OpenAI-compatible providers such as vLLM, SGLang, and LM Studio are probed at `/models`. If an endpoint is unreachable after the configured attempts, cron advances to the next configured model fallback. The run is recorded as `skipped` and retried on a later schedule only when no candidate is reachable. Matching dead endpoints are cached for 5 minutes to avoid many jobs hammering the same local server. Tune `cron.modelPreflight.timeoutMs`, `cron.modelPreflight.maxAttempts`, and `cron.modelPreflight.retryDelayMs` when a sleeping local/LAN provider needs a short wake-up window before cron advances to a fallback or gives up. The entire candidate-chain preflight is limited to 55s, and each probe or delay is clamped to the remaining budget so setup stays below cron's isolated-agent watchdog.
 
 Note: cron jobs, pending runtime state, and run history live in the shared SQLite state database. Legacy `jobs.json`, `jobs-state.json`, and `runs/*.jsonl` files are imported once and renamed with a `.migrated` suffix. After import, edit schedules with `openclaw cron add|edit|remove` instead of editing JSON files.

docs/gateway/configuration-reference.md

Lines changed: 2 additions & 2 deletions
@@ -1296,7 +1296,7 @@ Current builds no longer include the TCP bridge. Nodes connect over the Gateway
       timeoutMs: 2500, // default per-attempt timeout
       maxAttempts: 1, // default probe attempts before skipped
       retryDelayMs: 0, // default delay between attempts
-      // worst-case window is limited to 55s so preflight stays below cron's setup watchdog
+      // the full candidate chain is limited to 55s by cron's setup watchdog budget
     },
   },
 }
@@ -1305,7 +1305,7 @@ Current builds no longer include the TCP bridge. Nodes connect over the Gateway
 - `sessionRetention`: how long to keep completed isolated cron run sessions before pruning from `sessions.json`. Also controls cleanup of archived deleted cron transcripts. Default: `24h`; set `false` to disable.
 - `runLog.maxBytes`: accepted for compatibility with older file-backed cron run logs. Default: `2_000_000` bytes.
 - `runLog.keepLines`: newest SQLite run-history rows retained per job. Default: `2000`.
-- `modelPreflight`: local model-provider preflight controls for isolated cron agent turns. Increase `maxAttempts`, `retryDelayMs`, or `timeoutMs` when a sleeping Ollama/vLLM/LM Studio host needs a short wake-up window before cron advances to a configured fallback or marks the run skipped. The worst-case window (`timeoutMs * maxAttempts + retryDelayMs * (maxAttempts - 1)`) must stay at or below 55s so preflight remains below cron's isolated-agent setup watchdog.
+- `modelPreflight`: local model-provider preflight controls for isolated cron agent turns. Increase `maxAttempts`, `retryDelayMs`, or `timeoutMs` when a sleeping Ollama/vLLM/LM Studio host needs a short wake-up window before cron advances to a configured fallback or marks the run skipped. The configured per-endpoint window (`timeoutMs * maxAttempts + retryDelayMs * (maxAttempts - 1)`) must stay at or below 55s. Cron also shares one 55s deadline across the complete candidate chain and clamps each probe or delay to the remaining budget.
 - `webhookToken`: bearer token used for cron webhook POST delivery (`delivery.mode = "webhook"`), if omitted no auth header is sent.
 - `webhook`: deprecated legacy fallback webhook URL (http/https) used by `openclaw doctor --fix` to migrate stored jobs that still have `notify: true`; runtime delivery uses per-job `delivery.mode="webhook"` plus `delivery.to`, or `delivery.completionDestination` when preserving announce delivery.

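The per-endpoint window formula quoted in the `modelPreflight` entry above can be checked by hand before editing the config. The following is a minimal sketch, not the project's validator: the interface, function names, and sample settings are hypothetical, and only the formula, the defaults (`2500`, `1`, `0`), and the 55s cap are taken from the documentation in this diff.

```typescript
// Hypothetical helper for illustration; not part of the OpenClaw codebase.
interface PreflightSettings {
  timeoutMs: number;
  maxAttempts: number;
  retryDelayMs: number;
}

// The 55s cap stated in the docs above.
const MAX_ENDPOINT_WINDOW_MS = 55_000;

// timeoutMs * maxAttempts + retryDelayMs * (maxAttempts - 1)
function endpointWindowMs(s: PreflightSettings): number {
  return s.timeoutMs * s.maxAttempts + s.retryDelayMs * (s.maxAttempts - 1);
}

function fitsEndpointWindow(s: PreflightSettings): boolean {
  return endpointWindowMs(s) <= MAX_ENDPOINT_WINDOW_MS;
}

// Documented defaults, plus two made-up tuning examples.
const defaults: PreflightSettings = { timeoutMs: 2500, maxAttempts: 1, retryDelayMs: 0 };
const wakeUp: PreflightSettings = { timeoutMs: 5000, maxAttempts: 5, retryDelayMs: 2000 };
const tooLong: PreflightSettings = { timeoutMs: 10000, maxAttempts: 5, retryDelayMs: 2000 };

console.log(endpointWindowMs(defaults)); // 2500
console.log(endpointWindowMs(wakeUp)); // 33000
console.log(fitsEndpointWindow(tooLong)); // false (58000 > 55000)
```

Note that passing this check only bounds a single endpoint; per the updated docs, the whole fallback chain still shares one 55s deadline at runtime.
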
docs/providers/ollama.md

Lines changed: 3 additions & 2 deletions
@@ -270,8 +270,9 @@ configure
 `cron.modelPreflight.maxAttempts`, `cron.modelPreflight.retryDelayMs`, and/or
 `cron.modelPreflight.timeoutMs` to give it a short wake-up window before cron
 advances to a fallback or marks the run skipped. Keep the worst-case window at
-or below 55s; OpenClaw validates this so local-provider preflight stays below
-cron's isolated-agent setup watchdog.
+or below 55s; OpenClaw validates the configured endpoint window and enforces one
+shared 55s deadline across the full fallback candidate chain so local-provider
+preflight stays below cron's isolated-agent setup watchdog.
 
 Live-verify the local text path, native stream path, and embeddings against
 local Ollama with:

src/config/config-misc.test.ts

Lines changed: 3 additions & 1 deletion
@@ -1084,7 +1084,9 @@ describe("cron webhook schema", () => {
     if (res.success) {
       throw new Error("expected cron.modelPreflight retry window validation to fail");
     }
-    expect(res.error.issues[0]?.message).toContain("total retry window must be <= 55000ms");
+    expect(res.error.issues[0]?.message).toContain(
+      "total retry window must be <= 55000ms per endpoint",
+    );
   });
 });

src/config/schema.help.ts

Lines changed: 1 addition & 1 deletion
@@ -1717,7 +1717,7 @@ export const FIELD_HELP: Record<string, string> = {
   "cron.runLog.keepLines":
     "How many trailing run-history rows to retain per cron job (default `2000`). Increase for longer forensic history or lower for smaller disks.",
   "cron.modelPreflight":
-    "Controls the lightweight local model-provider preflight used before isolated cron agent turns. Tune this when local or LAN providers such as Ollama need a few seconds to wake before /api/tags or /models responds. The total retry window is capped at 55s to stay below cron's setup watchdog.",
+    "Controls the lightweight local model-provider preflight used before isolated cron agent turns. Tune this when local or LAN providers such as Ollama need a few seconds to wake before /api/tags or /models responds. Each configured endpoint window and the complete fallback candidate chain are capped at 55s to stay below cron's setup watchdog.",
   "cron.modelPreflight.timeoutMs":
     "Per-attempt timeout in milliseconds for local model-provider preflight probes (default: 2500). Increase for slow LAN or cold-starting providers while keeping the total retry window <= 55s.",
   "cron.modelPreflight.maxAttempts":

src/config/zod-schema.ts

Lines changed: 1 addition & 1 deletion
@@ -865,7 +865,7 @@ export const OpenClawSchema = z
         code: z.ZodIssueCode.custom,
         message:
           `cron.modelPreflight total retry window must be <= ${CRON_MODEL_PREFLIGHT_MAX_TOTAL_WINDOW_MS}ms ` +
-          `so local-provider preflight stays below the cron agent setup watchdog; got ${totalWindowMs}ms.`,
+          `per endpoint so it fits within the cron agent setup budget; got ${totalWindowMs}ms.`,
       });
     }
   })

src/cron/isolated-agent.model-preflight.test.ts

Lines changed: 83 additions & 0 deletions
@@ -168,6 +168,89 @@ describe("runCronIsolatedAgentTurn model provider preflight", () => {
     expect(String(logWarnMock.mock.calls[0]?.[0] ?? "")).not.toContain("Skipping this cron run");
   });
 
+  it("shares one preflight deadline across multiple local fallback candidates", async () => {
+    mockRunCronFallbackPassthrough();
+    preflightCronModelProviderMock
+      .mockResolvedValueOnce({
+        status: "unavailable",
+        reason: "first local provider unavailable",
+        provider: "ollama",
+        model: "qwen3:32b",
+        baseUrl: "http://127.0.0.1:11434",
+        retryAfterMs: 300000,
+      })
+      .mockResolvedValueOnce({
+        status: "unavailable",
+        reason: "second local provider unavailable",
+        provider: "vllm",
+        model: "local-fallback",
+        baseUrl: "http://127.0.0.1:8000/v1",
+        retryAfterMs: 300000,
+      })
+      .mockResolvedValueOnce({ status: "available" });
+
+    const result = await runCronIsolatedAgentTurn({
+      cfg: {
+        agents: {
+          defaults: {
+            model: {
+              primary: "ollama/qwen3:32b",
+              fallbacks: ["vllm/local-fallback", "openrouter/cloud-fallback"],
+            },
+          },
+        },
+        models: {
+          providers: {
+            ollama: {
+              api: "ollama",
+              baseUrl: "http://127.0.0.1:11434",
+              models: [],
+            },
+            vllm: {
+              api: "openai-completions",
+              baseUrl: "http://127.0.0.1:8000/v1",
+              models: [],
+            },
+            openrouter: {
+              api: "openai-completions",
+              baseUrl: "https://openrouter.ai/api/v1",
+              models: [],
+            },
+          },
+        },
+      },
+      deps: {} as never,
+      job: {
+        id: "shared-preflight-budget",
+        name: "Shared Preflight Budget",
+        enabled: true,
+        createdAtMs: 0,
+        updatedAtMs: 0,
+        schedule: { kind: "cron", expr: "*/5 * * * *", tz: "UTC" },
+        sessionTarget: "isolated",
+        state: {},
+        wakeMode: "next-heartbeat",
+        payload: { kind: "agentTurn", message: "summarize" },
+        delivery: { mode: "none" },
+      },
+      message: "summarize",
+      sessionKey: "cron:shared-preflight-budget",
+      lane: "cron",
+    });
+
+    expect(result.status).toBe("ok");
+    expect(result.provider).toBe("openrouter");
+    const preflightCalls = preflightCronModelProviderMock.mock.calls.map((call) => call[0]);
+    expect(preflightCalls).toMatchObject([
+      { provider: "ollama", model: "qwen3:32b" },
+      { provider: "vllm", model: "local-fallback" },
+      { provider: "openrouter", model: "openrouter/cloud-fallback" },
+    ]);
+    const deadlines = preflightCalls.map((call) => call.deadlineMs);
+    expect(deadlines.every((deadline) => typeof deadline === "number")).toBe(true);
+    expect(new Set(deadlines).size).toBe(1);
+  });
+
   it("keeps explicit empty payload fallbacks strict when local primary preflight fails", async () => {
     preflightCronModelProviderMock.mockResolvedValueOnce({
       status: "unavailable",

src/cron/isolated-agent/model-preflight.runtime.test.ts

Lines changed: 54 additions & 1 deletion
@@ -1,5 +1,5 @@
 // Runtime model preflight tests cover provider/model checks before cron execution.
-import { beforeEach, describe, expect, it, vi } from "vitest";
+import { afterEach, beforeEach, describe, expect, it, vi } from "vitest";
 
 const { fetchWithSsrFGuardMock } = vi.hoisted(() => ({
   fetchWithSsrFGuardMock: vi.fn(),
@@ -41,6 +41,10 @@ describe("preflightCronModelProvider", () => {
     resetCronModelProviderPreflightCacheForTest();
   });
 
+  afterEach(() => {
+    vi.useRealTimers();
+  });
+
   it("skips network checks for cloud provider URLs", async () => {
     const result = await preflightCronModelProvider({
       cfg: {
@@ -206,6 +210,55 @@ describe("preflightCronModelProvider", () => {
     expect(fetchWithSsrFGuardMock).toHaveBeenCalledTimes(2);
   });
 
+  it("does not probe a second local candidate after the shared chain deadline expires", async () => {
+    vi.useFakeTimers();
+    vi.setSystemTime(1_000);
+    fetchWithSsrFGuardMock.mockRejectedValueOnce(new Error("first endpoint unavailable"));
+
+    const cfg = {
+      models: {
+        providers: {
+          first: {
+            api: "openai-completions" as const,
+            baseUrl: "http://127.0.0.1:18001/v1",
+            models: [],
+          },
+          second: {
+            api: "openai-completions" as const,
+            baseUrl: "http://127.0.0.1:18002/v1",
+            models: [],
+          },
+        },
+      },
+    };
+    const deadlineMs = 1_200;
+
+    const first = await preflightCronModelProvider({
+      cfg,
+      provider: "first",
+      model: "local-one",
+      deadlineMs,
+    });
+    expect(first.status).toBe("unavailable");
+    expect(fetchWithSsrFGuardMock).toHaveBeenCalledTimes(1);
+    expect(requireFetchPreflightRequest().timeoutMs).toBe(200);
+
+    vi.setSystemTime(deadlineMs);
+    const second = await preflightCronModelProvider({
+      cfg,
+      provider: "second",
+      model: "local-two",
+      deadlineMs,
+    });
+
+    expect(second.status).toBe("unavailable");
+    if (second.status !== "unavailable") {
+      throw new Error(`expected second preflight unavailable, got ${second.status}`);
+    }
+    expect(second.reason).toContain("chain budget exhausted");
+    expect(fetchWithSsrFGuardMock).toHaveBeenCalledTimes(1);
+  });
+
   it("retries an unavailable endpoint after the cache ttl", async () => {
     fetchWithSsrFGuardMock.mockRejectedValueOnce(new Error("ECONNREFUSED")).mockResolvedValueOnce({
       response: { status: 200 },

src/cron/isolated-agent/model-preflight.runtime.ts

Lines changed: 40 additions & 8 deletions
@@ -181,7 +181,13 @@ function sleepMs(delayMs: number): Promise<void> {
   if (delayMs <= 0) {
     return Promise.resolve();
   }
-  return new Promise((resolve) => setTimeout(resolve, delayMs));
+  return new Promise((resolve) => {
+    setTimeout(resolve, delayMs);
+  });
+}
+
+function resolveRemainingBudgetMs(deadlineMs: number | undefined): number | undefined {
+  return deadlineMs === undefined ? undefined : Math.max(0, deadlineMs - Date.now());
 }
 
 async function probeLocalProviderEndpoint(params: {
@@ -212,6 +218,7 @@ export async function preflightCronModelProvider(params: {
   provider: string;
   model: string;
   nowMs?: number;
+  deadlineMs?: number;
 }): Promise<CronModelProviderPreflightResult> {
   const providerConfig = resolveProviderConfig(params.cfg, params.provider);
   if (!providerConfig) {
@@ -253,23 +260,48 @@ export async function preflightCronModelProvider(params: {
     });
   }
 
-  let result: EndpointPreflightResult;
   let lastError: unknown;
   let attempts = 0;
-  for (attempts = 1; attempts <= maxAttempts; attempts += 1) {
+  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
+    const remainingBudgetMs = resolveRemainingBudgetMs(params.deadlineMs);
+    if (remainingBudgetMs !== undefined && remainingBudgetMs <= 0) {
+      lastError = new Error("cron model preflight chain budget exhausted");
+      break;
+    }
+    attempts = attempt;
     try {
-      await probeLocalProviderEndpoint({ api, baseUrl, timeoutMs });
-      result = { status: "available" };
+      await probeLocalProviderEndpoint({
+        api,
+        baseUrl,
+        timeoutMs:
+          remainingBudgetMs === undefined ? timeoutMs : Math.min(timeoutMs, remainingBudgetMs),
+      });
+      const result: EndpointPreflightResult = { status: "available" };
       preflightCache.set(cacheKey, { checkedAtMs: nowMs, result });
       return { status: "available" };
     } catch (error) {
       lastError = error;
-      if (attempts < maxAttempts) {
-        await sleepMs(retryDelayMs);
+      if (attempt < maxAttempts) {
+        const remainingDelayBudgetMs = resolveRemainingBudgetMs(params.deadlineMs);
+        if (remainingDelayBudgetMs !== undefined && remainingDelayBudgetMs <= 0) {
+          lastError = new Error(
+            `cron model preflight chain budget exhausted after ${attempts} attempt${attempts === 1 ? "" : "s"}`,
+          );
+          break;
+        }
+        await sleepMs(
+          remainingDelayBudgetMs === undefined
+            ? retryDelayMs
+            : Math.min(retryDelayMs, remainingDelayBudgetMs),
+        );
       }
     }
   }
-  result = { status: "unavailable", error: lastError, attempts: maxAttempts };
+  const result: EndpointPreflightResult = {
+    status: "unavailable",
+    error: lastError ?? new Error("cron model preflight chain budget exhausted"),
+    attempts,
+  };
   preflightCache.set(cacheKey, { checkedAtMs: nowMs, result });
   return buildUnavailableResult({
     provider: params.provider,

src/cron/isolated-agent/run.ts

Lines changed: 3 additions & 0 deletions
@@ -113,6 +113,7 @@ const cronDeliveryRuntimeLoader = createLazyImportLoader(() => import("./run-del
 const cronModelPreflightRuntimeLoader = createLazyImportLoader(
   () => import("./model-preflight.runtime.js"),
 );
+const CRON_MODEL_PREFLIGHT_CHAIN_BUDGET_MS = 55_000;
 const runtimePluginsLoader = createLazyImportLoader(
   () => import("../../plugins/runtime-plugins.runtime.js"),
 );
@@ -662,6 +663,7 @@ async function prepareCronRunContext(params: {
     model,
     useSubagentFallbacks,
   });
+  const preflightDeadlineMs = Date.now() + CRON_MODEL_PREFLIGHT_CHAIN_BUDGET_MS;
   let selectedPreflightCandidate: { provider: string; model: string } | undefined;
   let selectedPreflightCandidateIndex = -1;
   let firstUnavailablePreflight:
@@ -672,6 +674,7 @@ async function prepareCronRunContext(params: {
       cfg: cfgWithAgentDefaults,
       provider: candidate.provider,
       model: candidate.model,
+      deadlineMs: preflightDeadlineMs,
     });
     if (candidatePreflight.status === "available") {
       selectedPreflightCandidate = candidate;

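The effect of passing one `deadlineMs` to every candidate can be illustrated with a small arithmetic sketch. It is a simplified model rather than the code in this commit: `clampToDeadline` and `planChainTimeouts` are hypothetical names, time is simulated instead of read from `Date.now()`, and each candidate is assumed to make a single attempt that fails after using its full clamped timeout. Only the 55s budget and the `Math.min(timeout, remaining)` clamp come from the diff.

```typescript
// Hypothetical illustration; not part of the OpenClaw codebase.
const CHAIN_BUDGET_MS = 55_000;

// Mirrors the clamp in the diff: an undefined deadline means "no shared budget".
function clampToDeadline(
  requestedMs: number,
  deadlineMs: number | undefined,
  nowMs: number,
): number {
  if (deadlineMs === undefined) {
    return requestedMs;
  }
  return Math.min(requestedMs, Math.max(0, deadlineMs - nowMs));
}

// Worst case: every probe consumes its full (clamped) timeout and fails.
// Returns the timeout each probed candidate would actually be given.
function planChainTimeouts(candidateCount: number, timeoutMs: number, startMs = 0): number[] {
  // One absolute deadline is computed up front and shared by all candidates.
  const deadlineMs = startMs + CHAIN_BUDGET_MS;
  const granted: number[] = [];
  let nowMs = startMs;
  for (let i = 0; i < candidateCount; i += 1) {
    const clamped = clampToDeadline(timeoutMs, deadlineMs, nowMs);
    if (clamped <= 0) {
      break; // budget exhausted: remaining candidates are not probed
    }
    granted.push(clamped);
    nowMs += clamped; // simulated time spent waiting on the failed probe
  }
  return granted;
}

console.log(planChainTimeouts(3, 30_000)); // [ 30000, 25000 ]
```

With three candidates at 30s each, the second probe is cut to 25s and the third is never attempted, so the chain as a whole stays within 55s even though each endpoint's configured window is valid on its own.
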