Cron model preflight skips entire run when local primary is unreachable, ignoring configured cloud fallbacks [AI] #79329

@cedricjanssens

Summary

When a cron job uses an agent whose model.primary is a local provider (e.g. ollama/gemma4:26b-nvfp4) and whose model.fallbacks lists a cloud provider (e.g. openrouter/nvidia/nemotron-3-super-120b-a12b:free), a temporarily unreachable local endpoint causes the cron run to be silently skipped with status: skipped instead of falling back to the healthy cloud provider.

The skip happens at preflight time inside the cron isolated-agent runner, before any model invocation. The fallback chain configured on the agent is never consulted.

This is distinct from the post-invocation fallback failures discussed in #44353 (provider-level errors) and #74985 (embedded agent timeout): here, the failure happens earlier — the preflight short-circuits the run.

Net effect for operators relying on local→cloud failover: when Ollama hiccups (busy, paused for upgrade, momentary network blip on 127.0.0.1), an entire scheduled cron run disappears, with no retry until the next scheduled tick. With 6 scheduled runs per day, a transient 5-minute Ollama outage can silently cost one full run, and the operational cost only lands the next morning when a watchdog finally fires.

Real behavior proof

A. Source-level proof

The preflight is implemented in src/cron/isolated-agent/model-preflight.runtime.ts (compiled at dist/model-preflight.runtime-D3BkBmU5.js):

// preflightCronModelProvider — params.provider/model = the *resolved primary*.
async function preflightCronModelProvider(params) {
    const providerConfig = resolveProviderConfig(params.cfg, params.provider);
    if (!providerConfig) return { status: "available" };
    const baseUrl = normalizeBaseUrl(providerConfig.baseUrl);
    const api = normalizeProbeApi(providerConfig);
    if (!baseUrl || !api || !isLocalProviderBaseUrl(baseUrl)) return { status: "available" };
    // ...probes baseUrl with 2.5s timeout, returns "unavailable" on failure...
}

The function only ever consults cfg.models.providers[params.provider].baseUrl. It never reads cfg.agents.list[*].model.fallbacks nor cfg.agents.defaults.model.fallbacks.
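To make the elided probe step concrete, here is a minimal sketch of a local-endpoint probe with a 2.5s timeout. It is not OpenClaw's implementation: the name probeLocalEndpoint, the injectable fetchImpl option, the returned lastError field, and the choice to treat any HTTP response as "reachable" are all assumptions made for illustration.

```javascript
// Hypothetical sketch of a local-endpoint probe; NOT OpenClaw's actual code.
// Assumption: any HTTP response (even an error status) counts as "reachable".
async function probeLocalEndpoint(baseUrl, { fetchImpl = fetch, timeoutMs = 2500 } = {}) {
    try {
        // AbortSignal.timeout aborts the request with a TimeoutError after timeoutMs.
        await fetchImpl(baseUrl, { signal: AbortSignal.timeout(timeoutMs) });
        return { status: "available" };
    } catch (err) {
        // Connection refused surfaces as "TypeError: fetch failed" in Node's fetch.
        return { status: "unavailable", baseUrl, lastError: `${err.name}: ${err.message}` };
    }
}
```

The fetchImpl parameter exists only so the probe can be exercised without a network; a call such as probeLocalEndpoint("http://127.0.0.1:11999") would use the global fetch.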

The caller in src/cron/isolated-agent/run.ts (compiled at dist/isolated-agent-DPJcOmiU.js:485-502) consumes only the status and short-circuits:

const preflight = await (await loadCronModelPreflightRuntime()).preflightCronModelProvider({
    cfg: cfgWithAgentDefaults,
    provider,    // resolved primary only
    model,
});
if (preflight.status === "unavailable") {
    logWarn(`[cron:${input.job.id}] ${preflight.reason}`);
    return {
        ok: false,
        result: withRunSession({
            status: "skipped",
            error: preflight.reason,
            diagnostics: createCronRunDiagnosticsFromError("model-preflight", preflight.reason, { severity: "warn" }),
            provider,
            model,
        })
    };
}

The return happens before any fallback resolver runs. The skip is final for this scheduled tick.

B. Standalone repro (no infra needed)

Save as repro.mjs and run with node repro.mjs:

import { preflightCronModelProvider } from "/opt/homebrew/lib/node_modules/openclaw/dist/model-preflight.runtime.js";

// Unused port → preflight probe fails (fetch failed) → status: "unavailable"
const cfg = {
    models: {
        providers: {
            ollama:     { api: "ollama", baseUrl: "http://127.0.0.1:11999" },
            openrouter: { api: "openai-completions", baseUrl: "https://openrouter.ai/api/v1" },
        },
    },
    agents: {
        list: [{
            id: "bourse",
            model: {
                primary: "ollama/gemma4:26b-nvfp4",
                fallbacks: ["openrouter/nvidia/nemotron-3-super-120b-a12b:free"],
            },
        }],
    },
};

const r = await preflightCronModelProvider({
    cfg, provider: "ollama", model: "gemma4:26b-nvfp4",
});
console.log(r);

Output (verbatim):

{
  status: 'unavailable',
  provider: 'ollama',
  model: 'gemma4:26b-nvfp4',
  baseUrl: 'http://127.0.0.1:11999',
  retryAfterMs: 300000,
  reason: 'Agent cron job uses ollama/gemma4:26b-nvfp4 but the local provider endpoint is not reachable at http://127.0.0.1:11999. Skipping this cron run; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TypeError: fetch failed'
}

cfg.agents.list[0].model.fallbacks is fully populated and points to a healthy cloud provider, but the preflight never reads it.
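For illustration, the ordered candidate list that the preflight could consult can be derived from the same cfg shape. This is a sketch under assumptions: parseModelRef and resolveModelCandidates are hypothetical names, splitting on the first slash is inferred from the repro (provider "ollama", model "gemma4:26b-nvfp4"), and the rule that agent-level fallbacks take precedence over cfg.agents.defaults is a guess rather than confirmed OpenClaw behavior.

```javascript
// Hypothetical helpers; names and the defaults precedence rule are assumptions.
// Split "provider/model" on the FIRST slash only, since model ids may contain slashes.
function parseModelRef(ref) {
    const slash = ref.indexOf("/");
    if (slash <= 0) throw new Error(`Invalid model ref: ${ref}`);
    return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Ordered candidates: primary first, then fallbacks (agent-level, else defaults).
function resolveModelCandidates(cfg, agentId) {
    const agent = cfg.agents?.list?.find((a) => a.id === agentId);
    const defaults = cfg.agents?.defaults?.model ?? {};
    const primary = agent?.model?.primary ?? defaults.primary;
    const fallbacks = agent?.model?.fallbacks ?? defaults.fallbacks ?? [];
    return [primary, ...fallbacks].filter(Boolean).map(parseModelRef);
}

const exampleCfg = {
    agents: {
        list: [{
            id: "bourse",
            model: {
                primary: "ollama/gemma4:26b-nvfp4",
                fallbacks: ["openrouter/nvidia/nemotron-3-super-120b-a12b:free"],
            },
        }],
    },
};

console.log(resolveModelCandidates(exampleCfg, "bourse"));
// [
//   { provider: 'ollama', model: 'gemma4:26b-nvfp4' },
//   { provider: 'openrouter', model: 'nvidia/nemotron-3-super-120b-a12b:free' }
// ]
```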

C. Production trace (real cron run, redacted IDs)

Cron marche-preopen-eu (agent bourse), agent config at the time:

{
  "id": "bourse",
  "model": {
    "primary":  "ollama/gemma4:26b-nvfp4",
    "fallbacks": ["openrouter/nvidia/nemotron-3-super-120b-a12b:free"]
  }
}

Run history entry (Ollama briefly busy at 07:45 due to concurrent cron consuming RAM):

{
  "ts": 1778132782311,
  "action": "finished",
  "status": "skipped",
  "error": "Agent cron job uses ollama/gemma4:26b-nvfp4 but the local provider endpoint is not reachable at http://127.0.0.1:11434. Skipping this cron run; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TimeoutError: request timed out",
  "diagnostics": {
    "summary": "Agent cron job uses ollama/gemma4:26b-nvfp4 but the local provider endpoint is not reachable at http://127.0.0.1:11434. Skipping this cron run; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TimeoutError: request timed out",
    "entries": [{
      "source": "model-preflight",
      "severity": "warn",
      "message": "Agent cron job uses ollama/gemma4:26b-nvfp4 but the local provider endpoint is not reachable at http://127.0.0.1:11434. Skipping this cron run; OpenClaw will retry the provider preflight on a later scheduled run. Last error: TimeoutError: request timed out"
    }]
  },
  "model": "gemma4:26b-nvfp4",
  "provider": "ollama"
}

Three minutes later, Ollama responded normally; OpenRouter Nemotron was healthy throughout. The configured fallback would have run the cron successfully.
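Until the preflight consults fallbacks, a watchdog could at least flag these skips directly from run history instead of waiting for a missing result. The sketch below relies only on the fields visible in the entry above (status, diagnostics.entries[].source); the function name isPreflightSkip is hypothetical, and the error strings in the sample entry are shortened.

```javascript
// Hypothetical helper: true when a run-history entry was skipped by the model preflight.
// Relies only on fields shown in the run-history entry above.
function isPreflightSkip(entry) {
    const entries = entry.diagnostics?.entries ?? [];
    return entry.status === "skipped"
        && entries.some((e) => e.source === "model-preflight");
}

// Trimmed copy of the production entry (long message strings shortened).
const sampleEntry = {
    ts: 1778132782311,
    action: "finished",
    status: "skipped",
    diagnostics: {
        entries: [{ source: "model-preflight", severity: "warn", message: "..." }],
    },
    model: "gemma4:26b-nvfp4",
    provider: "ollama",
};

console.log(isPreflightSkip(sampleEntry)); // true
```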

Verification

Reproducing the bug (no Ollama interference required)

  1. Confirm OpenClaw version: openclaw --version (tested on 2026.5.4).
  2. Save the standalone repro above as repro.mjs.
  3. Run: node repro.mjs.
  4. Observe status: "unavailable", even though cfg.agents.list[*].model.fallbacks is populated; the preflight never consults it.

End-to-end live verification (optional, requires controlled outage)

  1. Configure an agent with model.primary: "ollama/<model>" and model.fallbacks: ["<healthy cloud provider>/<model>"].
  2. Schedule a one-shot cron: openclaw cron add --agent <agent> --at 1m --message "..." --tools exec.
  3. Briefly stop Ollama (launchctl kill TERM gui/$UID/com.ollama on macOS, or systemctl stop ollama on Linux) ≈ 30s before the cron fires.
  4. Restart Ollama after the cron has fired.
  5. Inspect run with openclaw cron runs --id <id>.

Actual (current): status: "skipped", diagnostic source model-preflight, no fallback attempted.
Desired: status: "ok" with the fallback model used; or at minimum status: "skipped" only after the fallback chain has been exhausted.

Suggested fix (sketch — feedback welcome)

Two non-exclusive options:

  1. Defer preflight until after fallback resolution. Extend preflightCronModelProvider to receive the full fallback chain and walk it in order, returning available as soon as one candidate's local probe succeeds (or as soon as a cloud candidate is hit, since cloud preflight is currently a no-op). This keeps the existing semantics of "we only probe local providers".

  2. On unavailable, attempt fallback before returning skipped. In cron/isolated-agent/run.ts, when preflight is unavailable, look up the agent's model.fallbacks and rotate to the next candidate (re-running preflight for it if local). Only emit skipped when no candidate passes preflight.

Option 1 is preferred — it keeps the failure path centralized in one runtime and avoids racing with the in-flight fallback resolver used during agent invocation.
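Option 1 could look roughly like the following. This is a sketch, not a patch: preflightCandidateChain, the injected isLocal and probe callbacks, and the failures array are hypothetical names standing in for the existing local-URL check and probe, and the candidate shape ({ provider, model }) is assumed.

```javascript
// Hypothetical sketch of Option 1: walk the candidate chain in order.
// isLocal and probe are injected stand-ins for the existing local-URL check and probe.
async function preflightCandidateChain(candidates, { isLocal, probe }) {
    const failures = [];
    for (const candidate of candidates) {
        // Cloud candidates are not probed, mirroring the current no-op for non-local providers.
        if (!isLocal(candidate)) return { status: "available", candidate, failures };
        const result = await probe(candidate);
        if (result.status === "available") return { status: "available", candidate, failures };
        failures.push({ candidate, reason: result.reason });
    }
    // Only report "unavailable" once every candidate has failed its preflight.
    return { status: "unavailable", failures };
}

// Usage with stubbed dependencies: the local primary is down, the cloud fallback is picked.
const chain = [
    { provider: "ollama", model: "gemma4:26b-nvfp4" },
    { provider: "openrouter", model: "nvidia/nemotron-3-super-120b-a12b:free" },
];
preflightCandidateChain(chain, {
    isLocal: (c) => c.provider === "ollama",
    probe: async () => ({ status: "unavailable", reason: "fetch failed" }),
}).then((r) => console.log(r.status, r.candidate.provider)); // available openrouter
```

Returning the accumulated failures alongside the chosen candidate would let the caller log why the primary was bypassed while still running the job.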

Related

Environment

  • OpenClaw 2026.5.4 (commit 325df3e)
  • macOS Darwin arm64 (Apple Silicon M4 Pro)
  • Node.js 25.x (npm global install)
  • Affected runtime: dist/model-preflight.runtime-D3BkBmU5.js + dist/isolated-agent-DPJcOmiU.js
  • Tested with: provider: ollama baseUrl http://127.0.0.1:11999 (unused port, guarantees fetch failed)

Metadata

Labels

  • P1 - High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:auth-provider - Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
