
[Bug]: Cron jobs enqueued but never execute — lane never dispatches, runningAtMs written on enqueue causes permanent stale marker loop #42960

@Captain-Scarlet

Description

Bug type

Regression (worked before, now fails)

Summary

Cron jobs return enqueued: true, whether fired by a scheduled trigger or by openclaw cron run, but never actually execute: no session starts, no LLM call is made, and no tool activity appears in the logs.

Steps to reproduce

  1. Have one or more cron jobs configured in jobs.json with enabled: true
  2. Run: openclaw cron run
  3. Observe response: {"ok": true, "enqueued": true, "runId": "manual:..."}
  4. Wait 15-30 seconds
  5. Run: openclaw logs --max-bytes 50000 | tail -40
  6. Observe: the log shows only heartbeat and timer-armed lines; no session start, no LLM call, no execution
  7. Run: openclaw gateway restart
  8. Observe log: "cron: clearing stale running marker on startup" for the same jobId
  9. Repeat steps 2-8; the cycle repeats indefinitely
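
When repeating steps 2-8, it helps to tie each log line to a specific enqueue. The small helper below is hypothetical (it is not part of OpenClaw): it parses the JSON printed in step 3 and decodes the runId. It assumes the runId layout <trigger>:<jobId>:<epochMs>:<sequence>, inferred from the single example in this report, whose embedded timestamp 1773215071210 converts to 07:44:31 UTC, the same time as the matching log line.

```typescript
// Hypothetical helper for the repro loop; none of these names are OpenClaw APIs.
interface CronRunResponse {
  ok: boolean;
  enqueued: boolean;
  runId: string;
}

interface DecodedRunId {
  trigger: string;      // e.g. "manual"
  jobId: string;        // may be elided ("...") in pasted output
  enqueuedAtMs: number; // assumed to be epoch milliseconds
  sequence: number;
}

// ASSUMPTION: runId looks like "<trigger>:<jobId>:<epochMs>:<sequence>".
function decodeRunId(runId: string): DecodedRunId {
  const parts = runId.split(":");
  if (parts.length < 4) {
    throw new Error(`unexpected runId shape: ${runId}`);
  }
  const sequence = Number(parts[parts.length - 1]);
  const enqueuedAtMs = Number(parts[parts.length - 2]);
  if (!Number.isInteger(sequence) || !Number.isInteger(enqueuedAtMs)) {
    throw new Error(`non-numeric runId fields: ${runId}`);
  }
  return {
    trigger: parts[0],
    // Rejoin the middle in case a job id ever contains ":".
    jobId: parts.slice(1, -2).join(":"),
    enqueuedAtMs,
    sequence,
  };
}

// Parse the stdout of step 3 and fail loudly if nothing was enqueued.
function parseCronRunResponse(stdout: string): CronRunResponse {
  const parsed = JSON.parse(stdout) as CronRunResponse;
  if (!parsed.ok || !parsed.enqueued) {
    throw new Error("cron run did not report an enqueued job");
  }
  return parsed;
}

// Example using the runId quoted in this report (job id left elided, as pasted).
const response = parseCronRunResponse(
  '{"ok": true, "enqueued": true, "runId": "manual:57731dfe-...:1773215071210:1"}',
);
const run = decodeRunId(response.runId);
console.log(run.trigger, run.jobId, new Date(run.enqueuedAtMs).toISOString(), run.sequence);
```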

Expected behavior

After openclaw cron run returns enqueued: true, the cron lane should pick up the job within seconds, start an isolated session, send the payload to the configured agent and model, and log execution activity including session start, LLM request, and completion.

Actual behavior

The job is enqueued and runningAtMs is immediately written to jobs.json, but the cron lane never dispatches it: no session is created and no LLM call is made. On the next gateway restart, OpenClaw logs "cron: clearing stale running marker on startup" for the job, and the same cycle repeats on the next enqueue. Deleting ~/.openclaw/cron/runs/ and manually removing runningAtMs from jobs.json does not resolve the issue: the marker is rewritten on the next enqueue and the job still never executes.
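
Because the marker is rewritten on every enqueue while no session ever starts, a lingering runningAtMs with no matching execution activity is the observable symptom. The sketch below flags such markers. It is a hypothetical diagnostic, not an OpenClaw command, and it assumes jobs.json is either an array of job objects or an object with a jobs array, each job carrying an id (id or jobId), enabled, and an optional numeric runningAtMs; only enabled, jobId, and runningAtMs are confirmed by this report.

```typescript
// Hypothetical diagnostic; the jobs.json layout below is an ASSUMPTION.
interface CronJob {
  id?: string;
  jobId?: string;
  enabled?: boolean;
  runningAtMs?: number;
}

interface MarkerReport {
  jobId: string;
  runningAtMs: number;
  ageMs: number;
}

// Accept either a bare array of jobs or an object with a "jobs" array,
// since the on-disk shape is not documented in this report.
function extractJobs(raw: unknown): CronJob[] {
  if (Array.isArray(raw)) return raw as CronJob[];
  if (raw !== null && typeof raw === "object" && Array.isArray((raw as { jobs?: unknown }).jobs)) {
    return (raw as { jobs: CronJob[] }).jobs;
  }
  throw new Error("unrecognized jobs.json layout");
}

// List enabled jobs whose runningAtMs marker is older than graceMs.
function findStaleMarkers(jobsJson: string, nowMs: number, graceMs: number): MarkerReport[] {
  const reports: MarkerReport[] = [];
  for (const job of extractJobs(JSON.parse(jobsJson))) {
    if (job.enabled === false || typeof job.runningAtMs !== "number") continue;
    const ageMs = nowMs - job.runningAtMs;
    if (ageMs > graceMs) {
      reports.push({ jobId: job.id ?? job.jobId ?? "(unknown)", runningAtMs: job.runningAtMs, ageMs });
    }
  }
  return reports;
}

// Example: the jobId and runningAtMs from the startup warning in this report,
// checked five minutes later with a 30 s grace period.
const sample = JSON.stringify({
  jobs: [{ id: "57731dfe-e8d7-4ba7-92ef-eb6e408919be", enabled: true, runningAtMs: 1773215045982 }],
});
const stale = findStaleMarkers(sample, 1773215045982 + 5 * 60_000, 30_000);
console.log(stale);
```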

OpenClaw version

2026.3.8 (3caab92)

Operating system

Ubuntu 24.04 LTS

Install method

npm install -g openclaw@latest, running as a systemd service via openclaw-gateway.service

Model

vllm/Qwen/Qwen3.5-9B (self-hosted via vLLM on RunPod, OpenAI-compatible endpoint)

Provider / routing chain

vllm provider → https://-8000.proxy.runpod.net/v1 → vLLM serving Qwen/Qwen3.5-9B

Config file / key location

~/.openclaw/openclaw.json → models.providers.vllm, plugins.entries, agents.smith-vciso

Additional provider/model setup details

The provider is defined with "api": "openai-completions", "contextWindow": 131072, and "maxTokens": 16000. Agent smith-vciso uses "model": "vllm/Qwen/Qwen3.5-9B". The same model works correctly in interactive chat sessions; only cron execution is broken. The issue reproduces with every cron job, regardless of which agent or model is configured.

Logs, screenshots, and evidence

07:43:36 warn cron clearing stale running marker on startup
  jobId: 57731dfe-e8d7-4ba7-92ef-eb6e408919be
  runningAtMs: 1773215045982

07:44:31 info { "ok": true, "enqueued": true,
  "runId": "manual:57731dfe-...:1773215071210:1" }

[next 5 minutes: only cron timer-armed + web heartbeat lines]
[no session start, no LLM call, no execution of any kind]
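
The startup warning can also be extracted mechanically, which makes it easier to compare runningAtMs values across several restart cycles. The parser below is hypothetical and keyed to the exact layout pasted above (time, warn, cron, then indented jobId: and runningAtMs: lines); OpenClaw's real log format may differ. runningAtIso is rendered in UTC, which may not match the timezone of the printed log time.

```typescript
// Hypothetical helper keyed to the pasted excerpt; not an OpenClaw API.
interface StaleMarkerEvent {
  loggedAt: string;     // wall-clock time as printed, e.g. "07:43:36"
  jobId: string;
  runningAtMs: number;
  runningAtIso: string; // runningAtMs rendered as UTC
}

function parseStaleMarkerWarnings(log: string): StaleMarkerEvent[] {
  const events: StaleMarkerEvent[] = [];
  const lines = log.split("\n");
  for (let i = 0; i < lines.length; i++) {
    // Accept both "cron clearing ..." and "cron: clearing ..." spellings.
    const header = lines[i].match(
      /^(\d{2}:\d{2}:\d{2})\s+warn\s+cron:?\s+clearing stale running marker on startup/,
    );
    if (!header) continue;
    // ASSUMPTION: the next two indented lines carry jobId and runningAtMs.
    const jobId = lines[i + 1]?.match(/jobId:\s*(\S+)/)?.[1];
    const ms = lines[i + 2]?.match(/runningAtMs:\s*(\d+)/)?.[1];
    if (jobId === undefined || ms === undefined) continue;
    const runningAtMs = Number(ms);
    events.push({
      loggedAt: header[1],
      jobId,
      runningAtMs,
      runningAtIso: new Date(runningAtMs).toISOString(),
    });
  }
  return events;
}

// The warning block exactly as it appears in this report.
const excerpt = [
  "07:43:36 warn cron clearing stale running marker on startup",
  "  jobId: 57731dfe-e8d7-4ba7-92ef-eb6e408919be",
  "  runningAtMs: 1773215045982",
].join("\n");

console.log(parseStaleMarkerWarnings(excerpt));
```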

Impact and severity

Affected: All cron jobs across all agents (100% failure rate)
Severity: Blocks the workflow entirely; the whole scheduled automation layer is non-functional
Frequency: Always reproduces, on every enqueue and every scheduled trigger
Consequence: All scheduled HR agent workflows (morning prompts, evening reminders, daily reports, escalations) fail silently. No WhatsApp messages sent to team members. No reports generated. Production automation completely down.

Additional information

Interactive chat sessions via WhatsApp and Telegram work correctly; only the cron execution lane is affected. The bug persists across gateway restarts, deletion of the runs/ directory, and manual removal of runningAtMs from jobs.json. Previously working cron jobs stopped executing after a period of LLM timeouts (the RunPod pod was stopped while cron jobs were scheduled to run), which suggests the lane may have entered a broken state from unresolved LLM timeout errors and never recovered.
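
The hypothesis above is that the lane broke after unresolved LLM timeouts. The sketch below shows one generic way a serial job lane can stall in-process: each job starts only after the previous one settles, so a job whose promise never settles (for example, a request to a stopped endpoint with no client-side timeout) blocks everything queued behind it, and bounding each job with a timeout lets the lane advance. SerialLane, withTimeout, withTimeoutTask, and the 50 ms limit are hypothetical illustrations, not OpenClaw internals; because this sketch keeps all state in memory, it would not by itself explain a stall that survives gateway restarts.

```typescript
// Hypothetical model of a single-slot job lane; not OpenClaw code.
type Task = () => Promise<void>;

// Rejects if `p` has not settled within `ms`, and clears the timer either way.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

function withTimeoutTask(task: Task, ms: number): Task {
  return () => withTimeout(task(), ms);
}

class SerialLane {
  private tail: Promise<void> = Promise.resolve();

  // Each task starts only after the previous one settles, so a task whose
  // promise never settles blocks every task queued behind it.
  enqueue(task: Task, timeoutMs?: number): void {
    const run: Task = timeoutMs === undefined ? task : withTimeoutTask(task, timeoutMs);
    this.tail = this.tail.then(run).catch((err: unknown) => {
      console.error("lane task failed:", err); // log and keep the lane moving
    });
  }

  idle(): Promise<void> {
    return this.tail;
  }
}

// Stands in for an LLM call to an endpoint that never answers.
const hung: Task = () => new Promise<void>(() => {});

async function demo(): Promise<string[]> {
  const ran: string[] = [];

  const unguarded = new SerialLane();
  unguarded.enqueue(hung);
  unguarded.enqueue(async () => { ran.push("unguarded job"); }); // never starts

  const guarded = new SerialLane();
  guarded.enqueue(hung, 50);
  guarded.enqueue(async () => { ran.push("guarded job"); });
  await guarded.idle();
  return ran;
}

demo().then((ran) => console.log(ran));
```

In this sketch only the guarded lane's second job runs; the unguarded lane's second job stays queued forever behind the hung task, with no error logged.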

Labels: bug (Something isn't working), regression (Behavior that previously worked and now fails)
