Skip to content

[Bug] Cron spin loop via MIN_REFIRE_GAP_MS fallback when computeJobNextRunAtMs returns undefined (incomplete fix for #17821) #66019

@HongfeiXu

Description

@HongfeiXu

Summary

applyJobResult in src/cron/service/timer.ts conflates two semantics on the MIN_REFIRE_GAP_MS constant:

  1. Lower bound for the next cron fire, protecting against the case where computeJobNextRunAtMs returns a value in the same second as the run just finished (the #17821 fix).
  2. Fallback value for when computeJobNextRunAtMs returns undefined entirely.

The second usage is wrong: when the schedule cannot be computed at all, falling back to endedAt + 2s causes the scheduler to re-fire the job 2 seconds later. That run then fails/skips for the same reason, and we're in a tight loop. The structurally identical bug exists in the error-backoff branch.

#17821 is the closed predecessor and covers the "landed in the same second" case. It does not cover the undefined case, and the existing regression tests (timer.regression.test.ts: spin-loop-17821 and spin-gap-17821) do not exercise it.

Root cause walkthrough

1. Silent undefined return from computeNextRunAtMs

src/cron/schedule.tscomputeNextRunAtMs:

const cron = resolveCronFromSchedule(schedule);
if (!cron) return;
let next = cron.nextRun(new Date(nowMs));
if (!next) return;

If croner's cron.nextRun() returns null (observed for certain tz/ICU combinations — see environment below), the function silently returns undefined. No exception is thrown.

2. Silent fall-through in computeStaggeredCronNextRunAtMs

src/cron/schedule.tscomputeStaggeredCronNextRunAtMs:

for (let attempt = 0; attempt < 4; attempt += 1) {
  const baseNext = computeNextRunAtMs(job.schedule, cursorMs);
  if (baseNext === undefined) return;
  const shifted = baseNext + offsetMs;
  if (shifted > nowMs) return shifted;
  cursorMs = Math.max(cursorMs + 1, baseNext + 1000);
}
// implicit undefined return

The 4-attempt loop can exit without returning, implicitly giving undefined.

3. applyJobResult bad fallback in success path

src/cron/service/timer.ts, cron success branch:

} else if (isJobEnabled(job)) {
  let naturalNext: number | undefined;
  try {
    naturalNext = opts?.preserveSchedule && job.schedule.kind === "every"
      ? computeNextWithPreservedLastRun(result.endedAt)
      : computeJobNextRunAtMs(job, result.endedAt);
  } catch (err) {
    recordScheduleComputeError({ state, job, err });
    // no return; execution continues
  }
  if (job.schedule.kind === "cron") {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs =
      naturalNext !== undefined ? Math.max(naturalNext, minNext) : minNext;
    //                                                           ^^^^^^^ bug
  } else {
    job.state.nextRunAtMs = naturalNext;
  }
}

When naturalNext === undefined, nextRunAtMs becomes result.endedAt + 2000. Next tick (2 seconds later) the scheduler re-fires the job.

There's also a subtle secondary issue: recordScheduleComputeError already sets job.state.nextRunAtMs = undefined for the THROW case, but because the catch does not return, the subsequent assignment in the same branch overwrites it back to minNext.

4. Same shape in the error-backoff branch

} else if (result.status === "error" && isJobEnabled(job)) {
  // ...
  const backoffNext = result.endedAt + backoff;
  job.state.nextRunAtMs =
    normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext;
  //                                                             ^^^^^^^^^^^^ same issue
}

Here the fallback is errorBackoffMs(consecutiveErrors) rather than MIN_REFIRE_GAP_MS, which mitigates the tight loop via exponential backoff — but for the first few errors the interval is still short enough to cause noticeable channel spam, and the structural confusion is the same.

Observed symptom

A single cron job fires ~10 times in ~10 minutes. Each run completes in 4–6 seconds (agent turn duration); between runs the scheduler wakes up on its newly-set short nextRunAtMs and re-fires. Visible as channel spam (Telegram/WeChat messages repeating). On macOS specifically; same version on Ubuntu appears unaffected.

I don't have a deterministic minimal repro for naturalNext === undefined — a test * * * * * cron in Asia/Shanghai tz fired cleanly every 60s in my environment during verification. The undefined path appears to need a specific croner/ICU edge case that's not reliably triggered by a simple test job, but the structural bug is present in the code regardless of how often the path is hit.

Proposed fix

Treat undefined as "unknown, clear the schedule" rather than as "use the minimum gap". Success path:

if (job.schedule.kind === "cron") {
  if (naturalNext !== undefined) {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs = Math.max(naturalNext, minNext);
  } else {
    // Schedule computation failed; clearing nextRunAtMs avoids a
    // respin loop. The job will be rescheduled on the next heartbeat /
    // timer arm when the schedule can be recomputed from scratch.
    job.state.nextRunAtMs = undefined;
    state.deps.log.warn(
      { jobId: job.id, jobName: job.name },
      "cron: nextRun unresolved — clearing to avoid respin loop",
    );
  }
} else {
  job.state.nextRunAtMs = naturalNext;
}

The error-backoff branch should get a parallel treatment. Additionally the catch blocks around computeJobNextRunAtMs could early-return after recordScheduleComputeError (or set a flag checked below) so the subsequent assignment doesn't race with the error handler.

Suggested regression test

In src/cron/service/timer.regression.test.ts, alongside spin-loop-17821 / spin-gap-17821:

it("clears nextRunAtMs when schedule compute returns undefined (does not spin)", async () => {
  // Arrange: enabled cron job whose computeJobNextRunAtMs path returns undefined
  // (mock via a stubbed croner/schedule helper, or a test hook).
  // ...
  applyJobResult(state, job, { status: "ok", startedAt, endedAt });
  expect(job.state.nextRunAtMs).toBeUndefined();
});

An analogous test for the error-backoff branch would prevent regression in both paths.

Environment

  • OpenClaw: 2026.4.11 (installed via Homebrew, /opt/homebrew/lib/node_modules/openclaw/)
  • OS: macOS 26.4.1 (Darwin 25.4.0), arm64, MacBook Pro 14" (M5)
  • Node: v25.9.0
  • Affected job combinations observed: sessionTarget: isolated + payload.kind: agentTurn and sessionTarget: main + payload.kind: systemEvent + wakeMode: next-heartbeat
  • Not observed on Ubuntu running the same OpenClaw version, suggesting croner/ICU/tz edge case is platform-specific at the trigger layer — though the structural bug above is platform-independent.

Related

  • #17821 — closed; added the MIN_REFIRE_GAP_MS safety net for the same-second case. This issue reports the undefined case that fix missed.
  • #52097 — open; "Cron job runs repeatedly after manual trigger". Symptom similar, root cause not confirmed the same but worth cross-linking for triage.

Local patch (for reference)

I've been running the success-path fix above on the shipped bundle since the bug was diagnosed, with cron jobs behaving correctly. The new warn log line has not fired in my environment, consistent with the note above that the undefined path is rare on my specific setup — the patch functions as a safety net rather than a hot path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions