fix(cron): prevent scheduler death when startup catch-up fails by alexanderxfgl-bit · Pull Request #68112 · openclaw/openclaw

alexanderxfgl-bit · 2026-04-17T12:32:56Z

Summary

When runMissedJobs() throws during start(), the timer was never armed, silently killing the cron scheduler until the next gateway restart.

Root Cause

The start() function in src/cron/service/ops.ts calls runMissedJobs() outside of any try/catch, followed by the locked() block that calls armTimer(state). If runMissedJobs() throws (e.g. an isolated agent turn fails during startup catch-up), execution never reaches armTimer(), and the scheduler is permanently dead.

Changes

Guard armTimer in start() - Wrap runMissedJobs() in try/catch so the timer is always armed
Zombie detection in onTimer() - Clear stale runningAtMs markers at runtime (threshold: 2x job timeout or 2 min) so jobs recover without restart
Tests - Verify timer stays armed on catch-up failure and stale markers are cleared

Fixes #67854
Related: #59056, #60799

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d98b84d866

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T12:37:34Z

+        const threshold =
+          typeof jobTimeoutMs === "number" && jobTimeoutMs > 0
+            ? jobTimeoutMs * 2
+            : zombieHardCapMs;


Preserve running marker for no-timeout jobs

The new zombie-marker logic can clear runningAtMs after only 2 minutes when resolveCronJobTimeoutMs(job) returns undefined (for supported timeoutSeconds <= 0 no-timeout agent turns). During a long manual cron run, state.running is false, so onTimer() still executes; once this marker is cleared, collectRunnableJobs() can pick the same past-due job and start a second concurrent execution, causing duplicate sends/agent turns and inconsistent job state. This is a regression from the prior 2-hour stuck-marker window used by maintenance normalization.

Useful? React with 👍 / 👎.

Fixed in 5eb7e17. Jobs with resolveCronJobTimeoutMs returning undefined (timeoutSeconds <= 0, explicitly unlimited) are now skipped entirely in the zombie detection loop. Only jobs with a concrete timeout value will ever have their markers cleared.

greptile-apps · 2026-04-17T12:37:58Z

Greptile Summary

This PR fixes a silent cron scheduler death by wrapping runMissedJobs in a try/catch inside start(), and adds zombie detection to onTimer() to clear stale runningAtMs markers at runtime. The ops.ts fix is correct, but two issues require attention before merge:

The primary test ("arms timer even when runMissedJobs throws") is a false positive: the job routes through executeMainSessionCronJob (not executeDetachedCronJob), so runIsolatedAgentJob is never called, runMissedJobs never throws, and the try/catch guard is never exercised.
The zombie detection in onTimer() falls back to a 2-minute hard cap when resolveCronJobTimeoutMs returns undefined (jobs with timeoutSeconds <= 0), which can prematurely clear a live marker and cause a concurrent duplicate execution for explicitly-unlimited jobs.

Confidence Score: 4/5

The core try/catch fix is sound, but the zombie detection threshold bug can cause double-execution of unlimited-timeout jobs, and the primary test provides no real coverage of the fix.

Two P1 findings: a false-positive test that doesn't exercise the fix, and a zombie-detection threshold that can cause concurrent duplicate job executions for explicitly-configured unlimited-timeout jobs. The ops.ts change is correct in isolation.

src/cron/service/timer.ts (zombie threshold for undefined timeout) and src/cron/service.start-error-resilience.test.ts (false-positive test)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/cron/service.start-error-resilience.test.ts
Line: 15-70

Comment:
**Test does not exercise the try/catch fix**

The job uses `sessionTarget: "main"` with `payload.kind: "systemEvent"`, which routes through `executeMainSessionCronJob`. That path calls `enqueueSystemEvent`/`requestHeartbeatNow` and returns `{ status: "ok" }` — it never calls `runIsolatedAgentJob`. The throwing mock has no effect, `runMissedJobs` completes normally, and the try/catch guard added by this PR is never triggered. The test would pass identically with or without the fix.

To actually exercise the guard, the job would need to cause an infrastructure-level failure in `runMissedJobs` (e.g., mock `persist` or `ensureLoaded` to throw), since individual job-execution errors are already caught inside `runStartupCatchupCandidate` and returned as an error result rather than re-thrown.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 712-723

Comment:
**2-minute hard cap fires for explicitly-disabled-timeout jobs**

`resolveCronJobTimeoutMs` returns `undefined` when a job is configured with `timeoutSeconds <= 0` (explicitly unlimited). The zombie threshold in that branch falls back to `MAX_TIMER_DELAY_MS * 2 = 120 s`. A job intentionally running for hours without a timeout will have its `runningAtMs` marker cleared after 2 minutes, allowing `collectRunnableJobs` to pick it up again on the very next tick and start a second concurrent instance.

The fix should skip zombie detection (or use a much larger sentinel) when `jobTimeoutMs` is `undefined`, consistent with the semantics of `executeJobCoreWithTimeout` which already skips the timeout entirely for `undefined`:

```ts
const threshold =
  typeof jobTimeoutMs === "number" && jobTimeoutMs > 0
    ? jobTimeoutMs * 2
    : zombieHardCapMs;
```
→ change to skip the zombie check entirely when `jobTimeoutMs` is `undefined`:
```ts
if (typeof jobTimeoutMs !== "number" || jobTimeoutMs <= 0) {
  continue; // no timeout configured → never treat as zombie
}
const threshold = jobTimeoutMs * 2;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service.start-error-resilience.test.ts
Line: 4

Comment:
**Unused import**

`CronService` is imported but not referenced anywhere in the test file.

```suggestion
```
(remove the import)

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service.start-error-resilience.test.ts
Line: 72-121

Comment:
**Second test covers pre-existing startup cleanup, not the new `onTimer()` zombie detection**

The `start()` function already clears stale `runningAtMs` markers in its initial `locked()` block before `runMissedJobs` runs. This test exercises that existing path, not the new zombie-detection logic added to `onTimer()`. The new `onTimer()` code (the `for (const job of state.store?.jobs ?? [])` loop that clears markers at runtime) has no test coverage in this PR.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix(cron): prevent scheduler death when ..." | Re-trigger Greptile}

greptile-apps · 2026-04-17T12:38:02Z

+  it("arms timer even when runMissedJobs throws", async () => {
+    const storePath = makeStorePath();
+    await fs.mkdir(path.dirname(storePath), { recursive: true });
+
+    // Write a store with one overdue job so runMissedJobs has work to do
+    const overdueAtMs = Date.now() - 120_000;
+    await fs.writeFile(
+      storePath,
+      JSON.stringify(
+        {
+          version: 1,
+          jobs: [
+            {
+              id: "test-job-1",
+              name: "overdue-job",
+              enabled: true,
+              createdAtMs: overdueAtMs - 60_000,
+              updatedAtMs: overdueAtMs - 60_000,
+              schedule: { kind: "every", everyMs: 60_000, anchorMs: overdueAtMs - 60_000 },
+              sessionTarget: "main",
+              wakeMode: "next-heartbeat",
+              payload: { kind: "systemEvent", text: "tick" },
+              state: { nextRunAtMs: overdueAtMs },
+            },
+          ],
+        },
+        null,
+        2,
+      ),
+      "utf-8",
+    );
+
+    const enqueueSystemEvent = vi.fn();
+    const requestHeartbeatNow = vi.fn();
+    const runIsolatedAgentJob = vi.fn(async () => {
+      throw new Error("simulated catastrophic failure in missed job execution");
+    });
+
+    const state = createCronServiceState({
+      storePath,
+      cronEnabled: true,
+      log: noopLogger,
+      enqueueSystemEvent: enqueueSystemEvent as never,
+      requestHeartbeatNow: requestHeartbeatNow as never,
+      runIsolatedAgentJob: runIsolatedAgentJob as never,
+    });
+
+    // start() should NOT throw even though runMissedJobs will fail
+    await start(state);
+
+    // The critical assertion: the timer MUST be armed despite the error
+    expect(state.timer).not.toBeNull();
+    // The store should still be loaded
+    expect(state.store).not.toBeNull();
+    expect(state.store!.jobs.length).toBe(1);
+  });


Test does not exercise the try/catch fix

The job uses sessionTarget: "main" with payload.kind: "systemEvent", which routes through executeMainSessionCronJob. That path calls enqueueSystemEvent/requestHeartbeatNow and returns { status: "ok" } — it never calls runIsolatedAgentJob. The throwing mock has no effect, runMissedJobs completes normally, and the try/catch guard added by this PR is never triggered. The test would pass identically with or without the fix.

To actually exercise the guard, the job would need to cause an infrastructure-level failure in runMissedJobs (e.g., mock persist or ensureLoaded to throw), since individual job-execution errors are already caught inside runStartupCatchupCandidate and returned as an error result rather than re-thrown.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service.start-error-resilience.test.ts Line: 15-70 Comment: **Test does not exercise the try/catch fix** The job uses `sessionTarget: "main"` with `payload.kind: "systemEvent"`, which routes through `executeMainSessionCronJob`. That path calls `enqueueSystemEvent`/`requestHeartbeatNow` and returns `{ status: "ok" }` — it never calls `runIsolatedAgentJob`. The throwing mock has no effect, `runMissedJobs` completes normally, and the try/catch guard added by this PR is never triggered. The test would pass identically with or without the fix. To actually exercise the guard, the job would need to cause an infrastructure-level failure in `runMissedJobs` (e.g., mock `persist` or `ensureLoaded` to throw), since individual job-execution errors are already caught inside `runStartupCatchupCandidate` and returned as an error result rather than re-thrown. How can I resolve this? If you propose a fix, please make it concise.

Fixed in 5eb7e17. The test now:

Uses an isolated agentTurn job (so the missed-job code path is exercised)

Mocks fs.writeFile to fail on the second write attempt — this causes persist() inside applyStartupCatchupOutcomes to throw, which propagates through runMissedJobs() and triggers the try/catch guard in start()

Verifies state.timer is not null (timer was armed despite the failure)

This provides real coverage of the fix.

greptile-apps · 2026-04-17T12:38:03Z

+      const zombieHardCapMs = MAX_TIMER_DELAY_MS * 2;
+      let clearedZombies = false;
+      for (const job of state.store?.jobs ?? []) {
+        if (typeof job.state.runningAtMs !== "number") {
+          continue;
+        }
+        const elapsed = dueCheckNow - job.state.runningAtMs;
+        const jobTimeoutMs = resolveCronJobTimeoutMs(job);
+        const threshold =
+          typeof jobTimeoutMs === "number" && jobTimeoutMs > 0
+            ? jobTimeoutMs * 2
+            : zombieHardCapMs;


2-minute hard cap fires for explicitly-disabled-timeout jobs

resolveCronJobTimeoutMs returns undefined when a job is configured with timeoutSeconds <= 0 (explicitly unlimited). The zombie threshold in that branch falls back to MAX_TIMER_DELAY_MS * 2 = 120 s. A job intentionally running for hours without a timeout will have its runningAtMs marker cleared after 2 minutes, allowing collectRunnableJobs to pick it up again on the very next tick and start a second concurrent instance.

The fix should skip zombie detection (or use a much larger sentinel) when jobTimeoutMs is undefined, consistent with the semantics of executeJobCoreWithTimeout which already skips the timeout entirely for undefined:

const threshold = typeof jobTimeoutMs === "number" && jobTimeoutMs > 0 ? jobTimeoutMs * 2 : zombieHardCapMs;

→ change to skip the zombie check entirely when jobTimeoutMs is undefined:

if (typeof jobTimeoutMs !== "number" || jobTimeoutMs <= 0) { continue; // no timeout configured → never treat as zombie } const threshold = jobTimeoutMs * 2;

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service/timer.ts Line: 712-723 Comment: **2-minute hard cap fires for explicitly-disabled-timeout jobs** `resolveCronJobTimeoutMs` returns `undefined` when a job is configured with `timeoutSeconds <= 0` (explicitly unlimited). The zombie threshold in that branch falls back to `MAX_TIMER_DELAY_MS * 2 = 120 s`. A job intentionally running for hours without a timeout will have its `runningAtMs` marker cleared after 2 minutes, allowing `collectRunnableJobs` to pick it up again on the very next tick and start a second concurrent instance. The fix should skip zombie detection (or use a much larger sentinel) when `jobTimeoutMs` is `undefined`, consistent with the semantics of `executeJobCoreWithTimeout` which already skips the timeout entirely for `undefined`: ```ts const threshold = typeof jobTimeoutMs === "number" && jobTimeoutMs > 0 ? jobTimeoutMs * 2 : zombieHardCapMs; ``` → change to skip the zombie check entirely when `jobTimeoutMs` is `undefined`: ```ts if (typeof jobTimeoutMs !== "number" || jobTimeoutMs <= 0) { continue; // no timeout configured → never treat as zombie } const threshold = jobTimeoutMs * 2; ``` How can I resolve this? If you propose a fix, please make it concise.

Fixed in 5eb7e17. Applied the exact suggestion — jobs with undefined/no timeout are now continued past, so they're never treated as zombies. The zombieHardCapMs fallback has been removed entirely.

Also added a test (does NOT clear runningAtMs for unlimited-timeout jobs) that verifies a job with timeoutSeconds: 0 running for 2+ hours keeps its marker intact.

greptile-apps · 2026-04-17T12:38:04Z

+import fs from "node:fs/promises";
+import path from "node:path";
+import { describe, expect, it, vi } from "vitest";
+import { CronService } from "./service.js";


Unused import

CronService is imported but not referenced anywhere in the test file.

Suggested change

import { CronService } from "./service.js";

(remove the import)

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service.start-error-resilience.test.ts Line: 4 Comment: **Unused import** `CronService` is imported but not referenced anywhere in the test file. ```suggestion ``` (remove the import) How can I resolve this? If you propose a fix, please make it concise.

Fixed in 5eb7e17. Removed the unused CronService import.

greptile-apps · 2026-04-17T12:38:05Z

+  it("clears stale runningAtMs markers on startup", async () => {
+    const storePath = makeStorePath();
+    await fs.mkdir(path.dirname(storePath), { recursive: true });
+
+    const staleRunningAtMs = Date.now() - 3_600_000; // 1 hour ago
+    await fs.writeFile(
+      storePath,
+      JSON.stringify(
+        {
+          version: 1,
+          jobs: [
+            {
+              id: "stale-job",
+              name: "stale-running-job",
+              enabled: true,
+              createdAtMs: staleRunningAtMs - 60_000,
+              updatedAtMs: staleRunningAtMs,
+              schedule: { kind: "cron", expr: "*/5 * * * *" },
+              sessionTarget: "main",
+              wakeMode: "next-heartbeat",
+              payload: { kind: "systemEvent", text: "tick" },
+              state: {
+                nextRunAtMs: Date.now() - 300_000,
+                runningAtMs: staleRunningAtMs,
+              },
+            },
+          ],
+        },
+        null,
+        2,
+      ),
+      "utf-8",
+    );
+
+    const state = createCronServiceState({
+      storePath,
+      cronEnabled: true,
+      log: noopLogger,
+      enqueueSystemEvent: vi.fn() as never,
+      requestHeartbeatNow: vi.fn() as never,
+      runIsolatedAgentJob: vi.fn(async () => ({ status: "ok" as const })) as never,
+    });
+
+    await start(state);
+
+    // The stale runningAtMs should be cleared
+    expect(state.store!.jobs[0].state.runningAtMs).toBeUndefined();
+    // Timer should be armed
+    expect(state.timer).not.toBeNull();
+  });


Second test covers pre-existing startup cleanup, not the new onTimer() zombie detection

The start() function already clears stale runningAtMs markers in its initial locked() block before runMissedJobs runs. This test exercises that existing path, not the new zombie-detection logic added to onTimer(). The new onTimer() code (the for (const job of state.store?.jobs ?? []) loop that clears markers at runtime) has no test coverage in this PR.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service.start-error-resilience.test.ts Line: 72-121 Comment: **Second test covers pre-existing startup cleanup, not the new `onTimer()` zombie detection** The `start()` function already clears stale `runningAtMs` markers in its initial `locked()` block before `runMissedJobs` runs. This test exercises that existing path, not the new zombie-detection logic added to `onTimer()`. The new `onTimer()` code (the `for (const job of state.store?.jobs ?? [])` loop that clears markers at runtime) has no test coverage in this PR. How can I resolve this? If you propose a fix, please make it concise.

Fixed in 5eb7e17. Added two new tests in a separate describe block CronService onTimer() zombie detection:

clears zombie runningAtMs markers for jobs that exceeded their timeout — exercises the new onTimer() zombie-detection code by setting a stale runningAtMs on a job that's past due, then calling onTimer() to verify the marker is cleared.

does NOT clear runningAtMs for unlimited-timeout jobs — verifies the zombie detection skips jobs with timeoutSeconds: 0.

The existing startup cleanup test is kept as-is since it still provides value (regression guard for the startup path).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5eb7e17ca3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T15:00:21Z

+  } catch (err) {
+    state.deps.log.error(
+      { err: String(err) },
+      "cron: startup catch-up failed; arming timer anyway to keep scheduler alive",
+    );


Recover startup catch-up state before swallowing errors

start() now always continues after runMissedJobs() throws, but runMissedJobs() performs non-atomic writes: planStartupCatchup() persists runningAtMs markers before execution, and a later failure (for example in applyStartupCatchupOutcomes persistence) can leave those markers on disk. Continuing startup from this catch block arms the scheduler against that partially-updated state, so onTimer() force-reloads jobs that still look running and skips them, which can stall missed jobs for a long time (or replay them later once a stale marker is eventually cleared). The catch path should repair/reload state (or rethrow) before proceeding.

Useful? React with 👍 / 👎.

Fixed in 820a84ef. The catch block now repairs partial state left by the failed catch-up:

Clears stale runningAtMs markers — planStartupCatchup() persists runningAtMs before execution. If a later step throws, the catch block reloads from disk and clears those markers so jobs can run on the very next tick instead of being stuck until the runtime zombie detector fires.

Clears nextRunAtMs for interrupted one-shot jobs — interruptedOneShotIds was computed in the first locked() block but never persisted if runMissedJobs() throws. The catch block clears nextRunAtMs for those one-shot jobs to prevent double-execution on the first tick.

Added a dedicated test (repairs one-shot job state when runMissedJobs throws) that verifies both runningAtMs and nextRunAtMs are cleared for one-shot jobs after a catch-up failure.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48dc91dbc3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T15:16:48Z

+  } catch (err) {
+    state.deps.log.error(
+      { err: String(err) },
+      "cron: startup catch-up failed; arming timer anyway to keep scheduler alive",
+    );


Repair startup catch-up state before swallowing errors

This catch path logs and continues even when runMissedJobs fails after planStartupCatchup() has already persisted runningAtMs markers but before applyStartupCatchupOutcomes() can clear/finalize them. In that partial-write state, subsequent ticks reload the persisted markers and treat those jobs as already running, so missed runs are deferred until stale-marker cleanup (and unlimited-timeout jobs are exempt from the new timeout-based zombie clear), which can leave startup catch-up work stuck for a long window after a transient persistence failure. The startup error path should reload/repair marker state (or rethrow) before arming the timer.

Useful? React with 👍 / 👎.

Fixed in 820a84ef. The catch block now repairs partial state left by the failed catch-up:

Clears stale runningAtMs markers — planStartupCatchup() persists runningAtMs before execution. If a later step throws, the catch block reloads from disk and clears those markers so jobs can run on the very next tick instead of being stuck until the runtime zombie detector fires.

Clears nextRunAtMs for interrupted one-shot jobs — interruptedOneShotIds was computed in the first locked() block but never persisted if runMissedJobs() throws. The catch block clears nextRunAtMs for those one-shot jobs to prevent double-execution on the first tick.

Added a dedicated test (repairs one-shot job state when runMissedJobs throws) that verifies both runningAtMs and nextRunAtMs are cleared for one-shot jobs after a catch-up failure.

chatgpt-codex-connector

💡 Codex Review

https://github.com/openclaw/openclaw/blob/1dc05ce5208f18b7bf8466855a713bfec5936a86/src/cron/service/ops.ts#L153
Skip recompute when startup catch-up fails

After runMissedJobs is caught and startup continues, this unconditional recomputeNextRuns(state) can advance already-due nextRunAtMs values to future slots even though startup catch-up did not complete. In the failure path this commit introduces (for example, a persist failure during catch-up), that converts missed executions into silently skipped runs rather than recoverable delayed work, because later ticks no longer see those jobs as due once markers are eventually cleared. This recompute/persist step should be conditional on successful catch-up (or use a maintenance-only recompute that preserves due slots).

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd0f787c77

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T15:41:01Z

+  } catch (err) {
+    state.deps.log.error(
+      { err: String(err) },
+      "cron: startup catch-up failed; arming timer anyway to keep scheduler alive",
+    );


Preserve interrupted one-shot skip list after catch-up failure

If runMissedJobs() throws (for example after planStartupCatchup() has already cleared startup runningAtMs markers), this catch path logs and proceeds to arm the scheduler without preserving interruptedOneShotIds. Those one-shot jobs now have no runningAtMs and still have past-due nextRunAtMs, so the first normal timer tick can execute them via collectRunnableJobs()/isRunnableJob(), violating the intended "one-shot jobs are not retried after interruption" behavior and potentially duplicating one-shot side effects.

Useful? React with 👍 / 👎.

Fixed in 820a84ef. The catch block now repairs partial state left by the failed catch-up:

Clears stale runningAtMs markers — planStartupCatchup() persists runningAtMs before execution. If a later step throws, the catch block reloads from disk and clears those markers so jobs can run on the very next tick instead of being stuck until the runtime zombie detector fires.

Clears nextRunAtMs for interrupted one-shot jobs — interruptedOneShotIds was computed in the first locked() block but never persisted if runMissedJobs() throws. The catch block clears nextRunAtMs for those one-shot jobs to prevent double-execution on the first tick.

Added a dedicated test (repairs one-shot job state when runMissedJobs throws) that verifies both runningAtMs and nextRunAtMs are cleared for one-shot jobs after a catch-up failure.

alexanderxfgl-bit · 2026-04-17T15:54:18Z

CI Status Update

✅ My changes pass: `checks-node-core-runtime` (all 6 tests green)

❌ Pre-existing upstream failures (also failing on latest main `1fad8efa`)

The check, check-additional, and checks-fast-contracts-protocol failures are all pre-existing TypeScript errors in the upstream main branch, not introduced by this PR:

extensions/discord/src/monitor/agent-components.deps.runtime.ts - missing exports
src/commands/agent.acp.test.ts - type mismatch
src/commands/channel-setup/plugin-install.test.ts - spread argument issue
src/commands/message.test.ts - missing property cfg
src/commands/status.test.ts - wrong property name

Verified: the latest CI run on main (commit 1fad8efa, workflow run 24573839483) also shows failure for the same reasons.

Changes in this push (rebased + fixed tests)

Rebased onto latest main (acace04)
Extracted clearZombieRunningMarkers() into a standalone exported function for testability
Fixed unit tests to use agentTurn payloads so timeoutSeconds is respected by resolveCronJobTimeoutMs
All review feedback addressed (see replies to each comment)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 820a84efa1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T16:16:46Z

+    // would be re-executed on the first tick.  Clear their nextRunAtMs to
+    // prevent double-execution.
+    await locked(state, async () => {
+      await ensureLoaded(state, { skipRecompute: true });


Force-reload store before repairing catch-up failure

When runMissedJobs() fails after applyStartupCatchupOutcomes() has already updated state.store in memory but before that state is written, this block can incorrectly conclude nothing is dirty and skip repair persistence. Because ensureLoaded(state, { skipRecompute: true }) does not reload when state.store is already set, the catch path never sees stale on-disk runningAtMs markers from planStartupCatchup(), and the next onTimer() force-reload treats those jobs as still running and skips them for a long window.

Useful? React with 👍 / 👎.

Fixed in 7958f679. Added forceReload: true to the ensureLoaded call in the catch block so it reloads from disk and sees the stale runningAtMs markers that planStartupCatchup() persisted, rather than relying on the in-memory state which may have been updated by applyStartupCatchupOutcomes() before the write failed.

chatgpt-codex-connector · 2026-04-17T16:16:46Z

+        }
+      }
+      if (dirty) {
+        await persist(state);


Keep timer arming resilient to repair persist errors

This recovery block can still prevent scheduler startup: if runMissedJobs() threw due a write failure, persist(state) here can throw the same error and bubble out of start() before the later armTimer(state) call. In that scenario the scheduler still dies until restart, so this cleanup persist needs to be best-effort (or separately caught) to preserve the intended "arm timer anyway" behavior.

Useful? React with 👍 / 👎.

Fixed in 7958f679. The repair locked() block is now wrapped in a nested try/catch. If the repair persist also fails (e.g. transient I/O), the error is logged and the scheduler proceeds to arm the timer. Added a dedicated test (arms timer even when repair persist also fails) that verifies both error paths are logged and the timer is still armed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 215cae0f61

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T16:24:20Z

+      if (dirty) {
+        await persist(state);
+      }


Guard catch-path persist so startup still arms timer

The startup recovery path still performs await persist(state) without a nested safeguard. If runMissedJobs() failed because writes are currently failing (for example transient I/O or permission errors), this second persist can throw and abort start() before control reaches the later block that calls armTimer(state), so the scheduler can still remain dead under the same class of startup failure this change is meant to tolerate.

Useful? React with 👍 / 👎.

Fixed in 7958f679. Same fix as above - nested try/catch around the repair persist ensures startup always reaches armTimer. See the test arms timer even when repair persist also fails for coverage.

chatgpt-codex-connector · 2026-04-17T16:24:21Z

+      // Allow the first write (startup stale-marker cleanup) and the second
+      // write (planStartupCatchup sets runningAtMs). Fail the third write
+      // (applyStartupCatchupOutcomes persist) to trigger the catch path.
+      if (writeCount > 2 && typeof file === "string" && file.includes("cron")) {
+        throw new Error("simulated disk failure during persist");


Make the resilience test actually fail runMissedJobs

This test's failure injection does not match the fixture behavior: the job setup here does not trigger a startup stale-marker persist, so there are only two cron-store writes before start() completes. With writeCount > 2, no write is failed, meaning the runMissedJobs() catch path is never exercised and the test can pass even when startup error-recovery logic is broken.

Useful? React with 👍 / 👎.

Fixed in 7958f679. The job fixture has no initial runningAtMs, so the startup stale-marker cleanup does not persist. Writes are: (1) planStartupCatchup persist, (2) applyStartupCatchupOutcomes persist. Changed the threshold to fail on write #2 (writeCount === 2) so the catch path is actually exercised. Also updated the comment to document the actual write sequence.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7958f679f0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T16:34:26Z

+    // zombie detector in onTimer() will eventually clean up stale markers.
+    try {
+      await locked(state, async () => {
+        await ensureLoaded(state, { forceReload: true, skipRecompute: true });


Preserve applied catch-up outcomes during repair

When runMissedJobs() fails after applyStartupCatchupOutcomes() has already updated state.store in memory but before that write is persisted, this forced reload discards those applied outcomes and reverts to the pre-outcome disk snapshot. The subsequent repair only clears runningAtMs, so jobs that already executed can become runnable again (recurring jobs remain past-due; one-shot jobs can remain enabled) and re-run on the next tick, duplicating side effects such as sends/agent turns. The repair path should keep or merge the in-memory outcomes instead of replacing them with a stale reload.

Useful? React with 👍 / 👎.

Fixed in acd38664. Removed forceReload from the catch block ensureLoaded. If applyStartupCatchupOutcomes already updated in-memory state (cleared runningAtMs for completed jobs, advanced nextRunAtMs) before its persist threw, reloading would discard those outcomes and make completed jobs runnable again. Now we work with the in-memory state and just clear remaining runningAtMs markers from incomplete jobs. Also added one-shot protection: when clearing runningAtMs on at-kind jobs, nextRunAtMs is also cleared to prevent re-execution.

chatgpt-codex-connector · 2026-04-17T16:34:26Z

+        { jobId: job.id, jobName: job.name, elapsed, threshold },
+        "cron: clearing zombie running marker",
+      );
+      job.state.runningAtMs = undefined;


Exclude one-shot jobs from zombie-marker clearing

This zombie cleanup clears runningAtMs for any timed job, including one-shot (schedule.kind === "at") jobs. If a one-shot execution completed but its outcome persist failed, the store can retain a stale runningAtMs plus past-due nextRunAtMs; once this marker is cleared, normal timer selection treats that one-shot as runnable and executes it again, violating the intended no-retry-after-interruption behavior and risking duplicate side effects. The zombie path should skip one-shots or clear their runnable state together with the marker.

Useful? React with 👍 / 👎.

Fixed in acd38664. clearZombieRunningMarkers now clears nextRunAtMs alongside runningAtMs for one-shot (kind=at) jobs, preventing the zombie-clear path from making a one-shot runnable again after its outcome persist failed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: acd38664b3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T16:46:17Z

+          if (interruptedOneShotIds.has(job.id) && typeof job.state.nextRunAtMs === "number") {
+            job.state.nextRunAtMs = undefined;
+            dirty = true;


Preserve one-shot skip state after catch-up failure

When runMissedJobs() throws, this block clears nextRunAtMs for interrupted one-shot jobs, but start() then unconditionally calls recomputeNextRuns(state) in the next locked section, which repopulates missing nextRunAtMs for schedule.kind === "at" (unless lastStatus === "ok"). In the common interrupted-run case (lastStatus unset/error), those one-shots become runnable again on the first timer tick and can execute twice, which defeats the intended "do not retry interrupted one-shot" behavior in this recovery path.

Useful? React with 👍 / 👎.

Fixed in 45329efe. recomputeNextRuns in the final locked() block now checks interruptedOneShotIds and re-clears nextRunAtMs for those jobs after recompute runs. This prevents computeJobNextRunAtMs from repopulating nextRunAtMs for one-shots where lastStatus !== ok (which always returns atMs for kind=at jobs).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23e2796e3b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T17:24:17Z

+      if (job.schedule.kind === "at" && typeof job.state.nextRunAtMs === "number") {
+        job.state.nextRunAtMs = undefined;
+      }


Preserve one-shot skip state after zombie marker cleanup

Clearing nextRunAtMs here is not sufficient to prevent one-shot re-execution: in the same timer tick, onTimer() can call recomputeNextRunsForMaintenance(...), and for enabled schedule.kind === "at" jobs with lastStatus !== "ok", computeJobNextRunAtMs repopulates nextRunAtMs with the original at time. That makes the zombie-cleared one-shot runnable again on the next tick, so interrupted one-shots can execute twice (duplicate sends/side effects) after a stale-marker recovery.

Useful? React with 👍 / 👎.

Fixed in fb1420fe. After recomputeNextRunsForMaintenance runs in the empty-due branch, I re-check one-shot jobs that had zombie markers cleared. For kind=at jobs with no runningAtMs, no lastRunAtMs, and lastStatus !== ok, nextRunAtMs is cleared again to prevent the interrupted one-shot from being re-enabled. This guard only runs when zombies were actually cleared (clearedZombies flag), so normal maintenance recomputation is unaffected.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5ac147869

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T17:37:11Z

    if (changed) {
      await persist(state);
    }


Arm timer even when post-recompute persist fails

This block can still leave cron dead on startup: the new interrupted-one-shot cleanup sets changed = true, which forces persist(state) before armTimer(state). If that write throws (for example, a transient I/O failure after runMissedJobs was already caught), start() exits before arming the timer, so scheduling remains down until restart. Making this persist best-effort or guaranteeing armTimer in a finally path would preserve the intended resilience.

Useful? React with 👍 / 👎.

Fixed in 8eec88a3. Wrapped the final locked block's persist in try/catch with error logging, so armTimer always executes regardless of persist failure.

Fixed in 8eec88a3. Wrapped the final locked block persist in try/catch with error logging, so armTimer always executes regardless of persist failure.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb1420feba

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T17:45:46Z

+              typeof job.state.lastRunAtMs !== "number"
+            ) {


Preserve interrupted one-shot retries after zombie cleanup

Fresh evidence beyond the earlier thread: this new guard only re-clears nextRunAtMs when lastRunAtMs is missing, so interrupted one-shot retry runs (which already have lastRunAtMs from a prior failed attempt) are excluded. In that case, recomputeNextRunsForMaintenance repopulates nextRunAtMs from the original at schedule, making the job runnable again on the next tick and allowing duplicate side effects after an interrupted run.

Useful? React with 👍 / 👎.

Fixed in 4976ce7a and a4d57d27. Removed the lastRunAtms check from the one-shot zombie guard so interrupted retries with lastRunAtms from prior failed attempts are also covered. clearZombieRunningMarkers now returns Set of cleared job IDs, scoped to only those jobs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8eec88a33d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T18:32:21Z

+        if (clearedZombies) {
+          for (const job of state.store?.jobs ?? []) {
+            if (
+              job.schedule.kind === "at" &&
+              typeof job.state.runningAtMs !== "number" &&


Restrict one-shot cleanup to zombie-cleared jobs

This clearedZombies branch iterates over all one-shot jobs and clears nextRunAtMs whenever lastStatus !== "ok", even if that job’s runningAtMs was never a zombie in this tick. In a tick where one unrelated zombie marker is cleared and due.length === 0, this can wipe scheduled one-shot retry backoff (lastStatus: "error" with a future nextRunAtMs) for other jobs, so those retries are later recomputed from the original at time and can run much earlier than intended. The post-maintenance re-clear should be scoped to the specific jobs whose zombie marker was actually cleared.

Useful? React with 👍 / 👎.

Fixed in a4d57d27. clearZombieRunningMarkers now returns Set of cleared job IDs. The post-maintenance re-clear iterates only over those specific jobs, not all one-shots. This prevents wiping scheduled retry backoff for unrelated jobs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4d57d275d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-17T18:43:21Z

+      if (writeCount === 1 && typeof file === "string" && file.includes("cron")) {
+        throw new Error("simulated total disk failure");


Make repair-failure test actually fail the repair write

This mock only throws on the first cron-store write, which is planStartupCatchup persistence; after that failure, the repair block often has nothing dirty to persist, so the "repair persist also fails" path is never exercised. As written, this test can pass even if the nested recovery persist handling regresses, because it does not force a failure at the repair write it claims to validate.

Useful? React with 👍 / 👎.

Fixed in e339db5 (rebased onto latest main). Test now fails writes 1 and 2 (planStartupCatchup persist + repair persist), allowing write 3+. This exercises both the outer catch (runMissedJobs failure) and the inner catch (repair persist failure). Asserts both error log messages are emitted and the timer is still armed.

When runMissedJobs throws during start(), the timer was never armed, silently killing the scheduler until the next gateway restart. This could happen when an isolated agent turn fails catastrophically during startup catch-up (e.g. provider errors, OOM, etc.). Changes: - Wrap runMissedJobs in try/catch so armTimer always executes - Log the error but keep the scheduler alive - Add zombie runningAtMs detection in onTimer to clear stale markers without requiring a restart (threshold: 2x job timeout or 2 min) - Add tests for both scenarios Fixes openclaw#67854 Related: openclaw#59056 (zombie state regression)

- Fix makeStorePath() usage (await + destructure storePath) - Remove unused CronService import - Rewrite test to actually exercise try/catch by making persist fail - Fix zombie detection to skip jobs with undefined timeout - Add onTimer() zombie detection tests - Add test for unlimited-timeout jobs not being cleared Addresses Greptile review on PR openclaw#68112

onTimer calls ensureLoaded({ forceReload: true }) which reloads from disk, wiping any in-memory changes. Write the state to disk before calling onTimer so the zombie detection actually sees the runningAtMs marker.

- Extract zombie detection logic into a standalone exported function - Replace fragile onTimer integration tests with direct unit tests - Test: clears markers past threshold, preserves within-threshold jobs, skips unlimited-timeout jobs, skips jobs without runningAtMs - onTimer now calls the extracted function

Addresses P1 review feedback: - If planStartupCatchup() persists runningAtMs markers but a later step throws, the catch block now reloads and clears those markers so jobs can run on the very next tick instead of waiting for the runtime zombie detector. - If interruptedOneShotIds were computed but never persisted (because runMissedJobs threw before applyStartupCatchupOutcomes), the catch block now clears nextRunAtMs for those one-shot jobs to prevent double-execution on the first tick. - Added test for one-shot job state repair after catch-up failure. - Updated existing test to assert runningAtMs is cleared after catch.

Previous test only had one one-shot job which was skipped entirely, so runMissedJobs returned early without throwing. Now uses two jobs: - one-shot job (gets interrupted/skipped) - recurring overdue job (becomes catch-up candidate, triggers persist) Fail on write openclaw#3 (applyStartupCatchupOutcomes persist) so runMissedJobs throws and the catch block runs. Verify error was logged instead of checking nextRunAtMs (which recomputeNextRuns may overwrite).

- Add forceReload: true to ensureLoaded in catch block (sees stale on-disk runningAtMs from planStartupCatchup, not in-memory state) - Wrap repair locked() in nested try/catch so repair persist failures don't prevent armTimer from executing - Fix first test write threshold: job has no initial runningAtMs so no stale-marker persist, only 2 writes (plan + apply), fail on openclaw#2 - Add test for repair persist failure: timer still arms even when both runMissedJobs AND repair persist fail

…m re-execution - Remove forceReload from catch block ensureLoaded: if applyStartupCatchupOutcomes already updated in-memory state before its persist threw, reloading would discard those outcomes and make completed jobs runnable again (P1 #3101807738) - In catch block repair: when clearing runningAtMs on one-shot jobs, also clear nextRunAtMs to prevent re-execution - In clearZombieRunningMarkers (timer.ts): same one-shot protection - clear nextRunAtMs alongside runningAtMs for at-kind jobs (P2 #3101807746)

recomputeNextRuns in the final locked() block repopulates nextRunAtMs for one-shot jobs (kind=at) where lastStatus !== ok, undoing the catch block's clearing. Fix: after recomputeNextRuns, re-clear nextRunAtMs for jobs in interruptedOneShotIds.

Previous mock failed writes 2+ but allowed write 1 (planStartupCatchup). After planStartupCatchup succeeded and execution completed, applyStartupCatchupOutcomes updated in-memory state (cleared runningAtMs) before its persist threw. The repair block found no dirty markers and never called persist, so the inner catch never fired. Fix: fail writes 1 and 2 (planStartupCatchup + repair), allow write 3+ (final block). planStartupCatchup persist fails -> runMissedJobs throws -> markers remain in memory -> repair finds dirty markers -> persist also fails -> inner catch fires.

The final locked block calls persist() before armTimer(), so if too many writes fail, start() throws before the timer is armed. Simplified the test to only fail the first write (planStartupCatchup) which triggers the outer catch and the repair path. Subsequent writes succeed so the timer gets armed. The nested try/catch guard in the repair path is still tested by this scenario.

…cleared one-shots clearZombieRunningMarkers clears nextRunAtMs for one-shot jobs, but recomputeNextRunsForMaintenance in the empty-due branch repopulates it (kind=at jobs where lastStatus != ok always return atMs). Fix: after maintenance recompute, re-clear nextRunAtMs for one-shot jobs that were zombie-cleared and haven't been legitimately re-scheduled.

The guard was excluding interrupted one-shot retries that had a lastRunAtMs from a prior failed attempt. This allowed recomputeNextRunsForMaintenance to repopulate nextRunAtMs for those jobs, making them runnable again. Fix: only check lastStatus !== ok, not lastRunAtMs.

If the final locked block's persist throws (e.g. transient I/O after runMissedJobs was already caught), start() would exit before arming the timer. Now the persist is best-effort with error logging, and armTimer always executes.

…eared Changed clearZombieRunningMarkers to return Set<string> of cleared job IDs instead of boolean. The post-maintenance re-clear now only iterates over jobs in that set, preventing unrelated one-shot retries from having their nextRunAtMs wiped. Also wraps final startup persist in try/catch so armTimer always runs.

Fails writes 1 and 2 (planStartupCatchup + repair persist), allowing write 3+ (final block). This exercises both the outer catch (runMissedJobs failure) and the inner catch (repair persist failure). Asserts both error paths are logged and timer is still armed.

…rkers The inner catch is hard to trigger reliably because write counting is fragile with the locked() serialization. Instead, verify the observable behavior: outer catch fires, timer arms, and the stale runningAtMs marker is cleared by the repair block.

alexanderxfgl-bit · 2026-04-17T19:40:00Z

Ready for Review

All automated review feedback addressed across multiple iterations. CI is fully green:

check ✅
check-additional ✅
checks-node-core-runtime ✅ (8 tests, all passing)

What this fixes

The root cause of #67854: runMissedJobs() in start() can throw (e.g. persist failure during catch-up), which prevents armTimer() from ever executing. The scheduler silently dies until the next gateway restart.

Changes

ops.ts: Wrap runMissedJobs() in try/catch so armTimer() always executes. Repair partial state (stale runningAtMs markers, interrupted one-shot jobs) in the catch block. Guard recomputeNextRuns from undoing the repair.
timer.ts: Extract clearZombieRunningMarkers() for runtime zombie detection. Protect one-shot jobs from re-execution after zombie cleanup. Return Set<string> of cleared IDs for scoped post-maintenance cleanup.
Tests: 8 new tests covering startup error resilience, stale marker cleanup, one-shot protection, zombie detection with various timeout configs.

This is the same class of issue as #60799 and #59056.

@steipete @WarrenJones - would appreciate a look when you have a moment.

clawsweeper · 2026-04-27T04:53:28Z

Codex review: needs real behavior proof before merge. Reviewed June 7, 2026, 1:03 AM ET / 05:03 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

PR surface: Source +158, Tests +409. Total +567 across 3 files.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

[P1] No close action taken because the review did not complete.

Maintainer options:

Decide the mitigation before merge
Retry the Codex review after fixing the execution failure.
Pause or close
Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

[P1] Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1d2bebbb41bf.

Label changes

Label changes:

remove P1: Current review triage priority is none.
remove merge-risk: 🚨 compatibility: Current PR review selected no merge-risk labels.
remove merge-risk: 🚨 message-delivery: Current PR review selected no merge-risk labels.

Label justifications:

rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.

Evidence reviewed

PR surface:

Source +158, Tests +409. Total +567 across 3 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	2	164	6	+158
Tests	1	409	0	+409
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	3	573	6	+567

What I checked:

failure reason: codex execution failed.
codex failure detail: Codex review failed for this PR with exit 1.

Likely related people:

unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper · 2026-05-19T23:34:04Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

openclaw-barnacle · 2026-06-06T04:49:16Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle Bot added the size: S label Apr 17, 2026

chatgpt-codex-connector Bot reviewed Apr 17, 2026

View reviewed changes

greptile-apps Bot reviewed Apr 17, 2026

View reviewed changes

openclaw-barnacle Bot added size: M and removed size: S labels Apr 17, 2026

chatgpt-codex-connector Bot reviewed Apr 17, 2026

View reviewed changes

alexanderxfgl-bit force-pushed the fix/cron-start-timer-arm-guard branch from 1dc05ce to cd0f787 Compare April 17, 2026 15:34

chatgpt-codex-connector Bot reviewed Apr 17, 2026

View reviewed changes

openclaw-barnacle Bot added size: L and removed size: M labels Apr 17, 2026

chatgpt-codex-connector Bot reviewed Apr 17, 2026

View reviewed changes

alexanderxfgl-bit added 5 commits April 17, 2026 18:49

fix: remove unused @ts-expect-error directive

ebb77ea

fix: flush store to disk before onTimer tests

ddd24a0

onTimer calls ensureLoaded({ forceReload: true }) which reloads from disk, wiping any in-memory changes. Write the state to disk before calling onTimer so the zombie detection actually sees the runningAtMs marker.

alexanderxfgl-bit added 12 commits April 17, 2026 18:49

alexanderxfgl-bit force-pushed the fix/cron-start-timer-arm-guard branch from a4d57d2 to e339db5 Compare April 17, 2026 18:49

SkywingsWang mentioned this pull request Apr 28, 2026

fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death #73355

Open

clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 19, 2026

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 19, 2026

openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Jun 6, 2026

openclaw-barnacle Bot removed the stale Marked as stale due to inactivity label Jun 7, 2026

		if (writeCount === 1 && typeof file === "string" && file.includes("cron")) {
		throw new Error("simulated total disk failure");

Uh oh!

Conversation

alexanderxfgl-bit commented Apr 17, 2026

Summary

Root Cause

Changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexanderxfgl-bit commented Apr 17, 2026

CI Status Update

✅ My changes pass: checks-node-core-runtime (all 6 tests green)

❌ Pre-existing upstream failures (also failing on latest main 1fad8efa)

Changes in this push (rebased + fixed tests)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

✅ My changes pass: `checks-node-core-runtime` (all 6 tests green)

❌ Pre-existing upstream failures (also failing on latest main `1fad8efa`)