fix(cron): mark active-jobs on manual-run path to suppress transient lost marker by Feelw00 · Pull Request #78243 · openclaw/openclaw

Feelw00 · 2026-05-06T03:16:38Z

Summary

Problem: Manual cron runs (openclaw cron run <id> --force and the matching RPC / agent-tool surface) that exceed TASK_RECONCILE_GRACE_MS (5 min) surface a transient lost task-registry status plus a Background task lost system message in the grace window before resolveDurableCronTaskRecovery (1fae716, merged 2026-04-26) reconciles. Final state is correct, but the user triggering the run sees an inaccurate intermediate state.
Why it matters: Force-mode agentTurn manual runs can legitimately run up to AGENT_TURN_SAFETY_TIMEOUT_MS (60 min); the default-mode timeout is 10 min — both well past the 5-minute grace.
What changed: Mirror the markCronJobActive/clearCronJobActive pair (introduced for runDueJob/executeJob by fix: honor exec approval security and clean up stale tasks #60310) into prepareManualRun + finishPreparedManualRun, so task-registry.maintenance.ts hasBackingSession returns true for the duration of the manual run.
What did NOT change: runStartupCatchupCandidate is intentionally untouched — deferAgentTurnJobs:true (7877182) reroutes long-running startup catchups to runDueJob/executeJob (already wired), and the merged 1fae716a04 covers the residual non-agentTurn case from a different axis. No public API surface change.

Change Type

Bug fix

Scope

Touched: src/cron/service/ops.ts (3 lines added — import + markCronJobActive + try/finally wrap with clearCronJobActive)
Test: src/cron/active-jobs-manual-run.test.ts (new, 160 lines, 2 cases)

Linked Issue

Fixes #78233

Root Cause

task-registry.maintenance.ts hasBackingSession for the cron branch under isCronRuntimeAuthoritative()=true (production default) depends solely on isCronJobActive(jobId). prepareManualRun (ops.ts:654) creates a manual tryCreateManualTaskRun task but never calls markCronJobActive; finishPreparedManualRun (ops.ts:702) likewise never clears it. With TASK_RECONCILE_GRACE_MS = 5 min and force-mode manual runs reaching up to 60 min, the sweeper marks the still-running task lost and emits the Background task lost system message before the run completes.

Regression Test Plan

New test file: src/cron/active-jobs-manual-run.test.ts — two cases against the production hot-path (cron.run("<id>", "force") direct invocation, no internal-API rerouting):
1. Success path: assert isCronJobActive transitions true (mid-run) → false (after run resolves).
2. Inner-throw path: assert isCronJobActive is also cleared by the finally block when the inner agent run rejects.
Existing tests (src/cron/service.test.ts, src/cron/service.restart-catchup.test.ts) untouched.

Security Impact

None. Internal cron service state only — no auth, secrets, network, or cross-process surfaces. No CODEOWNERS-restricted paths.

Repro + Verification

Local: pnpm vitest run src/cron/active-jobs-manual-run.test.ts → 2/2 pass
Type check: pnpm tsgo:core:test → exit 0
Pre-fix behaviour reproducible by reverting the ops.ts change in this PR; the new test then fails because isCronJobActive returns false mid-run.

Evidence

Failing test before fix (locally verified by reverting just the ops.ts change): both new cases fail at the mid-run expect(isCronJobActive(...)).toBe(true) assertion. Passing after fix: both green.

Human Verification

Code paths traced: prepareManualRun → tryCreateManualTaskRun → cron.run → finishPreparedManualRun → applyJobResult. Confirmed against task-registry.maintenance.ts hasBackingSession (cron branch) and resolveCronJobStateRecovery (lastRunAtMs matching).

Review Conversations

This is a narrower follow-up to closed PR #71040 (which also touched runStartupCatchupCandidate). The full PR was closed after a 5-agent pre-PR cross-review found that 1fae716a04 had already addressed the same axis from the sweeper side, and deferAgentTurnJobs:true removed the agentTurn startup-catchup scenario entirely. This PR retains only the producer-side gap that 1fae716a04 does not close: the transient lost marker during the 5-minute grace window before sweeper recovery reconciles.

Compatibility / Migration

None. Internal-only API mirroring an existing contract.

Risks and Mitigations

Concurrent manual runs: prepareManualRun already prevents re-entry (runningAtMs guard); even if mark/clear were called twice for the same jobId the activeJobIds set is idempotent.
Inner throws: covered by try/finally and the dedicated regression test.
Interaction with 1fae716a04: orthogonal — sweeper recovery still applies; this fix only suppresses the transient lost surface inside the grace window.

[AI-assisted, fully tested]

Real behavior proof

Behavior or issue addressed: Without this patch, manual cron runs that exceed TASK_RECONCILE_GRACE_MS (5 min) are marked lost by task-registry.maintenance.ts hasBackingSession (cron branch under isCronRuntimeAuthoritative()=true) because no producer-side markCronJobActive is called from prepareManualRun. With this patch, the same long-running manual run keeps isCronJobActive=true for the duration and is never marked lost by the sweeper.
Real environment tested: macOS 25.4 (darwin arm64), Node 23.9, OpenClaw 2026.5.5 from this branch, locally built. Local gateway (openclaw gateway run, loopback, port 18789) configured with openclaw onboard --non-interactive --accept-risk --mode local --auth-choice skip, then OpenAI Codex OAuth bound via openclaw models auth login --provider openai-codex (ChatGPT Plus subscription, profile openai-codex:<email>, agentRuntime: codex, agents.defaults.model.primary: openai-codex/gpt-5.5). Same gateway, same cron job, same codex auth across both builds; only src/cron/service/ops.ts differs.

Exact steps or command run after this patch:

$ pnpm build                                       # build this branch (with mark/clear)
$ openclaw cron add --name agentturn-mark-clear-demo \
    --every 1h \
    --message "Run the shell command 'sleep 420 && echo done' using your exec tool, then reply 'completed'." \
    --session isolated --thinking high --timeout-seconds 900 \
    --tools exec --model openai-codex/gpt-5.5
$ openclaw cron run <job-id>
# then wait > 5 min and observe sqlite task_runs
$ sqlite3 ~/.openclaw/tasks/runs.sqlite \
    "SELECT task_id, status, started_at, ended_at, error FROM task_runs \
     WHERE source_id='<job-id>' ORDER BY started_at DESC LIMIT 1"

For the without-fix comparison, the same steps were run on a build where src/cron/service/ops.ts was reverted to the base commit (git checkout ea391c6df2 -- src/cron/service/ops.ts) before pnpm build.

Evidence after fix: live task_runs rows from ~/.openclaw/tasks/runs.sqlite (same cron job, two builds).

# Build A — without this patch (ops.ts at base, no markCronJobActive in prepareManualRun)
task_id     : 3084baa3-b848-4f6c-be8a-9e1a21e925f3
status      : lost
started_at  : 1778047111346  (14:58:30)
ended_at    : 1778047469142  (15:04:29, T0+5m59s)
last_event_at: 1778047469142
error       : backing session missing

# Build B — with this patch (ops.ts with markCronJobActive + try/finally clearCronJobActive)
task_id     : c583e334-c863-4da7-a637-acd11cdb8393
status      : failed                       <-- not 'lost'
started_at  : 1778046431855  (14:47:11)
ended_at    : 1778046871009  (14:54:31, T0+7m20s)
last_event_at: 1778046871009
error       : Channel is required (no configured channels detected)...   <-- unrelated delivery config, the run itself finalized normally

Gateway log excerpts (cron + agent activity):

# Build B (with patch) — codex agent stalled at the ChatGPT rate limit, sweeper grace exceeded
[diagnostic] stalled session: ... age=131s ... lastProgress=codex_app_server:notification:account/rateLimits/updated cronJob="agentturn-mark-clear-demo"
[diagnostic] stalled session: ... age=161s
[diagnostic] stalled session: ... age=191s
[diagnostic] stalled session: ... age=221s
[diagnostic] stalled session: ... age=251s
[diagnostic] stalled session: ... age=281s
# No "Background task lost" message emitted. Run finalized as 'failed' (delivery channel unrelated) without ever flipping to 'lost'.

Observed result after fix: With the patch, task_runs.status for a 7-minute manual cron run stays out of lost and finalizes as failed/succeeded directly. Without the patch, task-registry.maintenance.ts hasBackingSession returns false at the first sweeper tick after TASK_RECONCILE_GRACE_MS (5 min) and marks the still-running cron task lost with error="backing session missing". The only difference between the two runs above is the four-line ops.ts change in this PR; cron job, codex auth, gateway config, and ChatGPT subscription are identical.
What was not tested: The full chain from Background task lost system message to its retroactive reconciliation by resolveDurableCronTaskRecovery (1fae716a04) was not visually exercised in this setup, since the Codex-driven runs hit the ChatGPT subscription rate limit before normal completion. The producer-side mark/clear gap that this PR closes is fully exercised: the lost status flip happens (without patch) or is suppressed (with patch) deterministically at TASK_RECONCILE_GRACE_MS, which is the user-visible step the patch targets.

clawsweeper · 2026-05-06T03:19:58Z

Codex review: needs maintainer review before merge.

Summary
The branch wires active-job mark/clear bookkeeping into manual cron runs, adds success and inner-error regression coverage, updates the changelog, and changes one Codex test sort assertion to toSorted().

Reproducibility: yes. Current-main source shows the manual-run producer path creates a cron task without setting the active-job marker while task maintenance relies on isCronJobActive(jobId), and the PR body supplies before/after live sqlite and gateway-log proof for the long-running manual run case.

Real behavior proof
Sufficient (terminal): The PR body includes terminal-style before/after sqlite output and gateway logs showing the unpatched long manual run becoming lost and the patched run avoiding that transient state.

Next step before merge
No repair lane is needed because the PR already contains the focused code change, regression coverage, and sufficient real behavior proof; remaining work is normal maintainer merge and CI handling.

Security
Cleared: The diff changes internal cron bookkeeping, tests, changelog text, and one test-only sort assertion; it does not alter dependencies, workflows, secrets, permissions, network access, or package resolution.

Review details

Best possible solution:

Land this PR or an equivalent patch after exact-head CI is green, then close #78233 as implemented.

Do we have a high-confidence way to reproduce the issue?

Yes. Current-main source shows the manual-run producer path creates a cron task without setting the active-job marker while task maintenance relies on isCronJobActive(jobId), and the PR body supplies before/after live sqlite and gateway-log proof for the long-running manual run case.

Is this the best way to solve the issue?

Yes. Reusing the existing markCronJobActive/clearCronJobActive contract on the missing manual-run path is the narrowest maintainable fix and avoids API, config, dependency, or policy changes.

What I checked:

Current main manual-run gap: prepareManualRun creates a manual cron task run and returns a prepared execution object without calling markCronJobActive, while finishPreparedManualRun has no active-marker cleanup path on current main. (src/cron/service/ops.ts:684, ee5b06f9fe99)
Task maintenance liveness contract: For authoritative cron runtime tasks, hasBackingSession uses isCronJobActive(jobId) as the backing-session signal, so a long manual run without that marker can be treated as missing a backing session after the grace window. (src/tasks/task-registry.maintenance.ts:447, ee5b06f9fe99)
Existing scheduled-run precedent: Scheduled cron execution already calls markCronJobActive(job.id) when a due job starts and clears the marker through applyOutcomeToStoredJob, which is the producer-side contract this PR extends to manual runs. (src/cron/service/timer.ts:1039, ee5b06f9fe99)
PR implementation: The PR imports markCronJobActive/clearCronJobActive, marks the job after manual task creation, and wraps finishPreparedManualRun in a finally that clears the marker. (src/cron/service/ops.ts:687, 88cbbc84c1c3)
Regression coverage: The added test drives cron.run(..., "force") and asserts the active marker is true while the manual run is in flight and false after both success and inner failure paths. (src/cron/active-jobs-manual-run.test.ts:38, 88cbbc84c1c3)
Real behavior proof: The PR body includes terminal-style live sqlite rows and gateway log excerpts comparing the same long manual cron run before and after the patch: unpatched becomes lost; patched stays out of lost and finalizes without the transient marker. (88cbbc84c1c3)

Likely related people:

steipete: Authored or committed the active cron task-maintenance contract, durable cron recovery, startup cron deferral, and earlier manual cron timeout work touching the same cron/task-registry surface. (role: introduced behavior and recent area contributor; confidence: high; commits: 7d1575b5df79, 1fae716a04e5, 7877182b6f59; files: src/cron/active-jobs.ts, src/cron/service/timer.ts, src/cron/service/ops.ts)
lml2468: Opened the merged pull request credited by commit 7d1575b for the active cron task-maintenance contract that this PR extends to the manual-run path. (role: related feature contributor; confidence: high; commits: 7d1575b5df79, 19036ef3948b; files: src/cron/active-jobs.ts, src/cron/service/timer.ts, src/tasks/task-registry.maintenance.ts)
yasumorishima: Authored a recent merged fix in src/cron/service/ops.ts for stale runningAtMs handling in cron.run(), overlapping the manual-run preflight path reviewed here. (role: recent manual-run path contributor; confidence: medium; commits: be8930d6f9eb; files: src/cron/service/ops.ts, src/cron/service.issue-regressions.test.ts)

Remaining risk / open question:

Exact-head GitHub checks were still queued at inspection time, so required CI should finish before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against ee5b06f9fe99.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/25673044716
Updated: 2026-05-11T13:34:20.257Z

…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]

…x comparison)

steipete · 2026-05-11T13:41:06Z

Landed in c1b59a9.

Best-fix review: this is the right scoped fix. Scheduled cron runs already marked active in the registry; the forced/manual run path created a task run without marking it active, so long manual runs could look transiently lost before durable recovery reconciled. Keeping the mark/clear in the manual run path matches the existing scheduled-run lifecycle without adding startup/catchup refactor risk.

Verification:

pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts
pnpm run lint:extensions:bundled
pnpm test extensions/codex/src/app-server/side-question.test.ts
CI run: https://github.com/openclaw/openclaw/actions/runs/25673031776

Thanks @Feelw00.

Feelw00 · 2026-05-12T00:06:35Z

Thanks @steipete

Summary: - Mark forced/manual cron runs active in the task registry until completion and clear them in finally. - Add regression coverage for manual run success and failure cleanup. - Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main. Fixes openclaw#78233 Verification: - pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts - pnpm run lint:extensions:bundled - pnpm test extensions/codex/src/app-server/side-question.test.ts - CI: https://github.com/openclaw/openclaw/actions/runs/25673031776 Co-authored-by: Feelw00 <dhrtn1006@naver.com>

openclaw-barnacle Bot added size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026

Feelw00 force-pushed the fix/cron-manual-run-mark-clear branch from 5c4458d to e0f0de3 Compare May 6, 2026 08:10

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026

Feelw00 force-pushed the fix/cron-manual-run-mark-clear branch from e0f0de3 to 289d197 Compare May 11, 2026 00:17

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

steipete force-pushed the fix/cron-manual-run-mark-clear branch from 289d197 to ef3f288 Compare May 11, 2026 13:05

steipete added a commit to Feelw00/openclaw that referenced this pull request May 11, 2026

docs: update changelog for manual cron run fix (openclaw#78243)

ef3f288

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

steipete added a commit to Feelw00/openclaw that referenced this pull request May 11, 2026

docs: update changelog for manual cron run fix (openclaw#78243)

98bfa12

steipete force-pushed the fix/cron-manual-run-mark-clear branch from ef3f288 to 17e681b Compare May 11, 2026 13:11

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

Feelw00 and others added 6 commits May 11, 2026 14:20

chore: trigger CI after PR body update with Real behavior proof

9bb74e1

chore: re-trigger CI after Real behavior proof field-name fix

b1e7128

chore: re-run CI after Real behavior proof v2 (with-fix vs without-fi…

0624785

…x comparison)

chore: re-run CI after Real behavior proof v2 header fix

45c6f26

docs: update changelog for manual cron run fix (openclaw#78243)

a537cd6

steipete force-pushed the fix/cron-manual-run-mark-clear branch from 17e681b to a537cd6 Compare May 11, 2026 13:21

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

test: satisfy no-array-sort lint

88cbbc8

openclaw-barnacle Bot added the extensions: codex label May 11, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

steipete merged commit c1b59a9 into openclaw:main May 11, 2026
91 of 93 checks passed

steipete mentioned this pull request May 11, 2026

cron: transient 'lost' marker on long-running manual runs before sweeper recovery #78233

Closed

github-actions Bot mentioned this pull request May 11, 2026

📡 Upstream Digest — 2026-05-11 14:31 UTC curtismercier/openclaw-mods#830

Open

This was referenced May 12, 2026

Ecosystem Digest — 2026-05-12 Haderach-Ram/openclaw-radar#4

Open

Ecosystem Digest — 2026-05-12 Haderach-Ram/openclaw-radar#5

Open

clawsweeper Bot mentioned this pull request May 28, 2026

fix(cron): guard task liveness during startup #80707

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(cron): mark active-jobs on manual-run path to suppress transient lost marker#78243

fix(cron): mark active-jobs on manual-run path to suppress transient lost marker#78243
steipete merged 7 commits into
openclaw:mainfrom
Feelw00:fix/cron-manual-run-mark-clear

Feelw00 commented May 6, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

steipete commented May 11, 2026

Uh oh!

Feelw00 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Feelw00 commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change Type

Scope

Linked Issue

Root Cause

Regression Test Plan

Security Impact

Repro + Verification

Evidence

Human Verification

Review Conversations

Compatibility / Migration

Risks and Mitigations

Real behavior proof

Uh oh!

clawsweeper Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

steipete commented May 11, 2026

Uh oh!

Feelw00 commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Feelw00 commented May 6, 2026 •

edited

Loading

clawsweeper Bot commented May 6, 2026 •

edited

Loading