Skip to content

fix(cron): mark active-jobs on manual-run path to suppress transient lost marker#78243

Merged
steipete merged 7 commits into
openclaw:mainfrom
Feelw00:fix/cron-manual-run-mark-clear
May 11, 2026
Merged

fix(cron): mark active-jobs on manual-run path to suppress transient lost marker#78243
steipete merged 7 commits into
openclaw:mainfrom
Feelw00:fix/cron-manual-run-mark-clear

Conversation

@Feelw00

@Feelw00 Feelw00 commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: Manual cron runs (openclaw cron run <id> --force and the matching RPC / agent-tool surface) that exceed TASK_RECONCILE_GRACE_MS (5 min) surface a transient lost task-registry status plus a Background task lost system message in the grace window before resolveDurableCronTaskRecovery (1fae716, merged 2026-04-26) reconciles. Final state is correct, but the user triggering the run sees an inaccurate intermediate state.
  • Why it matters: Force-mode agentTurn manual runs can legitimately run up to AGENT_TURN_SAFETY_TIMEOUT_MS (60 min); the default-mode timeout is 10 min — both well past the 5-minute grace.
  • What changed: Mirror the markCronJobActive/clearCronJobActive pair (introduced for runDueJob/executeJob by fix: honor exec approval security and clean up stale tasks #60310) into prepareManualRun + finishPreparedManualRun, so task-registry.maintenance.ts hasBackingSession returns true for the duration of the manual run.
  • What did NOT change: runStartupCatchupCandidate is intentionally untouched — deferAgentTurnJobs:true (7877182) reroutes long-running startup catchups to runDueJob/executeJob (already wired), and the merged 1fae716a04 covers the residual non-agentTurn case from a different axis. No public API surface change.

Change Type

  • Bug fix

Scope

  • Touched: src/cron/service/ops.ts (3 lines added — import + markCronJobActive + try/finally wrap with clearCronJobActive)
  • Test: src/cron/active-jobs-manual-run.test.ts (new, 160 lines, 2 cases)

Linked Issue

Fixes #78233

Root Cause

task-registry.maintenance.ts hasBackingSession for the cron branch under isCronRuntimeAuthoritative()=true (production default) depends solely on isCronJobActive(jobId). prepareManualRun (ops.ts:654) creates a manual tryCreateManualTaskRun task but never calls markCronJobActive; finishPreparedManualRun (ops.ts:702) likewise never clears it. With TASK_RECONCILE_GRACE_MS = 5 min and force-mode manual runs reaching up to 60 min, the sweeper marks the still-running task lost and emits the Background task lost system message before the run completes.

Regression Test Plan

  • New test file: src/cron/active-jobs-manual-run.test.ts — two cases against the production hot-path (cron.run("<id>", "force") direct invocation, no internal-API rerouting):
    1. Success path: assert isCronJobActive transitions true (mid-run) → false (after run resolves).
    2. Inner-throw path: assert isCronJobActive is also cleared by the finally block when the inner agent run rejects.
  • Existing tests (src/cron/service.test.ts, src/cron/service.restart-catchup.test.ts) untouched.

Security Impact

None. Internal cron service state only — no auth, secrets, network, or cross-process surfaces. No CODEOWNERS-restricted paths.

Repro + Verification

  • Local: pnpm vitest run src/cron/active-jobs-manual-run.test.ts → 2/2 pass
  • Type check: pnpm tsgo:core:test → exit 0
  • Pre-fix behaviour reproducible by reverting the ops.ts change in this PR; the new test then fails because isCronJobActive returns false mid-run.

Evidence

Failing test before fix (locally verified by reverting just the ops.ts change): both new cases fail at the mid-run expect(isCronJobActive(...)).toBe(true) assertion. Passing after fix: both green.

Human Verification

Code paths traced: prepareManualRuntryCreateManualTaskRuncron.runfinishPreparedManualRunapplyJobResult. Confirmed against task-registry.maintenance.ts hasBackingSession (cron branch) and resolveCronJobStateRecovery (lastRunAtMs matching).

Review Conversations

This is a narrower follow-up to closed PR #71040 (which also touched runStartupCatchupCandidate). The full PR was closed after a 5-agent pre-PR cross-review found that 1fae716a04 had already addressed the same axis from the sweeper side, and deferAgentTurnJobs:true removed the agentTurn startup-catchup scenario entirely. This PR retains only the producer-side gap that 1fae716a04 does not close: the transient lost marker during the 5-minute grace window before sweeper recovery reconciles.

Compatibility / Migration

None. Internal-only API mirroring an existing contract.

Risks and Mitigations

  • Concurrent manual runs: prepareManualRun already prevents re-entry (runningAtMs guard); even if mark/clear were called twice for the same jobId the activeJobIds set is idempotent.
  • Inner throws: covered by try/finally and the dedicated regression test.
  • Interaction with 1fae716a04: orthogonal — sweeper recovery still applies; this fix only suppresses the transient lost surface inside the grace window.

[AI-assisted, fully tested]

Real behavior proof

  • Behavior or issue addressed: Without this patch, manual cron runs that exceed TASK_RECONCILE_GRACE_MS (5 min) are marked lost by task-registry.maintenance.ts hasBackingSession (cron branch under isCronRuntimeAuthoritative()=true) because no producer-side markCronJobActive is called from prepareManualRun. With this patch, the same long-running manual run keeps isCronJobActive=true for the duration and is never marked lost by the sweeper.

  • Real environment tested: macOS 25.4 (darwin arm64), Node 23.9, OpenClaw 2026.5.5 from this branch, locally built. Local gateway (openclaw gateway run, loopback, port 18789) configured with openclaw onboard --non-interactive --accept-risk --mode local --auth-choice skip, then OpenAI Codex OAuth bound via openclaw models auth login --provider openai-codex (ChatGPT Plus subscription, profile openai-codex:<email>, agentRuntime: codex, agents.defaults.model.primary: openai-codex/gpt-5.5). Same gateway, same cron job, same codex auth across both builds; only src/cron/service/ops.ts differs.

  • Exact steps or command run after this patch:

    $ pnpm build                                       # build this branch (with mark/clear)
    $ openclaw cron add --name agentturn-mark-clear-demo \
        --every 1h \
        --message "Run the shell command 'sleep 420 && echo done' using your exec tool, then reply 'completed'." \
        --session isolated --thinking high --timeout-seconds 900 \
        --tools exec --model openai-codex/gpt-5.5
    $ openclaw cron run <job-id>
    # then wait > 5 min and observe sqlite task_runs
    $ sqlite3 ~/.openclaw/tasks/runs.sqlite \
        "SELECT task_id, status, started_at, ended_at, error FROM task_runs \
         WHERE source_id='<job-id>' ORDER BY started_at DESC LIMIT 1"
    

    For the without-fix comparison, the same steps were run on a build where src/cron/service/ops.ts was reverted to the base commit (git checkout ea391c6df2 -- src/cron/service/ops.ts) before pnpm build.

  • Evidence after fix: live task_runs rows from ~/.openclaw/tasks/runs.sqlite (same cron job, two builds).

    # Build A — without this patch (ops.ts at base, no markCronJobActive in prepareManualRun)
    task_id     : 3084baa3-b848-4f6c-be8a-9e1a21e925f3
    status      : lost
    started_at  : 1778047111346  (14:58:30)
    ended_at    : 1778047469142  (15:04:29, T0+5m59s)
    last_event_at: 1778047469142
    error       : backing session missing
    
    # Build B — with this patch (ops.ts with markCronJobActive + try/finally clearCronJobActive)
    task_id     : c583e334-c863-4da7-a637-acd11cdb8393
    status      : failed                       <-- not 'lost'
    started_at  : 1778046431855  (14:47:11)
    ended_at    : 1778046871009  (14:54:31, T0+7m20s)
    last_event_at: 1778046871009
    error       : Channel is required (no configured channels detected)...   <-- unrelated delivery config, the run itself finalized normally
    

    Gateway log excerpts (cron + agent activity):

    # Build B (with patch) — codex agent stalled at the ChatGPT rate limit, sweeper grace exceeded
    [diagnostic] stalled session: ... age=131s ... lastProgress=codex_app_server:notification:account/rateLimits/updated cronJob="agentturn-mark-clear-demo"
    [diagnostic] stalled session: ... age=161s
    [diagnostic] stalled session: ... age=191s
    [diagnostic] stalled session: ... age=221s
    [diagnostic] stalled session: ... age=251s
    [diagnostic] stalled session: ... age=281s
    # No "Background task lost" message emitted. Run finalized as 'failed' (delivery channel unrelated) without ever flipping to 'lost'.
    
  • Observed result after fix: With the patch, task_runs.status for a 7-minute manual cron run stays out of lost and finalizes as failed/succeeded directly. Without the patch, task-registry.maintenance.ts hasBackingSession returns false at the first sweeper tick after TASK_RECONCILE_GRACE_MS (5 min) and marks the still-running cron task lost with error="backing session missing". The only difference between the two runs above is the four-line ops.ts change in this PR; cron job, codex auth, gateway config, and ChatGPT subscription are identical.

  • What was not tested: The full chain from Background task lost system message to its retroactive reconciliation by resolveDurableCronTaskRecovery (1fae716a04) was not visually exercised in this setup, since the Codex-driven runs hit the ChatGPT subscription rate limit before normal completion. The producer-side mark/clear gap that this PR closes is fully exercised: the lost status flip happens (without patch) or is suppressed (with patch) deterministically at TASK_RECONCILE_GRACE_MS, which is the user-visible step the patch targets.

@openclaw-barnacle openclaw-barnacle Bot added size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
@clawsweeper

clawsweeper Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Summary
The branch wires active-job mark/clear bookkeeping into manual cron runs, adds success and inner-error regression coverage, updates the changelog, and changes one Codex test sort assertion to toSorted().

Reproducibility: yes. Current-main source shows the manual-run producer path creates a cron task without setting the active-job marker while task maintenance relies on isCronJobActive(jobId), and the PR body supplies before/after live sqlite and gateway-log proof for the long-running manual run case.

Real behavior proof
Sufficient (terminal): The PR body includes terminal-style before/after sqlite output and gateway logs showing the unpatched long manual run becoming lost and the patched run avoiding that transient state.

Next step before merge
No repair lane is needed because the PR already contains the focused code change, regression coverage, and sufficient real behavior proof; remaining work is normal maintainer merge and CI handling.

Security
Cleared: The diff changes internal cron bookkeeping, tests, changelog text, and one test-only sort assertion; it does not alter dependencies, workflows, secrets, permissions, network access, or package resolution.

Review details

Best possible solution:

Land this PR or an equivalent patch after exact-head CI is green, then close #78233 as implemented.

Do we have a high-confidence way to reproduce the issue?

Yes. Current-main source shows the manual-run producer path creates a cron task without setting the active-job marker while task maintenance relies on isCronJobActive(jobId), and the PR body supplies before/after live sqlite and gateway-log proof for the long-running manual run case.

Is this the best way to solve the issue?

Yes. Reusing the existing markCronJobActive/clearCronJobActive contract on the missing manual-run path is the narrowest maintainable fix and avoids API, config, dependency, or policy changes.

What I checked:

  • Current main manual-run gap: prepareManualRun creates a manual cron task run and returns a prepared execution object without calling markCronJobActive, while finishPreparedManualRun has no active-marker cleanup path on current main. (src/cron/service/ops.ts:684, ee5b06f9fe99)
  • Task maintenance liveness contract: For authoritative cron runtime tasks, hasBackingSession uses isCronJobActive(jobId) as the backing-session signal, so a long manual run without that marker can be treated as missing a backing session after the grace window. (src/tasks/task-registry.maintenance.ts:447, ee5b06f9fe99)
  • Existing scheduled-run precedent: Scheduled cron execution already calls markCronJobActive(job.id) when a due job starts and clears the marker through applyOutcomeToStoredJob, which is the producer-side contract this PR extends to manual runs. (src/cron/service/timer.ts:1039, ee5b06f9fe99)
  • PR implementation: The PR imports markCronJobActive/clearCronJobActive, marks the job after manual task creation, and wraps finishPreparedManualRun in a finally that clears the marker. (src/cron/service/ops.ts:687, 88cbbc84c1c3)
  • Regression coverage: The added test drives cron.run(..., "force") and asserts the active marker is true while the manual run is in flight and false after both success and inner failure paths. (src/cron/active-jobs-manual-run.test.ts:38, 88cbbc84c1c3)
  • Real behavior proof: The PR body includes terminal-style live sqlite rows and gateway log excerpts comparing the same long manual cron run before and after the patch: unpatched becomes lost; patched stays out of lost and finalizes without the transient marker. (88cbbc84c1c3)

Likely related people:

  • steipete: Authored or committed the active cron task-maintenance contract, durable cron recovery, startup cron deferral, and earlier manual cron timeout work touching the same cron/task-registry surface. (role: introduced behavior and recent area contributor; confidence: high; commits: 7d1575b5df79, 1fae716a04e5, 7877182b6f59; files: src/cron/active-jobs.ts, src/cron/service/timer.ts, src/cron/service/ops.ts)
  • lml2468: Opened the merged pull request credited by commit 7d1575b for the active cron task-maintenance contract that this PR extends to the manual-run path. (role: related feature contributor; confidence: high; commits: 7d1575b5df79, 19036ef3948b; files: src/cron/active-jobs.ts, src/cron/service/timer.ts, src/tasks/task-registry.maintenance.ts)
  • yasumorishima: Authored a recent merged fix in src/cron/service/ops.ts for stale runningAtMs handling in cron.run(), overlapping the manual-run preflight path reviewed here. (role: recent manual-run path contributor; confidence: medium; commits: be8930d6f9eb; files: src/cron/service/ops.ts, src/cron/service.issue-regressions.test.ts)

Remaining risk / open question:

  • Exact-head GitHub checks were still queued at inspection time, so required CI should finish before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against ee5b06f9fe99.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: supplied External PR includes structured after-fix real behavior proof. labels May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@Feelw00 Feelw00 force-pushed the fix/cron-manual-run-mark-clear branch from 5c4458d to e0f0de3 Compare May 6, 2026 08:10
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@Feelw00 Feelw00 force-pushed the fix/cron-manual-run-mark-clear branch from e0f0de3 to 289d197 Compare May 11, 2026 00:17
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@steipete steipete force-pushed the fix/cron-manual-run-mark-clear branch from 289d197 to ef3f288 Compare May 11, 2026 13:05
steipete added a commit to Feelw00/openclaw that referenced this pull request May 11, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
steipete added a commit to Feelw00/openclaw that referenced this pull request May 11, 2026
@steipete steipete force-pushed the fix/cron-manual-run-mark-clear branch from ef3f288 to 17e681b Compare May 11, 2026 13:11
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
Feelw00 and others added 6 commits May 11, 2026 14:20
…lost marker

Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus
markCronJobActive/clearCronJobActive so task-registry maintenance has a
backing-session signal for runtime='cron' tasks
(task-registry.maintenance.ts hasBackingSession). That patch wired the
pair into runDueJob and executeJob but left the manual-run path
(prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts)
without mark/clear.

Symptom: when a user triggers `openclaw cron run <id> --force` (or any
manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) —
common for force-mode `agentTurn` runs which can reach
AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the
active task `lost` and emits a `Background task lost` system message to
the session, even though the run is still progressing normally.

The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26)
reconciles terminal status retroactively from cron run-log +
store.lastRunStatus, but only after the run finishes. This patch
suppresses the transient `lost` marker by adding the producer-side
mark/clear pair, restoring symmetry with runDueJob/executeJob:

  * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun.
  * finishPreparedManualRun: wrap body in try/finally with
    clearCronJobActive(jobId).

Scope intentionally narrower than openclaw#71040 (closed):
  * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true`
    policy added in 7877182 reroutes long-running startup catchups to
    runDueJob/executeJob (already wired). Non-agentTurn startup catchups
    are theoretical hot-path per pre-PR cross-review.
  * No change to active-jobs.ts API.

Regression test (src/cron/active-jobs-manual-run.test.ts): two cases
exercising production hot-path via cron.run("<id>", "force") — success
and inner-throw — assert isCronJobActive transitions true→false around
the manual run.

Fixes openclaw#78233.

[AI-assisted, fully tested]
@steipete steipete force-pushed the fix/cron-manual-run-mark-clear branch from 17e681b to a537cd6 Compare May 11, 2026 13:21
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@steipete steipete merged commit c1b59a9 into openclaw:main May 11, 2026
91 of 93 checks passed
@steipete

Copy link
Copy Markdown
Contributor

Landed in c1b59a9.

Best-fix review: this is the right scoped fix. Scheduled cron runs already marked active in the registry; the forced/manual run path created a task run without marking it active, so long manual runs could look transiently lost before durable recovery reconciled. Keeping the mark/clear in the manual run path matches the existing scheduled-run lifecycle without adding startup/catchup refactor risk.

Verification:

Thanks @Feelw00.

@Feelw00

Feelw00 commented May 12, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @steipete

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
Summary:
- Mark forced/manual cron runs active in the task registry until completion and clear them in finally.
- Add regression coverage for manual run success and failure cleanup.
- Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main.

Fixes openclaw#78233

Verification:
- pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts
- pnpm run lint:extensions:bundled
- pnpm test extensions/codex/src/app-server/side-question.test.ts
- CI: https://github.com/openclaw/openclaw/actions/runs/25673031776

Co-authored-by: Feelw00 <dhrtn1006@naver.com>
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
Summary:
- Mark forced/manual cron runs active in the task registry until completion and clear them in finally.
- Add regression coverage for manual run success and failure cleanup.
- Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main.

Fixes openclaw#78233

Verification:
- pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts
- pnpm run lint:extensions:bundled
- pnpm test extensions/codex/src/app-server/side-question.test.ts
- CI: https://github.com/openclaw/openclaw/actions/runs/25673031776

Co-authored-by: Feelw00 <dhrtn1006@naver.com>
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Summary:
- Mark forced/manual cron runs active in the task registry until completion and clear them in finally.
- Add regression coverage for manual run success and failure cleanup.
- Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main.

Fixes openclaw#78233

Verification:
- pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts
- pnpm run lint:extensions:bundled
- pnpm test extensions/codex/src/app-server/side-question.test.ts
- CI: https://github.com/openclaw/openclaw/actions/runs/25673031776

Co-authored-by: Feelw00 <dhrtn1006@naver.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

extensions: codex proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cron: transient 'lost' marker on long-running manual runs before sweeper recovery

2 participants