fix(cron): mark active-jobs on manual-run path to suppress transient lost marker#78243
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. Current-main source shows the manual-run producer path creates a cron task without setting the active-job marker while task maintenance relies on Real behavior proof Next step before merge Security Review detailsBest possible solution: Land this PR or an equivalent patch after exact-head CI is green, then close #78233 as implemented. Do we have a high-confidence way to reproduce the issue? Yes. Current-main source shows the manual-run producer path creates a cron task without setting the active-job marker while task maintenance relies on Is this the best way to solve the issue? Yes. Reusing the existing What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against ee5b06f9fe99. Re-review progress:
|
5c4458d to
e0f0de3
Compare
e0f0de3 to
289d197
Compare
289d197 to
ef3f288
Compare
ef3f288 to
17e681b
Compare
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
17e681b to
a537cd6
Compare
|
Landed in c1b59a9. Best-fix review: this is the right scoped fix. Scheduled cron runs already marked active in the registry; the forced/manual run path created a task run without marking it active, so long manual runs could look transiently lost before durable recovery reconciled. Keeping the mark/clear in the manual run path matches the existing scheduled-run lifecycle without adding startup/catchup refactor risk. Verification:
Thanks @Feelw00. |
|
Thanks @steipete |
Summary: - Mark forced/manual cron runs active in the task registry until completion and clear them in finally. - Add regression coverage for manual run success and failure cleanup. - Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main. Fixes openclaw#78233 Verification: - pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts - pnpm run lint:extensions:bundled - pnpm test extensions/codex/src/app-server/side-question.test.ts - CI: https://github.com/openclaw/openclaw/actions/runs/25673031776 Co-authored-by: Feelw00 <dhrtn1006@naver.com>
Summary: - Mark forced/manual cron runs active in the task registry until completion and clear them in finally. - Add regression coverage for manual run success and failure cleanup. - Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main. Fixes openclaw#78233 Verification: - pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts - pnpm run lint:extensions:bundled - pnpm test extensions/codex/src/app-server/side-question.test.ts - CI: https://github.com/openclaw/openclaw/actions/runs/25673031776 Co-authored-by: Feelw00 <dhrtn1006@naver.com>
Summary: - Mark forced/manual cron runs active in the task registry until completion and clear them in finally. - Add regression coverage for manual run success and failure cleanup. - Update changelog for openclaw#78243 and apply a small lint-only test fix needed after rebasing on latest main. Fixes openclaw#78233 Verification: - pnpm test src/plugin-sdk/channel-streaming.test.ts src/cron/active-jobs-manual-run.test.ts - pnpm run lint:extensions:bundled - pnpm test extensions/codex/src/app-server/side-question.test.ts - CI: https://github.com/openclaw/openclaw/actions/runs/25673031776 Co-authored-by: Feelw00 <dhrtn1006@naver.com>
Summary
openclaw cron run <id> --forceand the matching RPC / agent-tool surface) that exceedTASK_RECONCILE_GRACE_MS(5 min) surface a transientlosttask-registry status plus aBackground task lostsystem message in the grace window beforeresolveDurableCronTaskRecovery(1fae716, merged 2026-04-26) reconciles. Final state is correct, but the user triggering the run sees an inaccurate intermediate state.agentTurnmanual runs can legitimately run up toAGENT_TURN_SAFETY_TIMEOUT_MS(60 min); the default-mode timeout is 10 min — both well past the 5-minute grace.markCronJobActive/clearCronJobActivepair (introduced forrunDueJob/executeJobby fix: honor exec approval security and clean up stale tasks #60310) intoprepareManualRun+finishPreparedManualRun, sotask-registry.maintenance.tshasBackingSessionreturns true for the duration of the manual run.runStartupCatchupCandidateis intentionally untouched —deferAgentTurnJobs:true(7877182) reroutes long-running startup catchups torunDueJob/executeJob(already wired), and the merged1fae716a04covers the residual non-agentTurn case from a different axis. No public API surface change.Change Type
Scope
src/cron/service/ops.ts(3 lines added — import +markCronJobActive+try/finallywrap withclearCronJobActive)src/cron/active-jobs-manual-run.test.ts(new, 160 lines, 2 cases)Linked Issue
Fixes #78233
Root Cause
task-registry.maintenance.tshasBackingSessionfor the cron branch underisCronRuntimeAuthoritative()=true(production default) depends solely onisCronJobActive(jobId).prepareManualRun(ops.ts:654) creates a manualtryCreateManualTaskRuntask but never callsmarkCronJobActive;finishPreparedManualRun(ops.ts:702) likewise never clears it. WithTASK_RECONCILE_GRACE_MS = 5 minand force-mode manual runs reaching up to 60 min, the sweeper marks the still-running tasklostand emits theBackground task lostsystem message before the run completes.Regression Test Plan
src/cron/active-jobs-manual-run.test.ts— two cases against the production hot-path (cron.run("<id>", "force")direct invocation, no internal-API rerouting):isCronJobActivetransitionstrue(mid-run) →false(after run resolves).isCronJobActiveis also cleared by thefinallyblock when the inner agent run rejects.src/cron/service.test.ts,src/cron/service.restart-catchup.test.ts) untouched.Security Impact
None. Internal cron service state only — no auth, secrets, network, or cross-process surfaces. No CODEOWNERS-restricted paths.
Repro + Verification
pnpm vitest run src/cron/active-jobs-manual-run.test.ts→ 2/2 passpnpm tsgo:core:test→ exit 0ops.tschange in this PR; the new test then fails becauseisCronJobActivereturnsfalsemid-run.Evidence
Failing test before fix (locally verified by reverting just the
ops.tschange): both new cases fail at the mid-runexpect(isCronJobActive(...)).toBe(true)assertion. Passing after fix: both green.Human Verification
Code paths traced:
prepareManualRun→tryCreateManualTaskRun→cron.run→finishPreparedManualRun→applyJobResult. Confirmed againsttask-registry.maintenance.tshasBackingSession(cron branch) andresolveCronJobStateRecovery(lastRunAtMsmatching).Review Conversations
This is a narrower follow-up to closed PR #71040 (which also touched
runStartupCatchupCandidate). The full PR was closed after a 5-agent pre-PR cross-review found that1fae716a04had already addressed the same axis from the sweeper side, anddeferAgentTurnJobs:trueremoved the agentTurn startup-catchup scenario entirely. This PR retains only the producer-side gap that1fae716a04does not close: the transient lost marker during the 5-minute grace window before sweeper recovery reconciles.Compatibility / Migration
None. Internal-only API mirroring an existing contract.
Risks and Mitigations
prepareManualRunalready prevents re-entry (runningAtMsguard); even if mark/clear were called twice for the samejobIdthe activeJobIds set is idempotent.try/finallyand the dedicated regression test.1fae716a04: orthogonal — sweeper recovery still applies; this fix only suppresses the transientlostsurface inside the grace window.[AI-assisted, fully tested]
Real behavior proof
Behavior or issue addressed: Without this patch, manual cron runs that exceed
TASK_RECONCILE_GRACE_MS(5 min) are markedlostbytask-registry.maintenance.tshasBackingSession(cron branch underisCronRuntimeAuthoritative()=true) because no producer-sidemarkCronJobActiveis called fromprepareManualRun. With this patch, the same long-running manual run keepsisCronJobActive=truefor the duration and is never markedlostby the sweeper.Real environment tested: macOS 25.4 (darwin arm64), Node 23.9, OpenClaw 2026.5.5 from this branch, locally built. Local gateway (
openclaw gateway run, loopback, port 18789) configured withopenclaw onboard --non-interactive --accept-risk --mode local --auth-choice skip, then OpenAI Codex OAuth bound viaopenclaw models auth login --provider openai-codex(ChatGPT Plus subscription, profileopenai-codex:<email>,agentRuntime: codex,agents.defaults.model.primary: openai-codex/gpt-5.5). Same gateway, same cron job, same codex auth across both builds; onlysrc/cron/service/ops.tsdiffers.Exact steps or command run after this patch:
For the without-fix comparison, the same steps were run on a build where
src/cron/service/ops.tswas reverted to the base commit (git checkout ea391c6df2 -- src/cron/service/ops.ts) beforepnpm build.Evidence after fix: live
task_runsrows from~/.openclaw/tasks/runs.sqlite(same cron job, two builds).Gateway log excerpts (cron + agent activity):
Observed result after fix: With the patch,
task_runs.statusfor a 7-minute manual cron run stays out oflostand finalizes asfailed/succeededdirectly. Without the patch,task-registry.maintenance.tshasBackingSessionreturns false at the first sweeper tick afterTASK_RECONCILE_GRACE_MS(5 min) and marks the still-running cron tasklostwitherror="backing session missing". The only difference between the two runs above is the four-lineops.tschange in this PR; cron job, codex auth, gateway config, and ChatGPT subscription are identical.What was not tested: The full chain from
Background task lostsystem message to its retroactive reconciliation byresolveDurableCronTaskRecovery(1fae716a04) was not visually exercised in this setup, since the Codex-driven runs hit the ChatGPT subscription rate limit before normal completion. The producer-side mark/clear gap that this PR closes is fully exercised: theloststatus flip happens (without patch) or is suppressed (with patch) deterministically atTASK_RECONCILE_GRACE_MS, which is the user-visible step the patch targets.