Skip to content

cron: transient 'lost' marker on long-running manual runs before sweeper recovery #78233

@Feelw00

Description

@Feelw00

Summary

Manual cron runs that exceed TASK_RECONCILE_GRACE_MS (5 min) surface a transient lost task-registry status and a Background task lost system message in the grace window before resolveDurableCronTaskRecovery (1fae716) reconciles. The final result is correct, but the user who triggered the run sees an inaccurate intermediate state.

Where

  • src/cron/service/ops.ts prepareManualRun / finishPreparedManualRun — task is created via tryCreateManualTaskRun but markCronJobActive is never called, so task-registry.maintenance.ts hasBackingSession returns false for the cron branch under isCronRuntimeAuthoritative()=true.
  • Compare with runDueJob / executeJob (timer.ts), which were wired to markCronJobActive/clearCronJobActive by fix: honor exec approval security and clean up stale tasks #60310.

Repro

  1. Schedule a cron job with payload.kind: \"agentTurn\" (force mode → up to 60-min timeout) or any manual-run path that takes longer than 5 minutes.
  2. Trigger via `openclaw cron run --force` (or RPC equivalent).
  3. After ~5 min: task-registry sweeper marks the active task `lost`, `Background task lost` system message is emitted to the session.
  4. Once the run completes, `applyJobResult` updates `lastRunStatus`; the next maintenance tick reconciles via `resolveDurableCronTaskRecovery`.

Why this is narrower than #68191

#68191 covers the broader durable-recovery story already addressed by 1fae716. This issue is the residual UX gap on the manual-run path: producer-side `markCronJobActive`/`clearCronJobActive` would prevent the transient lost state entirely, complementing the sweeper-side recovery already in main.

Possible directions (not prescriptive — happy to defer to maintainer preference)

  • Mirror the `markCronJobActive` / `clearCronJobActive` pair into `prepareManualRun` + `finishPreparedManualRun` (try/finally), matching the contract on `runDueJob` / `executeJob`.
  • Or treat as wontfix if the transient lost surface is acceptable post-`1fae716a04`.

Severity

Low — final state is correct; only intermediate UX noise on long manual runs.

Why filing as issue (not PR)

Closed prior PR #71040 because its primary scenario (isolated agentTurn startup catchup) was superseded by `deferAgentTurnJobs:true` (7877182) and `1fae716a04`. This narrower scope deserves maintainer triage before another PR.

[AI-assisted analysis]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions