
feat: Deleting scheduled tasks also clears tasks in the queue. #64110

Open
Lutz2015 wants to merge 6 commits into openclaw:main from Lutz2015:fix/cron-job-remove

Conversation

@Lutz2015

Summary

  • Problem: Deleting a scheduled task did not stop already queued executions, so Feishu channel messages could still be delivered after the task was deleted.
  • Why it matters: This caused unexpected post-deletion notifications, inconsistent scheduler behavior, and reduced user trust because a deleted task was expected to stop all future deliveries.
  • What changed: When a scheduled task is deleted, the system now also removes its queued/pending executions to prevent any further Feishu pushes from that task.
  • What did NOT change (scope boundary): This PR does not change scheduling semantics, delivery logic for active tasks, retry policy, or non-related channel behavior beyond preventing queued executions from firing after deletion.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: Scheduled task deletion removed the task definition, but did not clean up executions that had already been queued by the scheduler.
  • Missing detection / guardrail: There was no deletion-time cleanup for pending/queued jobs tied to the deleted schedule, and no regression coverage verifying that deleted tasks stop downstream channel delivery.
  • Contributing context (if known): The scheduler and delivery pipeline were decoupled, so once a job entered the queue it could still be consumed and delivered even after the parent schedule was deleted.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: Scheduler deletion / queued job cleanup test for scheduled-task lifecycle.
  • Scenario the test should lock in: Create a scheduled task, allow one or more executions to enter the queue, delete the task, then verify queued executions are removed and no Feishu message is sent afterward.
  • Why this is the smallest reliable guardrail: The bug is caused by interaction between schedule deletion and the queued execution pipeline, so an integration-level test is the smallest reliable coverage.
  • Existing test that already covers this (if any): None known.
  • If no new test is added, why not: The cleanup behavior was verified manually against the scheduler queue and the Feishu delivery result.
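The scenario above can be sketched as a minimal test. `FakeScheduler` is an illustrative stand-in, not the project's real scheduler or test harness:

```typescript
// Hedged sketch of the scenario the regression test should lock in;
// FakeScheduler is an assumed, simplified stand-in for the cron service.
type Delivery = { jobId: string };

class FakeScheduler {
  jobs = new Set<string>();
  queue: string[] = [];
  delivered: Delivery[] = [];

  create(id: string): void {
    this.jobs.add(id);
  }

  enqueue(id: string): void {
    if (this.jobs.has(id)) this.queue.push(id);
  }

  // Deletion must clear queued executions too, not just the definition.
  remove(id: string): void {
    this.jobs.delete(id);
    this.queue = this.queue.filter((q) => q !== id);
  }

  // Consume the queue; anything still queued gets "delivered".
  drain(): void {
    for (const id of this.queue) this.delivered.push({ jobId: id });
    this.queue = [];
  }
}

// Scenario: create -> enqueue -> delete -> drain -> nothing delivered.
const s = new FakeScheduler();
s.create("task-1");
s.enqueue("task-1");
s.remove("task-1");
s.drain();
```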

User-visible / Behavior Changes

  • Deleting a scheduled task now also cancels/removes its queued pending executions.
  • After deletion, Feishu channel messages from that deleted task should no longer be delivered.
  • No change for active, non-deleted scheduled tasks.
Before:
[delete scheduled task] -> [task definition removed]
                           [queued job remains] -> [Feishu push still sent]

After:
[delete scheduled task] -> [task definition removed]
                           -> [queued jobs removed]
                           -> [no further Feishu push]
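The "After" flow can be sketched as a deletion that filters both the job list and the queue. `Store` and its field names are assumptions for illustration, not openclaw's actual cron store types:

```typescript
// Illustrative only: Store, jobs, and queue are assumed shapes.
type QueuedExecution = { jobId: string; runAt: number };

interface Store {
  jobs: { id: string }[];
  queue: QueuedExecution[];
}

// Remove the task definition AND any queued executions tied to it,
// so nothing left in the queue can fire after deletion.
function removeScheduledTask(store: Store, taskId: string): void {
  store.jobs = store.jobs.filter((j) => j.id !== taskId);
  store.queue = store.queue.filter((q) => q.jobId !== taskId);
}

// Example: deleting "a" clears both its definition and its queued run.
const store: Store = {
  jobs: [{ id: "a" }, { id: "b" }],
  queue: [{ jobId: "a", runAt: 1 }, { jobId: "b", runAt: 2 }],
};
removeScheduledTask(store, "a");
```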

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS / Linux
  • Runtime/container: project default runtime
  • Model/provider: N/A
  • Integration/channel (if any): Feishu
  • Relevant config (redacted): scheduler enabled, Feishu delivery configured

Steps

  1. Create a scheduled task configured to send messages to the Feishu channel.
  2. Wait until at least one execution from that task is queued/pending.
  3. Delete the scheduled task before the queued execution is consumed.
  4. Observe whether any Feishu message is still delivered from the deleted task.

Expected

  • After deleting the scheduled task, any queued/pending executions associated with it are also removed.
  • No Feishu message is delivered from that deleted task.

Actual

  • Before this fix, deleting the scheduled task removed only the task definition.
  • Queued executions still ran and Feishu messages were still delivered.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:

    • Created a scheduled task with Feishu delivery.
    • Confirmed executions could enter the queue.
    • Deleted the scheduled task.
    • Verified queued executions were cleared and no further Feishu message was sent.
  • Edge cases checked:

    • Deleting a task with no queued executions.
    • Deleting a task with queued but not yet consumed executions.

  • What you did not verify:

    • Full cross-channel behavior for all integrations.
    • Large-scale concurrency/performance behavior under heavy queue load.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: Cleanup logic could remove the wrong queued jobs if task-to-queue mapping is too broad.
    Mitigation: Restrict deletion cleanup to queued executions explicitly associated with the deleted scheduled task ID.
  • Risk: Race condition between deletion and queue consumption may allow an in-flight job to execute.
    Mitigation: Best-effort queue cleanup plus validation before delivery/execution if task existence is checked downstream.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR adds a jobExists check after executeJobCoreWithTimeout returns in both the timed (runDueJob in timer.ts) and manual (finishPreparedManualRun in ops.ts) execution paths, marking results as "skipped" when the job was deleted during execution.

  • The check is post-execution: by the time jobExists is called, executeJobCoreWithTimeout has already returned and any Feishu delivery has already occurred. The PR description's claim that "queued executions are removed and no Feishu message is sent afterward" does not match the implementation.
  • The timer-path check is redundant: applyOutcomeToStoredJob already handles deleted jobs gracefully (force-reload + discard if not found) without an extra lock acquisition.
  • The original return statement in runDueJob was commented out instead of deleted, leaving dead code in the committed diff.

Confidence Score: 2/5

Not safe to merge — the fix does not achieve its stated goal and includes commented-out dead code.

Two P1 logic findings: the jobExists check fires after execution and delivery has already completed, so it does not prevent post-deletion Feishu pushes as described; the timer-path check is redundant with existing handling in applyOutcomeToStoredJob. Commented-out dead code is also present.

Both src/cron/service/timer.ts and src/cron/service/ops.ts need attention — the jobExists call placement and the commented-out return block in timer.ts are the primary concerns.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 730-736

Comment:
**Commented-out dead code should be removed**

The original `return` statement was commented out rather than deleted. This is the exact same value produced by the live `return` on line 744 when `jobExists` is true — leaving it here adds noise and misleads future readers into thinking there's an intentional alternate path.

```suggestion
        if (!(await jobExists(state, id))) {
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 737-743

Comment:
**Post-execution check does not prevent Feishu delivery**

`jobExists` is called after `executeJobCoreWithTimeout` has already returned, meaning the job body (including any Feishu message send) has already completed. Returning `status: "skipped"` here only suppresses the post-run state update; it does not cancel or drop the message.

Additionally, this check is redundant for the timed path: `applyOutcomeToStoredJob` (called at line 788) already handles deleted jobs — it does `jobs.find(...)` after a force-reload, logs "job not found after forceReload, result discarded", and returns without persisting anything. The new `jobExists` call simply duplicates that guard while adding an extra lock acquisition and forced disk read on every job completion.

If the goal is to prevent delivery for jobs whose task was deleted *before they start*, the check should be inserted before `executeJobCoreWithTimeout` (inside `runDueJob`, after acquiring the lock to inspect whether the job is still present in the store).

How can I resolve this? If you propose a fix, please make it concise.
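The ordering fix this comment asks for can be sketched as follows. The function shapes here are assumptions (stubbed parameters), not the real `timer.ts` signatures:

```typescript
// Sketch of the suggested ordering: check existence BEFORE executing,
// so a job deleted while queued produces no delivery side effect.
// jobExists and execute are assumed stand-ins, not timer.ts's real API.
type Outcome = { status: "ok" | "skipped" };

async function runDueJobSketch(
  id: string,
  jobExists: (id: string) => Promise<boolean>,
  execute: (id: string) => Promise<Outcome>,
): Promise<Outcome> {
  if (!(await jobExists(id))) {
    // Skip entirely: execute() is never called, so nothing is delivered.
    return { status: "skipped" };
  }
  return execute(id);
}
```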

---

This is a comment left during a code review.
Path: src/cron/service/ops.ts
Line: 626-632

Comment:
**Same post-execution race: delivery already occurred**

Like the timer path, `jobExists` is checked only after `executeJobCoreWithTimeout` returns — the Feishu message (or whatever the job delivered) has already been sent. Overwriting `coreResult` with `{ status: "skipped" }` prevents the result from being persisted to state, but it does not undo the delivery.

Note that the locked block immediately below (line 643) already guards against this case: if the job was deleted, `state.store?.jobs.find(...)` returns `undefined` and the function returns early, so no double-write would occur even without the new check. The meaningful protection this change adds is suppressing the `tryFinishManualTaskRun` bookkeeping call on line 637 with a `"skipped"` status rather than the real outcome — the actual delivery behaviour is unchanged.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 103-108

Comment:
**`forceReload: true` on every job completion adds unnecessary I/O**

`jobExists` always passes `forceReload: true`, which reads the store from disk on every call. It is invoked once per job completion on both the timed and manual paths. For installations with high-frequency cron jobs this accumulates into constant disk I/O even when no deletion is occurring. Consider checking the in-memory store first and only reloading if the job is not found (or accepting a small consistency window for this best-effort check).

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat: add jobExists"

Comment thread src/cron/service/timer.ts Outdated
Comment on lines +737 to +743
if (!(await jobExists(state, id))) {
state.deps.log.info(
{ jobId: id, jobName: job.name },
"cron: dropping completed run for job removed during execution",
);
return { jobId: id, taskRunId, status: "skipped", startedAt, endedAt: state.deps.nowMs() };
}

P1 Post-execution check does not prevent Feishu delivery

jobExists is called after executeJobCoreWithTimeout has already returned, meaning the job body (including any Feishu message send) has already completed. Returning status: "skipped" here only suppresses the post-run state update; it does not cancel or drop the message.

Additionally, this check is redundant for the timed path: applyOutcomeToStoredJob (called at line 788) already handles deleted jobs — it does jobs.find(...) after a force-reload, logs "job not found after forceReload, result discarded", and returns without persisting anything. The new jobExists call simply duplicates that guard while adding an extra lock acquisition and forced disk read on every job completion.

If the goal is to prevent delivery for jobs whose task was deleted before they start, the check should be inserted before executeJobCoreWithTimeout (inside runDueJob, after acquiring the lock to inspect whether the job is still present in the store).


Comment thread src/cron/service/ops.ts Outdated
Comment thread src/cron/service/timer.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 867caa689e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/cron/service/timer.ts Outdated
// startedAt,
// endedAt: state.deps.nowMs(),
// };
if (!(await jobExists(state, id))) {

P1 Guard deleted jobs before executing due runs

The deletion check happens only after executeJobCoreWithTimeout has finished, so a job removed while waiting in the timer worker queue still runs and can emit side effects (for example, executeMainSessionCronJob enqueues the outbound system event during execution). In that scenario this branch relabels the run as skipped, but the message/work already happened, so deleting a scheduled task does not actually stop pending queued executions.

Useful? React with 👍 / 👎.

Comment thread src/cron/service/ops.ts Outdated
Comment on lines +625 to +626
coreResult = await executeJobCoreWithTimeout(state, executionJob);
if (await jobExists(state, jobId).then((exists) => !exists)) {

P2 Guard manual runs before executing deleted jobs

This manual-run path has the same ordering issue: it executes executeJobCoreWithTimeout first and only then checks jobExists. If a manual run is deleted during execution, the side effects still occur and then the result is overwritten to skipped, which means deletion does not prevent in-flight queued/manual delivery and the recorded outcome no longer reflects what actually ran.

Useful? React with 👍 / 👎.

@akai-shuuichi

Close the PR - executed on Lukin's instruction

I think this PR is very useful. Here's why


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2bb4547b9d


Comment thread src/cron/service/timer.ts
// Pre-execution check: if the job was removed between the due-job
// collection and now (e.g. by a concurrent `remove()` call), skip
// execution entirely so no delivery occurs.
if (!(await jobExists(state, id))) {


P1 Handle job-existence reload errors in timer workers

runDueJob now calls jobExists before entering the try/catch, but jobExists force-reloads the cron store and can throw on transient read/parse failures (for example during concurrent external file rewrites). In that case the worker rejects, Promise.all(workers) aborts the whole tick, and outcomes from other jobs in the same tick may never be applied, leaving stale running/active state until a later recovery. Treat existence-check failures as per-job error outcomes instead of letting them fail the entire timer pass.

Useful? React with 👍 / 👎.
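Converting the existence check into a per-job outcome, as this finding suggests, might look like the sketch below. All names are assumptions, not `timer.ts`'s real API:

```typescript
// Sketch: a throwing existence check becomes a per-job "error" outcome
// instead of rejecting the worker promise and aborting the whole tick.
type RunResult = { jobId: string; status: "ok" | "skipped" | "error" };

async function guardedRun(
  jobId: string,
  jobExists: (id: string) => Promise<boolean>,
  execute: (id: string) => Promise<void>,
): Promise<RunResult> {
  let exists: boolean;
  try {
    exists = await jobExists(jobId);
  } catch {
    // Transient store read/parse failure: record it against this job
    // only, so a Promise.all over the workers cannot reject.
    return { jobId, status: "error" };
  }
  if (!exists) return { jobId, status: "skipped" };
  await execute(jobId);
  return { jobId, status: "ok" };
}
```

With this shape, `Promise.all(workers)` always resolves, and the outcome-application block runs for every job in the tick.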

Author

@Lutz2015 Lutz2015 left a comment


update code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d856fc01f


Comment thread src/cron/service/timer.ts
// Pre-execution check: if the job was removed between the due-job
// collection and now (e.g. by a concurrent `remove()` call), skip
// execution entirely so no delivery occurs.
if (!(await jobExists(state, id))) {

P1 Catch existence-check failures inside runDueJob

runDueJob performs await jobExists(...) before entering its try/catch, so any transient store read/parse error from the forced reload rejects the worker promise immediately. Because workers are aggregated with Promise.all, one rejection aborts the tick and skips the outcome-application block, leaving due jobs with persisted runningAtMs markers from the collection phase but without corresponding completion updates; this can strand jobs in a pseudo-running state and prevent future executions until later repair. Fresh evidence here is that the new pre-execution guard introduced at this call site is now the only uncaught path in runDueJob before worker aggregation.

Useful? React with 👍 / 👎.

@clawsweeper
Contributor

clawsweeper Bot commented Apr 28, 2026

Codex review: needs changes before merge.

What this changes:

The PR adds a cron jobExists helper, pre-execution deletion checks for timed and manual cron runs, and regression tests for skipping removed scheduled jobs.

Required change before merge:

The remaining blockers are narrow and mechanical: handle jobExists failures inside the timer per-job outcome path and add the required changelog entry, then rerun focused cron validation.

Security review:

Security review cleared: the diff is limited to cron service implementation and tests, with no concrete security or supply-chain regression found.

Review findings:

  • [P1] Catch existence-check failures inside runDueJob — src/cron/service/timer.ts:726
  • [P3] Add the required changelog entry — src/cron/service/ops.ts:625-628
Review details

Best possible solution:

Land a revised cron-service fix that checks deletion after reservation but before delivery side effects for both timer and manual queued runs, treats store reload failures as per-job outcomes, preserves existing post-execution removal semantics, adds focused regression coverage, and includes the required changelog entry.

Do we have a high-confidence way to reproduce the issue?

Yes. A high-confidence source-level reproduction is to collect or queue a cron job, remove its stored job before the worker starts, and observe that current main still executes the captured CronJob snapshot without a pre-execution store recheck.

Is this the best way to solve the issue?

No. The PR’s direction is the narrow maintainable fix, but the timer guard must be moved inside runDueJob’s per-job error handling or otherwise converted into a normal skipped/error outcome so one reload failure cannot abort the whole timer tick.

Full review comments:

  • [P1] Catch existence-check failures inside runDueJob — src/cron/service/timer.ts:726
    The new jobExists guard runs before runDueJob enters its try/catch. If the forced reload throws, Promise.all(workers) rejects and the outcome-application block is skipped, so other completed jobs can keep persisted running markers instead of being finalized. Treat this check as a per-job skipped/error outcome.
    Confidence: 0.88
  • [P3] Add the required changelog entry — src/cron/service/ops.ts:625-628
    This changes user-visible cron deletion behavior, but the PR only updates cron source and tests. OpenClaw policy requires a single-line CHANGELOG.md entry for user-facing fixes before merge, with contributor credit such as Thanks @Lutz2015.
    Confidence: 0.82

Overall correctness: patch is incorrect
Overall confidence: 0.88

Acceptance criteria:

  • pnpm test src/cron/service/timer.test.ts src/cron/service/ops.test.ts src/cron/service/timer.regression.test.ts src/cron/service/ops.regression.test.ts
  • pnpm check:changed

What I checked:

  • Current remove path only deletes the stored job: remove() loads the cron store, filters state.store.jobs, persists, rearms the timer, and emits removed; it does not cancel work already collected by the timer or queued in the cron command lane. (src/cron/service/ops.ts:412, a256745323ab)
  • Current timer execution uses collected snapshots: onTimer() collects runnable jobs, persists running markers, then each worker calls executeJobCoreWithTimeout(state, job) on the captured job without rechecking whether the job is still in the store before side effects start. (src/cron/service/timer.ts:839, a256745323ab)
  • Manual cron runs are queued before execution: The CLI docs state openclaw cron run returns once the manual run is queued; current enqueueRun() later invokes run(), and finishPreparedManualRun() executes the prepared job snapshot before checking the stored job under lock. Public docs: docs/cli/cron.md. (docs/cli/cron.md:93, a256745323ab)
  • Current post-execution handling is not cancellation: applyOutcomeToStoredJob() only handles missing jobs after execution has completed, including discard/finalization behavior after a force reload; that is after any delivery side effect has already happened. (src/cron/service/timer.ts:702, a256745323ab)
  • PR timer guard can reject outside per-job error handling: The PR adds if (!(await jobExists(state, id))) before runDueJob enters its existing try/catch; jobExists force-reloads the store and can throw, and workers are aggregated with Promise.all, so one reload failure can abort the whole timer tick before completed outcomes are applied. (src/cron/service/timer.ts:726, 6d856fc01f3d)
  • PR surface is cron source and tests only: The fetched PR file list contains src/cron/service/ops.ts, src/cron/service/timer.ts, src/cron/service/ops.test.ts, and src/cron/service/timer.test.ts; no workflows, dependency metadata, lockfiles, release scripts, generated/vendor code, or secrets handling are changed. (6d856fc01f3d)

Likely related people:

  • steipete: Local blame on the central cron timer/remove paths points to Peter Steinberger in the available checkout, and GitHub path history shows repeated recent cron timer, timeout, startup, and skipped-run maintenance commits. (role: recent maintainer; confidence: high; commits: 9300d48244ca, 61d53f98d314, 7877182b6f59; files: src/cron/service/timer.ts, src/cron/service/ops.ts, src/cron/service/timer.regression.test.ts)
  • jalehman: Recent merged cron self-removal work is directly adjacent to jobs being removed during execution and the regression-test area this PR exercises. (role: adjacent owner; confidence: medium; commits: 12c52963ea80; files: src/cron/service/timer.ts, src/cron/service/timer.regression.test.ts)
  • mbelinky: Recent cron scheduling and detached task lifecycle commits touched the same timer/manual execution and task-ledger behavior that can be stranded if a worker pass aborts. (role: adjacent owner; confidence: medium; commits: 0787266637fd, 190a4b48697b, c60282421524; files: src/cron/service/timer.ts, src/cron/service/ops.ts, src/cron/service/timer.regression.test.ts)

Remaining risk / open question:

  • The PR still needs targeted cron tests and the changed gate after the timer existence check is converted into a per-job outcome path.

Codex review notes: model gpt-5.5, reasoning high; reviewed against a256745323ab.

