Skip to content

fix(cron): protect tick loop from cascading mark_job_run failures#8290

Open
jooray wants to merge 1 commit into
NousResearch:mainfrom
jooray:fix/cron-scheduler-resilience
Open

fix(cron): protect tick loop from cascading mark_job_run failures#8290
jooray wants to merge 1 commit into
NousResearch:mainfrom
jooray:fix/cron-scheduler-resilience

Conversation

@jooray

@jooray jooray commented Apr 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • cron/jobs.py: Wrap compute_next_run in mark_job_run with try/except so a scheduling-expression error doesn't prevent save_jobs from running. Without this, a single bad schedule can leave next_run_at as null, permanently disabling the job until manual intervention.
  • cron/scheduler.py: Wrap the fallback mark_job_run call in the tick() exception handler. Previously, if the primary job execution failed and mark_job_run also raised, the unhandled exception would abort the tick loop for all remaining due jobs.

Both changes are purely defensive error-handling — no behavioral change when everything works normally.

Reproduction

  1. Create a cron job whose delegate_task times out (900s+)
  2. If compute_next_run raises during mark_job_run, next_run_at is set to null
  3. On next tick, the job is skipped (not a one-shot, so recovery logic doesn't apply
  4. The job never runs again until is manually edited

Test plan

  • Verify existing cron tests pass
  • Create a job, kill the gateway mid-execution, restart — job should recover with a valid
  • Inject a bad schedule expression — job should log error but not break other jobs in the same tick
    EOF
    )

Copilot AI review requested due to automatic review settings April 12, 2026 09:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the cron scheduler against cascading failures by ensuring exceptions inside mark_job_run() (and its compute_next_run() call) can’t abort the tick loop or prevent job state from being persisted.

Changes:

  • Wrap the fallback mark_job_run() call inside tick()’s exception handler so a secondary failure doesn’t terminate the tick loop.
  • Wrap compute_next_run() inside mark_job_run() so schedule-computation errors don’t prevent save_jobs() from persisting updates.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
cron/scheduler.py Prevents tick-loop abort when mark_job_run() fails during error handling.
cron/jobs.py Ensures save_jobs() is still reached even if compute_next_run() raises.
Comments suppressed due to low confidence (1)

cron/jobs.py:629

  • If compute_next_run raises, next_run_at is left unchanged and the code proceeds to state transitions. In cases where the stored next_run_at is already <= now (e.g., the job became due but advance_next_run failed due to the same schedule issue), this can lead to the job being picked up and failing on every subsequent tick, creating a noisy error loop. Consider explicitly pausing/disabling the job (or setting a safe future next_run_at) when compute_next_run fails to avoid repeated immediate retries.
            try:
                job["next_run_at"] = compute_next_run(job["schedule"], now)
            except Exception as exc:
                logger.error(
                    "Job '%s': compute_next_run failed (%s); keeping previous next_run_at",
                    job_id, exc,
                )

            # If no next run (one-shot completed), disable
            if job["next_run_at"] is None:
                job["enabled"] = False
                job["state"] = "completed"
            elif job.get("state") != "paused":
                job["state"] = "scheduled"

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cron/jobs.py Outdated
try:
job["next_run_at"] = compute_next_run(job["schedule"], now)
except Exception as exc:
logger.error(

Copilot AI Apr 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception handler for compute_next_run logs with logger.error but drops the traceback. Since this is catching a broad Exception and will be the primary signal for diagnosing bad schedule payloads/croniter failures, consider using logger.exception (or logger.error(..., exc_info=True)) so the stack trace is preserved in logs.

Suggested change
logger.error(
logger.exception(

Copilot uses AI. Check for mistakes.
Comment thread cron/jobs.py Outdated
Comment on lines 614 to 623
# Compute next run — protect against exceptions so save_jobs
# is always reached and the job doesn't get stuck with null.
try:
job["next_run_at"] = compute_next_run(job["schedule"], now)
except Exception as exc:
logger.error(
"Job '%s': compute_next_run failed (%s); keeping previous next_run_at",
job_id, exc,
)

Copilot AI Apr 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are existing unit tests for mark_job_run, but no coverage for the new defensive path where compute_next_run raises. Please add a test that forces compute_next_run to raise (e.g., via patch) and verifies mark_job_run still persists updates and leaves next_run_at in a sane state (unchanged from its previous value, or the chosen fallback).

Copilot uses AI. Check for mistakes.
Comment thread cron/scheduler.py
Comment on lines +978 to +981
logger.exception(
"CRITICAL: mark_job_run also failed for job %s "
"(original error: %s, mark_job_run error: %s)",
job['id'], e, e2,

Copilot AI Apr 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message prefix says "CRITICAL:" but this uses logger.exception (ERROR level). To avoid misleading severity in logs, either log at critical level (logger.critical(..., exc_info=True)) or remove the "CRITICAL" label from the message.

Suggested change
logger.exception(
"CRITICAL: mark_job_run also failed for job %s "
"(original error: %s, mark_job_run error: %s)",
job['id'], e, e2,
logger.critical(
"CRITICAL: mark_job_run also failed for job %s "
"(original error: %s, mark_job_run error: %s)",
job['id'], e, e2,
exc_info=True,

Copilot uses AI. Check for mistakes.
Comment thread cron/scheduler.py
Comment on lines 973 to +982
except Exception as e:
logger.error("Error processing job %s: %s", job['id'], e)
mark_job_run(job["id"], False, str(e))
try:
mark_job_run(job["id"], False, str(e))
except Exception as e2:
logger.exception(
"CRITICAL: mark_job_run also failed for job %s "
"(original error: %s, mark_job_run error: %s)",
job['id'], e, e2,
)

Copilot AI Apr 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tick() has extensive tests, but there isn’t coverage for the newly added nested failure path (original job processing fails and mark_job_run also raises). Please add a test that sets up multiple due jobs, forces mark_job_run to raise in the except path, and asserts tick() continues processing remaining jobs without raising.

Copilot uses AI. Check for mistakes.
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch 3 times, most recently from df34afb to fb4d959 Compare April 20, 2026 09:13
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch from fb4d959 to a8ed146 Compare April 22, 2026 10:42
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management labels Apr 22, 2026
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch 2 times, most recently from 09cf8f6 to d7fd8ea Compare April 24, 2026 09:08
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch 2 times, most recently from 7b67863 to 3995f7c Compare May 1, 2026 13:07
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch 2 times, most recently from fe21297 to d129b0f Compare May 10, 2026 08:03
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch from d129b0f to a608f96 Compare May 18, 2026 10:55
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch 3 times, most recently from b9da010 to 36bb48a Compare May 30, 2026 07:44
Two issues could cause all cron jobs to get stuck with next_run_at: null:

1. In mark_job_run(), compute_next_run() is called without exception
   protection. If it throws, save_jobs() is never reached and the job
   state is corrupted.

2. In tick(), when the primary mark_job_run call fails, the except
   block retries mark_job_run -- but if the retry also fails, the
   exception propagates and breaks processing of remaining due jobs.

Fix: wrap compute_next_run in try/except to preserve the previous
next_run_at on failure, and wrap the retry mark_job_run in tick()
so one broken job cannot prevent others from running.
@jooray jooray force-pushed the fix/cron-scheduler-resilience branch from 36bb48a to 745bb30 Compare June 9, 2026 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants