fix(cron): protect tick loop from cascading mark_job_run failures#8290
fix(cron): protect tick loop from cascading mark_job_run failures#8290jooray wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the cron scheduler against cascading failures by ensuring exceptions inside mark_job_run() (and its compute_next_run() call) can’t abort the tick loop or prevent job state from being persisted.
Changes:
- Wrap the fallback
mark_job_run()call insidetick()’s exception handler so a secondary failure doesn’t terminate the tick loop. - Wrap
compute_next_run()insidemark_job_run()so schedule-computation errors don’t preventsave_jobs()from persisting updates.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
cron/scheduler.py |
Prevents tick-loop abort when mark_job_run() fails during error handling. |
cron/jobs.py |
Ensures save_jobs() is still reached even if compute_next_run() raises. |
Comments suppressed due to low confidence (1)
cron/jobs.py:629
- If compute_next_run raises, next_run_at is left unchanged and the code proceeds to state transitions. In cases where the stored next_run_at is already <= now (e.g., the job became due but advance_next_run failed due to the same schedule issue), this can lead to the job being picked up and failing on every subsequent tick, creating a noisy error loop. Consider explicitly pausing/disabling the job (or setting a safe future next_run_at) when compute_next_run fails to avoid repeated immediate retries.
try:
job["next_run_at"] = compute_next_run(job["schedule"], now)
except Exception as exc:
logger.error(
"Job '%s': compute_next_run failed (%s); keeping previous next_run_at",
job_id, exc,
)
# If no next run (one-shot completed), disable
if job["next_run_at"] is None:
job["enabled"] = False
job["state"] = "completed"
elif job.get("state") != "paused":
job["state"] = "scheduled"
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| job["next_run_at"] = compute_next_run(job["schedule"], now) | ||
| except Exception as exc: | ||
| logger.error( |
There was a problem hiding this comment.
The exception handler for compute_next_run logs with logger.error but drops the traceback. Since this is catching a broad Exception and will be the primary signal for diagnosing bad schedule payloads/croniter failures, consider using logger.exception (or logger.error(..., exc_info=True)) so the stack trace is preserved in logs.
| logger.error( | |
| logger.exception( |
| # Compute next run — protect against exceptions so save_jobs | ||
| # is always reached and the job doesn't get stuck with null. | ||
| try: | ||
| job["next_run_at"] = compute_next_run(job["schedule"], now) | ||
| except Exception as exc: | ||
| logger.error( | ||
| "Job '%s': compute_next_run failed (%s); keeping previous next_run_at", | ||
| job_id, exc, | ||
| ) | ||
|
|
There was a problem hiding this comment.
There are existing unit tests for mark_job_run, but no coverage for the new defensive path where compute_next_run raises. Please add a test that forces compute_next_run to raise (e.g., via patch) and verifies mark_job_run still persists updates and leaves next_run_at in a sane state (unchanged from its previous value, or the chosen fallback).
| logger.exception( | ||
| "CRITICAL: mark_job_run also failed for job %s " | ||
| "(original error: %s, mark_job_run error: %s)", | ||
| job['id'], e, e2, |
There was a problem hiding this comment.
The log message prefix says "CRITICAL:" but this uses logger.exception (ERROR level). To avoid misleading severity in logs, either log at critical level (logger.critical(..., exc_info=True)) or remove the "CRITICAL" label from the message.
| logger.exception( | |
| "CRITICAL: mark_job_run also failed for job %s " | |
| "(original error: %s, mark_job_run error: %s)", | |
| job['id'], e, e2, | |
| logger.critical( | |
| "CRITICAL: mark_job_run also failed for job %s " | |
| "(original error: %s, mark_job_run error: %s)", | |
| job['id'], e, e2, | |
| exc_info=True, |
| except Exception as e: | ||
| logger.error("Error processing job %s: %s", job['id'], e) | ||
| mark_job_run(job["id"], False, str(e)) | ||
| try: | ||
| mark_job_run(job["id"], False, str(e)) | ||
| except Exception as e2: | ||
| logger.exception( | ||
| "CRITICAL: mark_job_run also failed for job %s " | ||
| "(original error: %s, mark_job_run error: %s)", | ||
| job['id'], e, e2, | ||
| ) |
There was a problem hiding this comment.
tick() has extensive tests, but there isn’t coverage for the newly added nested failure path (original job processing fails and mark_job_run also raises). Please add a test that sets up multiple due jobs, forces mark_job_run to raise in the except path, and asserts tick() continues processing remaining jobs without raising.
df34afb to
fb4d959
Compare
fb4d959 to
a8ed146
Compare
09cf8f6 to
d7fd8ea
Compare
7b67863 to
3995f7c
Compare
fe21297 to
d129b0f
Compare
d129b0f to
a608f96
Compare
b9da010 to
36bb48a
Compare
Two issues could cause all cron jobs to get stuck with next_run_at: null: 1. In mark_job_run(), compute_next_run() is called without exception protection. If it throws, save_jobs() is never reached and the job state is corrupted. 2. In tick(), when the primary mark_job_run call fails, the except block retries mark_job_run -- but if the retry also fails, the exception propagates and breaks processing of remaining due jobs. Fix: wrap compute_next_run in try/except to preserve the previous next_run_at on failure, and wrap the retry mark_job_run in tick() so one broken job cannot prevent others from running.
36bb48a to
745bb30
Compare
Summary
cron/jobs.py: Wrapcompute_next_runinmark_job_runwith try/except so a scheduling-expression error doesn't preventsave_jobsfrom running. Without this, a single bad schedule can leavenext_run_atasnull, permanently disabling the job until manual intervention.cron/scheduler.py: Wrap the fallbackmark_job_runcall in thetick()exception handler. Previously, if the primary job execution failed andmark_job_runalso raised, the unhandled exception would abort the tick loop for all remaining due jobs.Both changes are purely defensive error-handling — no behavioral change when everything works normally.
Reproduction
delegate_tasktimes out (900s+)compute_next_runraises duringmark_job_run,next_run_atis set tonullTest plan
EOF
)