fix(cron): protect tick loop from cascading mark_job_run failures by jooray · Pull Request #8290 · NousResearch/hermes-agent

jooray · 2026-04-12T09:35:16Z

Summary

cron/jobs.py: Wrap compute_next_run in mark_job_run with try/except so a scheduling-expression error doesn't prevent save_jobs from running. Without this, a single bad schedule can leave next_run_at as null, permanently disabling the job until manual intervention.
cron/scheduler.py: Wrap the fallback mark_job_run call in the tick() exception handler. Previously, if the primary job execution failed and mark_job_run also raised, the unhandled exception would abort the tick loop for all remaining due jobs.

Both changes are purely defensive error-handling — no behavioral change when everything works normally.

Reproduction

Create a cron job whose delegate_task times out (900s+)
If compute_next_run raises during mark_job_run, next_run_at is set to null
On next tick, the job is skipped (not a one-shot, so recovery logic doesn't apply
The job never runs again until is manually edited

Test plan

Verify existing cron tests pass
Create a job, kill the gateway mid-execution, restart — job should recover with a valid
Inject a bad schedule expression — job should log error but not break other jobs in the same tick
EOF
)

Copilot

Pull request overview

This PR hardens the cron scheduler against cascading failures by ensuring exceptions inside mark_job_run() (and its compute_next_run() call) can’t abort the tick loop or prevent job state from being persisted.

Changes:

Wrap the fallback mark_job_run() call inside tick()’s exception handler so a secondary failure doesn’t terminate the tick loop.
Wrap compute_next_run() inside mark_job_run() so schedule-computation errors don’t prevent save_jobs() from persisting updates.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`cron/scheduler.py`	Prevents tick-loop abort when `mark_job_run()` fails during error handling.
`cron/jobs.py`	Ensures `save_jobs()` is still reached even if `compute_next_run()` raises.

Comments suppressed due to low confidence (1)

cron/jobs.py:629

If compute_next_run raises, next_run_at is left unchanged and the code proceeds to state transitions. In cases where the stored next_run_at is already <= now (e.g., the job became due but advance_next_run failed due to the same schedule issue), this can lead to the job being picked up and failing on every subsequent tick, creating a noisy error loop. Consider explicitly pausing/disabling the job (or setting a safe future next_run_at) when compute_next_run fails to avoid repeated immediate retries.

            try:
                job["next_run_at"] = compute_next_run(job["schedule"], now)
            except Exception as exc:
                logger.error(
                    "Job '%s': compute_next_run failed (%s); keeping previous next_run_at",
                    job_id, exc,
                )

            # If no next run (one-shot completed), disable
            if job["next_run_at"] is None:
                job["enabled"] = False
                job["state"] = "completed"
            elif job.get("state") != "paused":
                job["state"] = "scheduled"

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-12T09:39:34Z

+            try:
+                job["next_run_at"] = compute_next_run(job["schedule"], now)
+            except Exception as exc:
+                logger.error(


The exception handler for compute_next_run logs with logger.error but drops the traceback. Since this is catching a broad Exception and will be the primary signal for diagnosing bad schedule payloads/croniter failures, consider using logger.exception (or logger.error(..., exc_info=True)) so the stack trace is preserved in logs.

Suggested change

logger.error(

logger.exception(

Copilot · 2026-04-12T09:39:34Z

+            # Compute next run — protect against exceptions so save_jobs
+            # is always reached and the job doesn't get stuck with null.
+            try:
+                job["next_run_at"] = compute_next_run(job["schedule"], now)
+            except Exception as exc:
+                logger.error(
+                    "Job '%s': compute_next_run failed (%s); keeping previous next_run_at",
+                    job_id, exc,
+                )



There are existing unit tests for mark_job_run, but no coverage for the new defensive path where compute_next_run raises. Please add a test that forces compute_next_run to raise (e.g., via patch) and verifies mark_job_run still persists updates and leaves next_run_at in a sane state (unchanged from its previous value, or the chosen fallback).

Copilot · 2026-04-12T09:39:34Z

+                    logger.exception(
+                        "CRITICAL: mark_job_run also failed for job %s "
+                        "(original error: %s, mark_job_run error: %s)",
+                        job['id'], e, e2,


The log message prefix says "CRITICAL:" but this uses logger.exception (ERROR level). To avoid misleading severity in logs, either log at critical level (logger.critical(..., exc_info=True)) or remove the "CRITICAL" label from the message.

Suggested change

logger.exception(

"CRITICAL: mark_job_run also failed for job %s "

"(original error: %s, mark_job_run error: %s)",

job['id'], e, e2,

logger.critical(

"CRITICAL: mark_job_run also failed for job %s "

"(original error: %s, mark_job_run error: %s)",

job['id'], e, e2,

exc_info=True,

Copilot · 2026-04-12T09:39:35Z

            except Exception as e:
                logger.error("Error processing job %s: %s", job['id'], e)
-                mark_job_run(job["id"], False, str(e))
+                try:
+                    mark_job_run(job["id"], False, str(e))
+                except Exception as e2:
+                    logger.exception(
+                        "CRITICAL: mark_job_run also failed for job %s "
+                        "(original error: %s, mark_job_run error: %s)",
+                        job['id'], e, e2,
+                    )


tick() has extensive tests, but there isn’t coverage for the newly added nested failure path (original job processing fails and mark_job_run also raises). Please add a test that sets up multiple due jobs, forces mark_job_run to raise in the except path, and asserts tick() continues processing remaining jobs without raising.

Two issues could cause all cron jobs to get stuck with next_run_at: null: 1. In mark_job_run(), compute_next_run() is called without exception protection. If it throws, save_jobs() is never reached and the job state is corrupted. 2. In tick(), when the primary mark_job_run call fails, the except block retries mark_job_run -- but if the retry also fails, the exception propagates and breaks processing of remaining due jobs. Fix: wrap compute_next_run in try/except to preserve the previous next_run_at on failure, and wrap the retry mark_job_run in tick() so one broken job cannot prevent others from running.

Copilot AI review requested due to automatic review settings April 12, 2026 09:35

Copilot started reviewing on behalf of jooray April 12, 2026 09:35 View session

Copilot AI reviewed Apr 12, 2026

View reviewed changes

jooray force-pushed the fix/cron-scheduler-resilience branch 3 times, most recently from df34afb to fb4d959 Compare April 20, 2026 09:13

jooray force-pushed the fix/cron-scheduler-resilience branch from fb4d959 to a8ed146 Compare April 22, 2026 10:42

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management labels Apr 22, 2026

jooray force-pushed the fix/cron-scheduler-resilience branch 2 times, most recently from 09cf8f6 to d7fd8ea Compare April 24, 2026 09:08

franksong2702 mentioned this pull request Apr 27, 2026

Recurring cron jobs are marked completed/off when croniter is missing #16265

Closed

jooray force-pushed the fix/cron-scheduler-resilience branch 2 times, most recently from 7b67863 to 3995f7c Compare May 1, 2026 13:07

jooray force-pushed the fix/cron-scheduler-resilience branch 2 times, most recently from fe21297 to d129b0f Compare May 10, 2026 08:03

jooray force-pushed the fix/cron-scheduler-resilience branch from d129b0f to a608f96 Compare May 18, 2026 10:55

jooray force-pushed the fix/cron-scheduler-resilience branch 3 times, most recently from b9da010 to 36bb48a Compare May 30, 2026 07:44

jooray force-pushed the fix/cron-scheduler-resilience branch from 36bb48a to 745bb30 Compare June 9, 2026 11:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cron): protect tick loop from cascading mark_job_run failures#8290

fix(cron): protect tick loop from cascading mark_job_run failures#8290
jooray wants to merge 1 commit into
NousResearch:mainfrom
jooray:fix/cron-scheduler-resilience

jooray commented Apr 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jooray commented Apr 12, 2026

Summary

Reproduction

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants