cron: jobs with null next_run_at silently skipped; non-dict origin crashes ticker #18722

@liyoungc

Description

Two related robustness gaps in the cron subsystem became visible when ops scripts wrote directly into ~/.hermes/cron/jobs.json instead of going through add_job() / dashboard / API. Both manifested in the same incident; reporting them together since the fix is small and shares one PR.

Bug 1 — kind: cron / kind: interval jobs with next_run_at: null are silently skipped forever

Symptom: Job appears in jobs.json with enabled: true, state: scheduled, next_run_at: null, last_run_at: null indefinitely. Other crons fire normally. No log entry indicates it's being skipped.

Cause: cron/jobs.py get_due_jobs() (around the loop at L794–L834) only attempts recovery via _recoverable_oneshot_run_at(), which is hard-gated to kind: once. For recurring kinds, the helper returns None → continue → job is silently skipped on every tick. The loader assumes the only path into jobs.json is add_job(), which populates next_run_at via compute_next_run() at line 526. Any external writer (jq, a migration script, the dashboard's REST patch endpoint that forgets to set the field, etc.) that creates a recurring entry without that field leaves the job unfireable.
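
Until the fix lands, affected entries can be spotted from outside the scheduler. The following is a minimal sketch and not part of the patch: find_unfireable_jobs is a hypothetical helper, and it takes an already-loaded list of job dicts because the on-disk layout of jobs.json is not assumed here. It only uses the fields shown in this report (enabled, schedule.kind, next_run_at).

```python
from typing import Any, Dict, List

# Schedule kinds that get no recovery path in the unpatched get_due_jobs().
RECURRING_KINDS = ("cron", "interval")


def find_unfireable_jobs(jobs: List[Dict[str, Any]]) -> List[str]:
    """Return ids of enabled recurring jobs that have no next_run_at.

    Hypothetical audit helper; it does not call into cron.jobs.
    """
    suspect: List[str] = []
    for job in jobs:
        if not job.get("enabled", True):
            continue  # disabled jobs are skipped by design
        schedule = job.get("schedule")
        kind = schedule.get("kind") if isinstance(schedule, dict) else None
        if kind in RECURRING_KINDS and not job.get("next_run_at"):
            suspect.append(job.get("id", "<no id>"))
    return suspect


# Sample data made up for illustration.
sample_jobs = [
    {"id": "healthy", "enabled": True,
     "schedule": {"kind": "cron", "expr": "0 12 * * *"},
     "next_run_at": "2026-05-03T12:00:00+00:00"},
    {"id": "repro", "enabled": True,
     "schedule": {"kind": "cron", "expr": "0 12 * * *"},
     "next_run_at": None},
    {"id": "disabled", "enabled": False,
     "schedule": {"kind": "interval"},
     "next_run_at": None},
]

print(find_unfireable_jobs(sample_jobs))
```

Only the "repro" entry is reported: it is enabled, recurring, and has no next_run_at, which is exactly the combination that never fires.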

Repro:

from cron.jobs import save_jobs, get_due_jobs, get_job
save_jobs([{
    "id": "repro",
    "name": "AI Daily Digest",
    "prompt": "...",
    "schedule": {"kind": "cron", "expr": "0 12 * * *", "display": "0 12 * * *"},
    "schedule_display": "0 12 * * *",
    "repeat": {"times": None, "completed": 0},
    "enabled": True,
    "state": "scheduled",
    "next_run_at": None, "last_run_at": None, "last_status": None,
    "last_error": None, "deliver": "local", "origin": None,
}])
get_due_jobs()  # returns [], next_run_at still None — and stays None forever

Fix: when the schedule is cron / interval and next_run_at is missing, recompute via compute_next_run(schedule, now.isoformat()) instead of returning None. The existing one-shot grace-window path is untouched. Patch + tests below.
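
The recovery rule can be sketched in isolation. This is an illustration under stated assumptions, not the patched code: compute_next_run_stub is a hypothetical stand-in for compute_next_run() that handles only the interval kind (the real function also handles cron expressions), the "minutes" schedule field is an assumed name, and the one-shot grace-window helper is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def compute_next_run_stub(schedule: Dict[str, Any], base_iso: str) -> Optional[str]:
    """Hypothetical stand-in for compute_next_run(); interval kind only."""
    if schedule.get("kind") != "interval":
        return None
    base = datetime.fromisoformat(base_iso)
    # "minutes" is an assumed field name used only in this sketch.
    return (base + timedelta(minutes=schedule["minutes"])).isoformat()


def recover_next_run(job: Dict[str, Any], now: datetime) -> Optional[str]:
    """Return next_run_at, recomputing it for recurring jobs that lost it."""
    if job.get("next_run_at"):
        return job["next_run_at"]
    schedule = job.get("schedule", {})
    if schedule.get("kind") in ("cron", "interval"):
        # Recurring job with no next_run_at: recompute instead of skipping.
        return compute_next_run_stub(schedule, now.isoformat())
    return None  # one-shot recovery is handled elsewhere and omitted here


now = datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc)
broken = {"id": "repro", "schedule": {"kind": "interval", "minutes": 30},
          "next_run_at": None}
recovered = recover_next_run(broken, now)
print(recovered)  # 2026-05-03T12:30:00+00:00
```

The job is not fired immediately; it simply regains a next_run_at and picks up at its next scheduled tick.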

Bug 2 — _resolve_origin crashes with 'str' object has no attribute 'get' when origin is a string

Symptom:

ERROR cron.scheduler: Error processing job <id>: 'str' object has no attribute 'get'

…on every fire attempt. Job's last_status: error, last_error: "'str' object has no attribute 'get'". mark_job_run does record the failure, but every subsequent fire crashes the same way until origin is fixed manually.

Cause: cron/scheduler.py:127 _resolve_origin() does origin.get("platform") on whatever job.get("origin") returns. The function checks if not origin (falsy short-circuit), but a non-empty string passes that guard and then hits AttributeError. In practice this happened because a migration script tagged jobs with a free-form provenance string (e.g. "combined-digest-replaces-x-ai-and-email-triage-20260503") instead of either null or {platform, chat_id}.

Fix: add isinstance(origin, dict) guard; non-dict origin (string, list, int…) is treated the same as missing origin. Patch + tests below.
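
The difference between the two guards can be shown with standalone copies of the function body. This is a sketch: the two function names are made up for the comparison, and the platform and chat_id values are placeholder sample data.

```python
from typing import Optional


def resolve_origin_unguarded(job: dict) -> Optional[dict]:
    """Pre-patch shape: only a falsy check before calling .get()."""
    origin = job.get("origin")
    if not origin:
        return None
    if origin.get("platform") and origin.get("chat_id"):
        return origin
    return None


def resolve_origin_guarded(job: dict) -> Optional[dict]:
    """Post-patch shape: anything that is not a dict counts as missing."""
    origin = job.get("origin")
    if not origin or not isinstance(origin, dict):
        return None
    if origin.get("platform") and origin.get("chat_id"):
        return origin
    return None


string_job = {"origin": "free-form-provenance-tag"}

try:
    resolve_origin_unguarded(string_job)
    crashed = False
except AttributeError:
    crashed = True  # 'str' object has no attribute 'get'

valid_origin = {"platform": "example-platform", "chat_id": "12345"}

print(crashed)                                           # True
print(resolve_origin_guarded(string_job))                # None
print(resolve_origin_guarded({"origin": [1, 2]}))        # None
print(resolve_origin_guarded({"origin": valid_origin}))  # the dict itself
```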


Patch

--- a/cron/jobs.py
+++ b/cron/jobs.py
@@ -795,20 +795,37 @@ def get_due_jobs() -> List[Dict[str, Any]]:
         if not job.get("enabled", True):
             continue

         next_run = job.get("next_run_at")
         if not next_run:
+            schedule = job.get("schedule", {})
+            kind = schedule.get("kind")
+
+            # One-shot jobs use a small grace window via the dedicated helper.
             recovered_next = _recoverable_oneshot_run_at(
-                job.get("schedule", {}),
+                schedule,
                 now,
                 last_run_at=job.get("last_run_at"),
             )
+            recovery_kind = "one-shot" if recovered_next else None
+
+            # Recurring jobs (cron / interval) reach here only when something
+            # — typically a direct jobs.json edit that bypassed add_job() —
+            # left next_run_at unset.  Without this branch, such jobs are
+            # silently skipped forever; recompute next_run_at from the
+            # schedule so they pick up at their next scheduled tick.
+            if not recovered_next and kind in ("cron", "interval"):
+                recovered_next = compute_next_run(schedule, now.isoformat())
+                if recovered_next:
+                    recovery_kind = kind
+
             if not recovered_next:
                 continue

             job["next_run_at"] = recovered_next
             next_run = recovered_next
             logger.info(
-                "Job '%s' had no next_run_at; recovering one-shot run at %s",
+                "Job '%s' had no next_run_at; recovering %s run at %s",
                 job.get("name", job["id"]),
+                recovery_kind,
                 recovered_next,
             )
--- a/cron/scheduler.py
+++ b/cron/scheduler.py
@@ -123,11 +123,18 @@ class _OutboundContextStub:

 def _resolve_origin(job: dict) -> Optional[dict]:
-    """Extract origin info from a job, preserving any extra routing metadata."""
+    """Extract origin info from a job, preserving any extra routing metadata.
+
+    ``origin`` is expected to be either ``None`` or a dict shaped like
+    ``{"platform": ..., "chat_id": ..., "thread_id": ...}``.  Tolerate
+    other shapes (most commonly: a free-form string identifier left by
+    a script that wrote jobs.json directly) by returning ``None`` rather
+    than crashing the whole tick with ``AttributeError``.
+    """
     origin = job.get("origin")
-    if not origin:
+    if not origin or not isinstance(origin, dict):
         return None
     platform = origin.get("platform")
     chat_id = origin.get("chat_id")
     if platform and chat_id:
         return origin
     return None

New tests

tests/cron/test_jobs.py::TestGetDueJobs:

  • test_broken_cron_without_next_run_is_recovered — cron-kind null next_run_at gets recomputed
  • test_broken_interval_without_next_run_is_recovered — same for interval

tests/cron/test_scheduler.py::TestResolveOrigin:

  • test_string_origin_is_tolerated — string origin returns None, no crash
  • test_non_dict_origin_is_tolerated — list/int origin returns None

All 289 existing cron tests still pass sequentially. (Two parallel-mode flakes under xdist are pre-existing and unrelated; same tests pass in isolation.)

Environment

  • hermes-agent commit: upstream/main as of 2026-05-02
  • Python 3.14, croniter installed
  • Encountered on a Docker deployment (Linux Debian, container running upstream image)

Metadata

Labels

  • P1: High (major feature broken, no workaround)
  • comp/cron: Cron scheduler and job management
  • type/bug: Something isn't working
