Two related robustness gaps in the cron subsystem became visible when ops scripts wrote directly into ~/.hermes/cron/jobs.json instead of going through add_job() / dashboard / API. Both manifested in the same incident; reporting them together since the fix is small and shares one PR.
Bug 1 — kind: cron / kind: interval jobs with next_run_at: null are silently skipped forever
Symptom: Job appears in jobs.json with enabled: true, state: scheduled, next_run_at: null, last_run_at: null indefinitely. Other crons fire normally. No log entry indicates it's being skipped.
Cause: cron/jobs.py get_due_jobs() (around the loop at L794–L834) only attempts recovery via _recoverable_oneshot_run_at(), which is hard-gated to kind: once. For recurring kinds, the helper returns None → continue → job is silently skipped on every tick. The loader assumes the only path into jobs.json is add_job(), which populates next_run_at via compute_next_run() at line 526. Any external writer (jq, a migration script, the dashboard's REST patch endpoint that forgets to set the field, etc.) that creates a recurring entry without that field leaves the job unfireable.
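For anyone who wants to check whether an existing deployment is affected before the fix lands, a minimal sketch of an audit pass follows. It works on an in-memory list of job dicts shaped like the one passed to save_jobs() in the repro; the on-disk layout of jobs.json is not described in this report, so loading is left out. find_unschedulable_jobs is a hypothetical helper name, not part of cron/jobs.py.

```python
from typing import Any, Dict, List

# Schedule kinds that this report identifies as never being recovered.
RECURRING_KINDS = ("cron", "interval")


def find_unschedulable_jobs(jobs: List[Dict[str, Any]]) -> List[str]:
    """Return ids of enabled recurring jobs whose next_run_at is missing.

    Hypothetical helper for illustration; it mirrors the condition under
    which get_due_jobs() skips a job without logging anything.
    """
    broken = []
    for job in jobs:
        if not job.get("enabled", True):
            continue
        schedule = job.get("schedule")
        # Tolerate a malformed schedule rather than crash the audit itself.
        kind = schedule.get("kind") if isinstance(schedule, dict) else None
        if kind in RECURRING_KINDS and not job.get("next_run_at"):
            broken.append(job.get("id", "<no id>"))
    return broken


sample_jobs = [
    {"id": "ok", "enabled": True,
     "schedule": {"kind": "cron", "expr": "0 12 * * *"},
     "next_run_at": "2026-05-03T12:00:00+00:00"},
    {"id": "repro", "enabled": True,
     "schedule": {"kind": "cron", "expr": "0 12 * * *"},
     "next_run_at": None},
    # One-shot jobs already have their own recovery path, so they are not flagged.
    {"id": "oneshot", "enabled": True,
     "schedule": {"kind": "once"},
     "next_run_at": None},
]

print(find_unschedulable_jobs(sample_jobs))  # prints ['repro']
```

Only the recurring job with a null next_run_at is reported; the one-shot entry is left to the existing grace-window helper.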
Repro:
from cron.jobs import save_jobs, get_due_jobs, get_job

save_jobs([{
    "id": "repro",
    "name": "AI Daily Digest",
    "prompt": "...",
    "schedule": {"kind": "cron", "expr": "0 12 * * *", "display": "0 12 * * *"},
    "schedule_display": "0 12 * * *",
    "repeat": {"times": None, "completed": 0},
    "enabled": True,
    "state": "scheduled",
    "next_run_at": None, "last_run_at": None, "last_status": None,
    "last_error": None, "deliver": "local", "origin": None,
}])
get_due_jobs()  # returns []; next_run_at is still None and stays None forever
Fix: when the schedule is cron / interval and next_run_at is missing, recompute via compute_next_run(schedule, now.isoformat()) instead of returning None. The existing one-shot grace-window path is untouched. Patch + tests below.
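The recovery rule can be illustrated in isolation. The sketch below is not the real code path: compute_next_run_stub is a hypothetical stand-in for compute_next_run() that only understands interval schedules, and the "minutes" field on the schedule is an assumption made for the example. It only shows the shape of the branch: a recurring job with no next_run_at gets a fresh value computed from "now" instead of being skipped.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def compute_next_run_stub(schedule: Dict[str, Any], after_iso: str) -> Optional[str]:
    """Hypothetical stand-in for compute_next_run(); interval schedules only.

    Assumes an illustrative ``minutes`` field; the real schedule shape
    for interval jobs is not shown in this report.
    """
    if schedule.get("kind") != "interval":
        return None
    after = datetime.fromisoformat(after_iso)
    return (after + timedelta(minutes=schedule["minutes"])).isoformat()


def recover_next_run(job: Dict[str, Any], now: datetime) -> Optional[str]:
    """Recompute next_run_at for a recurring job that lost it.

    One-shot recovery is deliberately left out of this sketch.
    """
    schedule = job.get("schedule", {})
    if schedule.get("kind") in ("cron", "interval"):
        return compute_next_run_stub(schedule, now.isoformat())
    return None


now = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
job = {"id": "repro", "schedule": {"kind": "interval", "minutes": 30},
       "next_run_at": None}

recovered = recover_next_run(job, now)
if recovered:
    job["next_run_at"] = recovered

print(job["next_run_at"])  # prints 2026-05-02T12:30:00+00:00
```

The recovered time is computed relative to the current tick, so a job that was stuck does not fire immediately; it resumes at its next scheduled slot.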
Bug 2 — _resolve_origin crashes with 'str' object has no attribute 'get' when origin is a string
Symptom:
ERROR cron.scheduler: Error processing job <id>: 'str' object has no attribute 'get'
…on every fire attempt. Job's last_status: error, last_error: "'str' object has no attribute 'get'". mark_job_run does record the failure, but every subsequent fire crashes the same way until origin is fixed manually.
Cause: cron/scheduler.py:127 _resolve_origin() does origin.get("platform") on whatever job.get("origin") returns. The function checks if not origin (falsy short-circuit), but a non-empty string passes that guard and then hits AttributeError. In practice this happened because a migration script tagged jobs with a free-form provenance string (e.g. "combined-digest-replaces-x-ai-and-email-triage-20260503") instead of either null or {platform, chat_id}.
Fix: add isinstance(origin, dict) guard; non-dict origin (string, list, int…) is treated the same as missing origin. Patch + tests below.
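To make the failure mode concrete, here is a small self-contained comparison. Both functions are local copies written for illustration (resolve_origin_unguarded and resolve_origin_guarded are hypothetical names), and the "telegram" platform value is just sample data. The first copy has only the falsy check; the second adds the isinstance guard.

```python
from typing import Optional


def resolve_origin_unguarded(job: dict) -> Optional[dict]:
    """Illustrative copy with only the falsy check."""
    origin = job.get("origin")
    if not origin:
        return None
    platform = origin.get("platform")  # AttributeError if origin is a str
    chat_id = origin.get("chat_id")
    return origin if platform and chat_id else None


def resolve_origin_guarded(job: dict) -> Optional[dict]:
    """Illustrative copy with the isinstance guard added."""
    origin = job.get("origin")
    if not origin or not isinstance(origin, dict):
        return None
    platform = origin.get("platform")
    chat_id = origin.get("chat_id")
    return origin if platform and chat_id else None


string_job = {"origin": "combined-digest-replaces-x-ai-and-email-triage-20260503"}
dict_job = {"origin": {"platform": "telegram", "chat_id": "1"}}  # sample values

try:
    resolve_origin_unguarded(string_job)
except AttributeError as exc:
    print(exc)  # prints 'str' object has no attribute 'get'

print(resolve_origin_guarded(string_job))  # prints None
print(resolve_origin_guarded(dict_job) == dict_job["origin"])  # prints True
```

A non-empty string is truthy, so it passes `if not origin` and reaches `.get()`. The guarded copy returns None for any non-dict value and still returns a well-formed dict origin unchanged.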
Patch
--- a/cron/jobs.py
+++ b/cron/jobs.py
@@ -795,17 +795,32 @@ def get_due_jobs() -> List[Dict[str, Any]]:
         if not job.get("enabled", True):
             continue
         next_run = job.get("next_run_at")
         if not next_run:
+            schedule = job.get("schedule", {})
+            kind = schedule.get("kind")
+
+            # One-shot jobs use a small grace window via the dedicated helper.
             recovered_next = _recoverable_oneshot_run_at(
-                job.get("schedule", {}),
+                schedule,
                 now,
                 last_run_at=job.get("last_run_at"),
             )
+            recovery_kind = "one-shot" if recovered_next else None
+
+            # Recurring jobs (cron / interval) reach here only when something
+            # — typically a direct jobs.json edit that bypassed add_job() —
+            # left next_run_at unset. Without this branch, such jobs are
+            # silently skipped forever; recompute next_run_at from the
+            # schedule so they pick up at their next scheduled tick.
+            if not recovered_next and kind in ("cron", "interval"):
+                recovered_next = compute_next_run(schedule, now.isoformat())
+                if recovered_next:
+                    recovery_kind = kind
+
             if not recovered_next:
                 continue
             job["next_run_at"] = recovered_next
             next_run = recovered_next
             logger.info(
-                "Job '%s' had no next_run_at; recovering one-shot run at %s",
+                "Job '%s' had no next_run_at; recovering %s run at %s",
                 job.get("name", job["id"]),
+                recovery_kind,
                 recovered_next,
             )
--- a/cron/scheduler.py
+++ b/cron/scheduler.py
@@ -123,11 +123,18 @@ class _OutboundContextStub:
 def _resolve_origin(job: dict) -> Optional[dict]:
-    """Extract origin info from a job, preserving any extra routing metadata."""
+    """Extract origin info from a job, preserving any extra routing metadata.
+
+    ``origin`` is expected to be either ``None`` or a dict shaped like
+    ``{"platform": ..., "chat_id": ..., "thread_id": ...}``. Tolerate
+    other shapes (most commonly: a free-form string identifier left by
+    a script that wrote jobs.json directly) by returning ``None`` rather
+    than crashing the whole tick with ``AttributeError``.
+    """
     origin = job.get("origin")
-    if not origin:
+    if not origin or not isinstance(origin, dict):
         return None
     platform = origin.get("platform")
     chat_id = origin.get("chat_id")
     if platform and chat_id:
         return origin
     return None
New tests
tests/cron/test_jobs.py::TestGetDueJobs:
test_broken_cron_without_next_run_is_recovered — cron-kind null next_run_at gets recomputed
test_broken_interval_without_next_run_is_recovered — same for interval
tests/cron/test_scheduler.py::TestResolveOrigin:
test_string_origin_is_tolerated — string origin returns None, no crash
test_non_dict_origin_is_tolerated — list/int origin returns None
All 289 existing cron tests still pass sequentially. (Two parallel-mode flakes under xdist are pre-existing and unrelated; same tests pass in isolation.)
Environment
- hermes-agent commit: upstream/main as of 2026-05-02
- Python 3.14, croniter installed
- Encountered on a Docker deployment (Linux Debian, container running upstream image)