cron: jobs with null next_run_at silently skipped; non-dict origin crashes ticker #18722

Two related robustness gaps in the cron subsystem became visible when ops scripts wrote directly into `~/.hermes/cron/jobs.json` instead of going through `add_job()` / dashboard / API. Both manifested in the same incident; reporting them together since the fix is small and shares one PR.

## Bug 1 — `kind: cron` / `kind: interval` jobs with `next_run_at: null` are silently skipped forever

**Symptom**: Job appears in `jobs.json` with `enabled: true`, `state: scheduled`, `next_run_at: null`, `last_run_at: null` indefinitely. Other crons fire normally. No log entry indicates it's being skipped.

**Cause**: `cron/jobs.py` `get_due_jobs()` (around the loop at L794–L834) only attempts recovery via `_recoverable_oneshot_run_at()`, which is hard-gated to `kind: once`. For recurring kinds the helper returns `None`, the loop hits `continue`, and the job is skipped on every tick without any log line. The loader assumes the only path into `jobs.json` is `add_job()`, which populates `next_run_at` via `compute_next_run()` at line 526. Any external writer (jq, a migration script, the dashboard's REST patch endpoint that forgets to set the field, etc.) that creates a recurring entry without that field leaves the job unfireable.

**Repro**:
```python
from cron.jobs import save_jobs, get_due_jobs

# Write a recurring job the way an external script would: the schedule is
# valid, but next_run_at was never populated via compute_next_run().
save_jobs([{
    "id": "repro",
    "name": "AI Daily Digest",
    "prompt": "...",
    "schedule": {"kind": "cron", "expr": "0 12 * * *", "display": "0 12 * * *"},
    "schedule_display": "0 12 * * *",
    "repeat": {"times": None, "completed": 0},
    "enabled": True,
    "state": "scheduled",
    "next_run_at": None, "last_run_at": None, "last_status": None,
    "last_error": None, "deliver": "local", "origin": None,
}])

# Returns [] on every tick; next_run_at in jobs.json stays None forever.
get_due_jobs()
```

**Fix**: when the schedule is `cron` / `interval` and `next_run_at` is missing, recompute via `compute_next_run(schedule, now.isoformat())` instead of returning `None`. The existing one-shot grace-window path is untouched. Patch + tests below.
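
To make the recovery rule concrete without importing the real module, here is a minimal standalone sketch. `stub_compute_next_run` and `recover_next_run` are hypothetical names, and the `minutes` field on the interval schedule is an assumed shape used only for illustration; the real code calls `compute_next_run(schedule, now.isoformat())` and also handles `cron` expressions.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def stub_compute_next_run(schedule: Dict[str, Any], base_iso: str) -> Optional[str]:
    """Hypothetical stand-in for compute_next_run(); handles only 'interval'."""
    if schedule.get("kind") != "interval":
        return None
    base = datetime.fromisoformat(base_iso)
    # "minutes" is an assumed field name, not the real jobs.json schema.
    return (base + timedelta(minutes=schedule["minutes"])).isoformat()


def recover_next_run(job: Dict[str, Any], now: datetime) -> Optional[str]:
    """Return next_run_at, recomputing it for recurring jobs that lack one."""
    if job.get("next_run_at"):
        return job["next_run_at"]
    schedule = job.get("schedule") or {}
    if schedule.get("kind") in ("cron", "interval"):
        return stub_compute_next_run(schedule, now.isoformat())
    # One-shot jobs go through a separate grace-window helper in the real code.
    return None


now = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
job = {
    "id": "repro",
    "schedule": {"kind": "interval", "minutes": 30},
    "next_run_at": None,
}
job["next_run_at"] = recover_next_run(job, now)
print(job["next_run_at"])  # 2026-05-02T12:30:00+00:00
```

Because the value is recomputed from `now`, a recovered job fires at its next scheduled tick rather than immediately, which matches the intent stated in the patch comment.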

## Bug 2 — `_resolve_origin` crashes with `'str' object has no attribute 'get'` when `origin` is a string

**Symptom**:
```
ERROR cron.scheduler: Error processing job <id>: 'str' object has no attribute 'get'
```
…on every fire attempt. The job ends up with `last_status: error` and `last_error: "'str' object has no attribute 'get'"`. `mark_job_run` does record the failure, but every subsequent fire crashes the same way until `origin` is fixed manually.

**Cause**: `cron/scheduler.py:127` `_resolve_origin()` does `origin.get("platform")` on whatever `job.get("origin")` returns. The function checks `if not origin` (falsy short-circuit), but a non-empty string passes that guard and then hits `AttributeError`. In practice this happened because a migration script tagged jobs with a free-form provenance string (e.g. `"combined-digest-replaces-x-ai-and-email-triage-20260503"`) instead of either `null` or `{platform, chat_id}`.
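
**Repro** (standalone sketch): the snippet below uses a local copy of the pre-patch function body, reconstructed from the removed and context lines of the diff, so it runs without the real `cron.scheduler` module. The name `resolve_origin_prepatch` is hypothetical.

```python
from typing import Optional


def resolve_origin_prepatch(job: dict) -> Optional[dict]:
    """Local copy of the pre-patch logic, reconstructed from the diff."""
    origin = job.get("origin")
    if not origin:  # filters only falsy values: None, "", {}, [] ...
        return None
    platform = origin.get("platform")  # AttributeError for a non-empty str
    chat_id = origin.get("chat_id")
    if platform and chat_id:
        return origin
    return None


job = {
    "id": "repro",
    "origin": "combined-digest-replaces-x-ai-and-email-triage-20260503",
}
try:
    resolve_origin_prepatch(job)
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)
```

An empty string or `None` is caught by the falsy check, which is why the crash appears only once a script writes a non-empty provenance tag.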

**Fix**: add an `isinstance(origin, dict)` guard; a non-dict origin (string, list, int…) is treated the same as a missing origin. Patch + tests below.

---

## Patch

```diff
--- a/cron/jobs.py
+++ b/cron/jobs.py
@@ -795,20 +795,37 @@ def get_due_jobs() -> List[Dict[str, Any]]:
         if not job.get("enabled", True):
             continue

         next_run = job.get("next_run_at")
         if not next_run:
+            schedule = job.get("schedule", {})
+            kind = schedule.get("kind")
+
+            # One-shot jobs use a small grace window via the dedicated helper.
             recovered_next = _recoverable_oneshot_run_at(
-                job.get("schedule", {}),
+                schedule,
                 now,
                 last_run_at=job.get("last_run_at"),
             )
+            recovery_kind = "one-shot" if recovered_next else None
+
+            # Recurring jobs (cron / interval) reach here only when something
+            # — typically a direct jobs.json edit that bypassed add_job() —
+            # left next_run_at unset.  Without this branch, such jobs are
+            # silently skipped forever; recompute next_run_at from the
+            # schedule so they pick up at their next scheduled tick.
+            if not recovered_next and kind in ("cron", "interval"):
+                recovered_next = compute_next_run(schedule, now.isoformat())
+                if recovered_next:
+                    recovery_kind = kind
+
             if not recovered_next:
                 continue

             job["next_run_at"] = recovered_next
             next_run = recovered_next
             logger.info(
-                "Job '%s' had no next_run_at; recovering one-shot run at %s",
+                "Job '%s' had no next_run_at; recovering %s run at %s",
                 job.get("name", job["id"]),
+                recovery_kind,
                 recovered_next,
             )
```

```diff
--- a/cron/scheduler.py
+++ b/cron/scheduler.py
@@ -123,11 +123,18 @@ class _OutboundContextStub:

 def _resolve_origin(job: dict) -> Optional[dict]:
-    """Extract origin info from a job, preserving any extra routing metadata."""
+    """Extract origin info from a job, preserving any extra routing metadata.
+
+    ``origin`` is expected to be either ``None`` or a dict shaped like
+    ``{"platform": ..., "chat_id": ..., "thread_id": ...}``.  Tolerate
+    other shapes (most commonly: a free-form string identifier left by
+    a script that wrote jobs.json directly) by returning ``None`` rather
+    than crashing the whole tick with ``AttributeError``.
+    """
     origin = job.get("origin")
-    if not origin:
+    if not origin or not isinstance(origin, dict):
         return None
     platform = origin.get("platform")
     chat_id = origin.get("chat_id")
     if platform and chat_id:
         return origin
     return None
```

## New tests

`tests/cron/test_jobs.py::TestGetDueJobs`:

- `test_broken_cron_without_next_run_is_recovered` — cron-kind null next_run_at gets recomputed
- `test_broken_interval_without_next_run_is_recovered` — same for interval

`tests/cron/test_scheduler.py::TestResolveOrigin`:

- `test_string_origin_is_tolerated` — string origin returns None, no crash
- `test_non_dict_origin_is_tolerated` — list/int origin returns None

All 289 existing cron tests still pass when run sequentially. (Two parallel-mode flakes under xdist are pre-existing and unrelated; the same tests pass in isolation.)

## Environment
- hermes-agent commit: `upstream/main` as of 2026-05-02
- Python 3.14, croniter installed
- Encountered on a Docker deployment (Linux Debian, container running upstream image)

