feat(cron): run independent cron jobs in parallel with safe non-overlap by Rubirub · Pull Request #7158 · NousResearch/hermes-agent

Rubirub · 2026-04-10T12:10:13Z

Summary

- run independent cron jobs in parallel instead of serializing scheduler execution
- preserve per-job non-overlap with claim-time ownership and job-specific locking
- recover orphaned in-flight jobs conservatively after gateway restarts while preserving timeout fallback and legacy owner compatibility
- keep recovery, claim ordering, and run finalization semantics consistent with the existing cron model

Design

The scheduler still performs recovery before claim on each tick. Claimed work now carries owner metadata so different jobs can run concurrently while the same job remains protected from overlap. If a gateway instance dies after claiming work, the next instance can recover definitely orphaned claims after a short grace period. When owner liveness is unclear, behavior stays conservative and falls back to timeout-based recovery.
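
To make the claim step concrete, here is a minimal sketch of claiming due jobs with owner metadata. It is not the PR's implementation: the in-memory `jobs` dict, the function name `claim_due_jobs_sketch`, and the `next_run_at` and `claimed_at` fields are assumptions for illustration; only `in_flight`, `run_id`, `owner_instance_id`, and `owner_pid` are names taken from this thread.

```python
import os
import threading
import uuid
from typing import Dict, List

# Stand-in for the short global scheduler lock (assumption: the real store
# is persisted on disk and locked differently).
_store_lock = threading.Lock()


def claim_due_jobs_sketch(jobs: Dict[str, dict], now: float, instance_id: str) -> List[dict]:
    """Claim every due job that is not already in flight.

    Only the metadata transition happens under the lock; the caller runs
    the claimed jobs after this function returns, outside the lock.
    """
    claimed = []
    with _store_lock:
        for job_id, job in jobs.items():
            if job.get("in_flight") is not None:
                continue  # this job is already running: no self-overlap
            if job["next_run_at"] > now:
                continue  # not due yet
            job["in_flight"] = {
                "run_id": uuid.uuid4().hex,
                "owner_instance_id": instance_id,
                "owner_pid": os.getpid(),
                "claimed_at": now,
            }
            claimed.append({"job_id": job_id, **job["in_flight"]})
    return claimed
```

Because each claim is recorded per job, two different due jobs are both returned in one tick, while a second tick finds them already in flight and claims nothing.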

Safety properties

- different jobs may run concurrently
- the same job cannot overlap with itself
- stale completions are discarded if they no longer own the claim
- repeat accounting and run finalization continue to use the shared outcome path
- legacy persisted owner metadata still falls back safely
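
The "stale completions are discarded" property can be sketched as a compare-before-write on the claim's `run_id`. The function name `finalize_run_sketch` and the `last_status` field are hypothetical, and in a real store this check would run under the scheduler's metadata lock.

```python
def finalize_run_sketch(job: dict, run_id: str, status: str) -> bool:
    """Record the outcome only if run_id still owns the in-flight claim.

    Returns False for a stale completion, i.e. when the claim was already
    cleared or has been re-claimed by a different run.
    """
    claim = job.get("in_flight")
    if claim is None or claim.get("run_id") != run_id:
        return False  # stale: recovery or another run took over
    job["in_flight"] = None
    job["last_status"] = status
    return True
```

A worker that finishes after its claim was recovered gets `False` back and leaves the job record untouched.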

Test Plan

python -m pytest tests/cron/test_jobs.py tests/cron/test_scheduler.py tests/hermes_cli/test_cron.py tests/tools/test_cronjob_tools.py -q -o addopts=''

Issue links

MiraiChino · 2026-04-12T01:36:21Z

This is a great PR — the orphan recovery mechanism is exactly what cron needs to handle gateway restarts cleanly. The per-job locking and in-flight tracking are solid.

One suggestion to make it even better: _linux_pid_is_alive() does not work on macOS.

The Problem

The function gates on sys.platform.startswith("linux") at the top:

if not sys.platform.startswith("linux") or pid <= 0:
    return None

On macOS (sys.platform == "darwin"), this always returns None. The orphan recovery can never detect that an owner process has died, so it falls back to waiting for the full HERMES_CRON_TIMEOUT expiry — which can be hours for long-running automation jobs.

Why This Matters

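os.kill(pid, 0) is POSIX-standard and works on macOS. With signal 0 no signal is actually delivered; the call only performs the existence and permission checks, so it probes liveness without affecting the process:

- Process alive → returns normally
- Process dead → raises ProcessLookupError (an OSError subclass with errno ESRCH)
- Process alive but owned by another user → raises PermissionError (EPERM)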

The only Linux-specific part is the zombie state check via /proc/{pid}/stat, which should remain guarded.

The Fix

 def _linux_pid_is_alive(pid: int) -> Optional[bool]:
-    if not sys.platform.startswith("linux") or pid <= 0:
+    if pid <= 0:
         return None
     try:
         os.kill(pid, 0)
     ...
-    if alive:
+    if alive and sys.platform.startswith("linux"):
         state = _linux_process_state(pid)
         if state in {"Z", "X", "x"}:
             return False
     return alive

Two changes:

- Allow os.kill(pid, 0) on all POSIX platforms (macOS included)
- Keep /proc zombie check Linux-only
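
For reference, a self-contained sketch of a liveness probe along these lines. It does not reproduce the elided part of the diff above: the names `pid_is_alive` and `read_linux_process_state`, the handling of `PermissionError`, and the `os.name` guard are assumptions. The guard is there because on Windows `os.kill()` with an arbitrary signal value terminates the target process instead of probing it.

```python
import os
import sys
from typing import Optional


def read_linux_process_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat", "r") as fh:
            data = fh.read()
    except OSError:
        return None
    # The command name is parenthesised and may itself contain spaces or ")",
    # so split on the last ")" before reading the state field.
    _, _, rest = data.rpartition(")")
    fields = rest.split()
    return fields[0] if fields else None


def pid_is_alive(pid: int) -> Optional[bool]:
    """True/False when liveness is known, None when it cannot be determined."""
    if os.name != "posix" or pid <= 0:
        return None
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    except OSError:
        return None  # unclear: let the caller fall back to timeout recovery
    if sys.platform.startswith("linux"):
        # Zombie or dead entries still answer signal 0 on Linux.
        if read_linux_process_state(pid) in {"Z", "X", "x"}:
            return False
    return True
```

Returning `None` rather than guessing keeps the conservative behavior described in the PR: when liveness is unclear, recovery waits for the timeout.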

Testing

Verified on macOS with a stuck cron job — orphan recovery now detects dead PIDs immediately instead of waiting for timeout. On Linux, behavior is unchanged.

Rubirub · 2026-04-12T08:02:19Z

@MiraiChino
Good catch. You were right that _linux_pid_is_alive() was effectively Linux-only and made macOS fall back to timeout-based orphan recovery.

I fixed that issue, and while touching the ownership path I also extended the owner fingerprinting so macOS now participates in the full claim/recovery flow instead of only getting dead-PID detection.

What changed:

- PID liveness now uses os.kill(pid, 0) on macOS too
- Linux zombie-state handling stays Linux-only
- macOS now records/verifies owner boot/process fingerprints during claim/recovery as well

I also added tests covering:

- macOS PID liveness
- matching macOS owner identity
- macOS PID reuse / fingerprint mismatch
- claim metadata persistence on macOS
- early orphan recovery on macOS
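
The PID-reuse case is why a liveness probe alone is not enough: a live PID may belong to an unrelated process. A rough sketch of how a recorded fingerprint could be compared against what is observed now follows. The field names `owner_boot_id`, `owner_proc_start`, `boot_id`, and `proc_start`, and the function itself, are hypothetical; how the PR actually collects these values on macOS and Linux is not shown in this thread.

```python
from typing import Optional


def owner_is_alive(claim: dict, observed: dict, pid_alive: Optional[bool]) -> Optional[bool]:
    """Decide whether the process that made `claim` is still running.

    `observed` describes the host and the process that currently holds the
    recorded PID. Returns None whenever the evidence is inconclusive, so the
    caller can fall back to timeout-based recovery.
    """
    recorded_boot = claim.get("owner_boot_id")
    observed_boot = observed.get("boot_id")
    if recorded_boot and observed_boot and recorded_boot != observed_boot:
        return False  # host rebooted since the claim: owner cannot exist
    if pid_alive is False:
        return False  # nothing holds that PID any more
    if pid_alive is None:
        return None  # probe was inconclusive
    recorded_start = claim.get("owner_proc_start")
    observed_start = observed.get("proc_start")
    if recorded_start is None or observed_start is None:
        return None  # legacy or partial metadata: stay conservative
    # Same PID but a different start time means the PID was reused.
    return recorded_start == observed_start
```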

ratacat · 2026-04-25T18:37:47Z

Strong support for this PR. This is exactly the scheduler-side fix our incident needed.

We hit the production version of this failure mode on macOS/launchd: one openai-codex cron job wedged, the gateway kept holding ~/.hermes/cron/.tick.lock, unrelated cron jobs stopped running, and after restart some overdue jobs were fast-forwarded/skipped because they were outside the grace window. Moving job execution outside the global scheduler lock, with per-job ownership/non-overlap, is the right architectural boundary.

Things I especially like:

- short global lock only for scheduler metadata transitions
- persisted in_flight ownership with run_id
- finalization only if the same run_id still owns the claim
- output saved before finalization
- stale/orphan recovery rather than clearing claims at shutdown
- macOS owner-liveness/fingerprint coverage, which matters for launchd users

Small elegance suggestions, non-blocking:

- The current default cron.max_parallel_jobs: 1 is safe, but operationally it means a wedged worker can still consume the only dispatch slot. It no longer holds .tick.lock, which is a big improvement, but unrelated jobs may still starve in the same gateway process. Maybe document this explicitly in cron troubleshooting / config docs, or consider a default of 2 if the intent is to protect unrelated jobs from one stuck run.
- _current_owner_metadata() in scheduler.py currently reaches into platform-specific/private helpers from cron.jobs. Since claim_due_jobs() already knows how to fill platform fingerprints when metadata is None, an even cleaner boundary might be for the scheduler to pass only stable owner identity (owner_instance_id, owner_pid) and let cron.jobs own all platform fingerprinting. Not a correctness blocker, just a separation-of-concerns tweak.
- The comments around recovery semantics could call out that timeout/orphan recovery counts as a failed attempt and increments repeat accounting. That seems intentional and conservative, but it is an important operational behavior.
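
The starvation concern in the first suggestion can be reproduced with a small experiment. This sketch assumes jobs are dispatched to a bounded worker pool; whether the PR uses `ThreadPoolExecutor` is not stated in this thread, and `run_with_wedged_job` is a hypothetical helper.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_wedged_job(max_parallel_jobs: int, wait_seconds: float = 0.5) -> bool:
    """Return True if an unrelated job finishes while another job is wedged."""
    release = threading.Event()   # un-wedges the stuck job during cleanup
    started = threading.Event()   # set once the stuck job occupies a worker
    done = threading.Event()      # set by the unrelated, healthy job

    def wedged() -> None:
        started.set()
        release.wait()            # simulates a job that never returns

    def healthy() -> None:
        done.set()

    pool = ThreadPoolExecutor(max_workers=max_parallel_jobs)
    try:
        pool.submit(wedged)
        started.wait()
        pool.submit(healthy)
        # With a single slot the healthy job stays queued behind the wedged one.
        return done.wait(timeout=wait_seconds)
    finally:
        release.set()
        pool.shutdown(wait=True)
```

With one slot the healthy job does not run within the wait window; with two slots it completes while the first job is still stuck.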

Overall: this would have prevented the global scheduler-lock blast radius we saw, and paired with the Codex transport fix in #12953 it looks like the right full fix path.

alt-glitch · 2026-04-25T18:41:45Z

Related to #9965 — both implement parallel cron job execution. This PR appears more comprehensive with ownership metadata and orphan recovery.

Rubirub · 2026-04-26T09:14:08Z

@ratacat

Thanks — this is exactly the failure mode this PR is trying to contain.

On cron.max_parallel_jobs: agreed with the concern. The latest rewrite keeps upstream’s default None rather than forcing 1, so by default a wedged worker does not consume the only dispatch slot. Users can still explicitly set 1 if they want serial cron execution.

I also applied the separation-of-concerns suggestion: _current_owner_metadata() now only passes stable scheduler identity (owner_instance_id, owner_pid), and cron.jobs owns the platform-specific boot/process fingerprint enrichment when jobs are claimed.

Finally, I clarified stale/orphan recovery accounting in the job-store docs. Recovery is intentionally recorded as a failed run attempt: it clears the in-flight claim, stores an error status, and advances repeat accounting conservatively rather than silently rewinding the job as if the claimed run never happened.
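
As an illustration of that accounting, here is a minimal sketch in which recovery clears the claim, stores an error status, and counts the attempt. The `last_status`, `last_error`, and `repeat["completed"]` field names are assumptions, not the job store's actual schema.

```python
def recover_orphaned_claim(job: dict, reason: str) -> None:
    """Treat a recovered claim as a failed attempt instead of rewinding it."""
    if job.get("in_flight") is None:
        return  # nothing to recover; do not count an attempt twice
    job["in_flight"] = None
    job["last_status"] = "error"
    job["last_error"] = f"recovered stale claim: {reason}"
    repeat = job.get("repeat")
    if repeat is not None:
        # The claimed run counts against the repeat budget even though
        # its outcome is unknown.
        repeat["completed"] = repeat.get("completed", 0) + 1
```

Calling it a second time on the same job is a no-op, so a job recovered on one tick is not charged again on the next.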

Rubirub · 2026-04-26T09:19:04Z

@alt-glitch I have rebased onto the latest upstream changes and retested; the PR is ready to merge again.

This was referenced Apr 10, 2026

Long-running cron jobs block the scheduler tick loop #3752 (Closed)

fix: hermes update auto-restart kills in-process cron worker with no opt-out #6702 (Open)

Rubirub force-pushed the feat/cron-parallel-safe-non-overlap branch from 3811290 to fb37cc1 on April 12, 2026 08:08

alt-glitch added the labels type/feature (New feature or request), P2 (Medium: degraded but workaround exists), and comp/cron (Cron scheduler and job management) on Apr 25, 2026

Rubirub force-pushed the feat/cron-parallel-safe-non-overlap branch from fb37cc1 to d7de3bb on April 26, 2026 07:27

ARegalado1 mentioned this pull request May 3, 2026

fix(update): allow skipping gateway auto-restart #6740 (Closed)

Rubirub added 4 commits May 11, 2026 14:19

dc2b11b fix(cron): port ownership-safe parallel scheduler
3ea3cae fix(cron): keep owner fingerprinting in job store
5b5ced8 fix(cron): tolerate hot-upgraded auto delivery context
b3ccabe fix: preserve scoped stop and cron compatibility after rebase

Rubirub force-pushed the feat/cron-parallel-safe-non-overlap branch from 2b32adb to b3ccabe on May 11, 2026 16:38

Rubirub wants to merge 4 commits into NousResearch:main from Rubirub:feat/cron-parallel-safe-non-overlap