feat(cron): run independent cron jobs in parallel with safe non-overlap #7158

Open
Rubirub wants to merge 4 commits into NousResearch:main from Rubirub:feat/cron-parallel-safe-non-overlap

Conversation

@Rubirub

Rubirub commented Apr 10, 2026


Summary

  • run independent cron jobs in parallel instead of serializing scheduler execution
  • preserve per-job non-overlap with claim-time ownership and job-specific locking
  • recover orphaned in-flight jobs conservatively after gateway restarts while preserving timeout fallback and legacy owner compatibility
  • keep recovery, claim ordering, and run finalization semantics consistent with the existing cron model

Design

The scheduler still performs recovery before claim on each tick. Claimed work now carries owner metadata so different jobs can run concurrently while the same job remains protected from overlap. If a gateway instance dies after claiming work, the next instance can recover definitely orphaned claims after a short grace period. When owner liveness is unclear, behavior stays conservative and falls back to timeout-based recovery.

Safety properties

  • different jobs may run concurrently
  • the same job cannot overlap with itself
  • stale completions are discarded if they no longer own the claim
  • repeat accounting and run finalization continue to use the shared outcome path
  • legacy persisted owner metadata still falls back safely
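The first three properties can be illustrated with a small in-memory sketch. It is not the PR's implementation: `JobRecord`, `JobStore`, and every field name are hypothetical, and the persistence and locking a real store needs are left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JobRecord:
    """Hypothetical per-job state; field names are illustrative only."""
    job_id: str
    in_flight_run_id: Optional[str] = None
    last_status: Optional[str] = None


@dataclass
class JobStore:
    jobs: Dict[str, JobRecord] = field(default_factory=dict)

    def claim(self, job_id: str) -> Optional[str]:
        """Claim a job; return its run_id, or None if it is already in flight."""
        job = self.jobs[job_id]
        if job.in_flight_run_id is not None:
            return None  # the same job must not overlap with itself
        job.in_flight_run_id = uuid.uuid4().hex
        return job.in_flight_run_id

    def finalize(self, job_id: str, run_id: str, status: str) -> bool:
        """Record the outcome only if run_id still owns the claim."""
        job = self.jobs[job_id]
        if job.in_flight_run_id != run_id:
            return False  # stale completion: discard it
        job.in_flight_run_id = None
        job.last_status = status
        return True

    def release_claim(self, job_id: str) -> None:
        """Drop the current claim, as a recovery pass might."""
        self.jobs[job_id].in_flight_run_id = None
```

Claiming job `a` does not prevent claiming job `b`, a second claim on `a` returns `None`, and a `finalize` call carrying a `run_id` that no longer owns the claim is rejected.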

Test Plan

  • python -m pytest tests/cron/test_jobs.py tests/cron/test_scheduler.py tests/hermes_cli/test_cron.py tests/tools/test_cronjob_tools.py -q -o addopts=''

Issue links

@MiraiChino

MiraiChino commented Apr 12, 2026


This is a great PR — the orphan recovery mechanism is exactly what cron needs to handle gateway restarts cleanly. The per-job locking and in-flight tracking are solid.

One suggestion to make it even better: _linux_pid_is_alive() does not work on macOS.

The Problem

The function gates on sys.platform.startswith("linux") at the top:

if not sys.platform.startswith("linux") or pid <= 0:
    return None

On macOS (sys.platform == "darwin"), this always returns None. The orphan recovery can never detect that an owner process has died, so it falls back to waiting for the full HERMES_CRON_TIMEOUT expiry — which can be hours for long-running automation jobs.

Why This Matters

os.kill(pid, 0) is POSIX-standard and works on macOS. With signal 0 no signal is actually delivered; the call only performs the existence and permission checks, so it probes liveness without affecting the target process:

  • Process alive → returns normally
  • Process dead → raises ProcessLookupError (or OSError with ESRCH)

The only Linux-specific part is the zombie state check via /proc/{pid}/stat, which should remain guarded.

The Fix

 def _linux_pid_is_alive(pid: int) -> Optional[bool]:
-    if not sys.platform.startswith("linux") or pid <= 0:
+    if pid <= 0:
         return None
     try:
         os.kill(pid, 0)
     ...
-    if alive:
+    if alive and sys.platform.startswith("linux"):
         state = _linux_process_state(pid)
         if state in {"Z", "X", "x"}:
             return False
     return alive

Two changes:

  1. Allow os.kill(pid, 0) on all POSIX platforms (macOS included)
  2. Keep /proc zombie check Linux-only
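Applied to a whole function, the two changes might look like the sketch below. This is not the PR's code: the part elided in the diff is filled in with assumed exception handling, `_proc_state` is a hypothetical stand-in for `_linux_process_state`, and treating `PermissionError` as "alive" is a choice made here. The `os.name` guard is also an addition, because on Windows `os.kill` does not treat signal 0 as a harmless existence check.

```python
import os
import sys
from typing import Optional


def _proc_state(pid: int) -> Optional[str]:
    """Read the one-letter process state from /proc/<pid>/stat (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as fh:
            data = fh.read().decode("ascii", "replace")
    except OSError:
        return None
    # The state field is the first token after the ')' closing the command name.
    _, _, rest = data.rpartition(")")
    fields = rest.split()
    return fields[0] if fields else None


def pid_is_alive(pid: int) -> Optional[bool]:
    """Return True/False when liveness is known, None when it is unclear."""
    if os.name != "posix" or pid <= 0:
        return None  # assumption: only probe where signal 0 is a safe no-op
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to another user
    except OSError:
        return None
    # Zombie/dead states are only visible through /proc, so keep this Linux-only.
    if sys.platform.startswith("linux") and _proc_state(pid) in {"Z", "X", "x"}:
        return False
    return True
```

Returning `None` for anything uncertain keeps the caller on the conservative timeout path described in the PR summary.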

Testing

Verified on macOS with a stuck cron job — orphan recovery now detects dead PIDs immediately instead of waiting for timeout. On Linux, behavior is unchanged.

@Rubirub

Rubirub commented Apr 12, 2026

Author

@MiraiChino
Good catch. You were right that _linux_pid_is_alive() was effectively Linux-only and made macOS fall back to timeout-based orphan recovery.

I fixed that issue, and while touching the ownership path I also extended the owner fingerprinting so macOS now participates in the full claim/recovery flow instead of only getting dead-PID detection.

What changed:

  • PID liveness now uses os.kill(pid, 0) on macOS too
  • Linux zombie-state handling stays Linux-only
  • macOS now records/verifies owner boot/process fingerprints during claim/recovery as well

I also added tests covering:

  • macOS PID liveness
  • matching macOS owner identity
  • macOS PID reuse / fingerprint mismatch
  • claim metadata persistence on macOS
  • early orphan recovery on macOS
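The PID reuse case is the reason a bare liveness check is not enough: a live PID may belong to a different process than the one that made the claim. A minimal sketch of the comparison, assuming the claim stores a boot identifier and a process-start fingerprint, follows. `OwnerClaim`, `classify_owner`, and the string fingerprints are hypothetical, and how the PR actually derives these values on macOS or Linux is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OwnerClaim:
    """Hypothetical owner metadata persisted alongside a claim."""
    pid: int
    boot_id: Optional[str] = None        # identifies the boot the claim was made in
    process_start: Optional[str] = None  # identifies this incarnation of the PID


def classify_owner(
    claim: OwnerClaim,
    current_boot_id: Optional[str],
    pid_alive: Optional[bool],
    live_process_start: Optional[str],
) -> str:
    """Classify a claim as 'alive', 'orphaned', or 'unknown'."""
    if claim.boot_id and current_boot_id and claim.boot_id != current_boot_id:
        return "orphaned"  # the machine rebooted since the claim was made
    if pid_alive is False:
        return "orphaned"  # the owner process is gone
    if pid_alive is None:
        return "unknown"   # cannot tell: leave it to timeout-based recovery
    if claim.process_start is None or live_process_start is None:
        return "unknown"   # e.g. legacy metadata without a fingerprint
    if claim.process_start != live_process_start:
        return "orphaned"  # PID was reused by a different process
    return "alive"
```

Only `"orphaned"` would justify early recovery; `"unknown"` deliberately falls back to the timeout.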

Rubirub force-pushed the feat/cron-parallel-safe-non-overlap branch from 3811290 to fb37cc1 on April 12, 2026 08:08
@ratacat

ratacat commented Apr 25, 2026


Strong support for this PR. This is exactly the scheduler-side fix our incident needed.

We hit the production version of this failure mode on macOS/launchd: one openai-codex cron job wedged, the gateway kept holding ~/.hermes/cron/.tick.lock, unrelated cron jobs stopped running, and after restart some overdue jobs were fast-forwarded/skipped because they were outside the grace window. Moving job execution outside the global scheduler lock, with per-job ownership/non-overlap, is the right architectural boundary.

Things I especially like:

  • short global lock only for scheduler metadata transitions
  • persisted in_flight ownership with run_id
  • finalization only if the same run_id still owns the claim
  • output saved before finalization
  • stale/orphan recovery rather than clearing claims at shutdown
  • macOS owner-liveness/fingerprint coverage, which matters for launchd users
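The first three points describe a lock boundary that can be sketched in a few lines. The class below is an illustration under stated assumptions, not the scheduler from this PR: `TinyScheduler` is hypothetical, an in-process `threading.Lock` stands in for the `.tick.lock` file, an integer counter stands in for `run_id`, and output is stored at finalization for brevity rather than saved beforehand.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class TinyScheduler:
    """Hold the global lock only for metadata transitions, never for job work."""

    def __init__(self, pool: ThreadPoolExecutor) -> None:
        self._lock = threading.Lock()  # stands in for the global scheduler lock
        self._pool = pool
        self._next_run = 0
        self.in_flight: Dict[str, int] = {}  # job_id -> run id owning the claim
        self.outputs: Dict[str, str] = {}

    def _claim(self, job_ids: List[str]) -> List[Tuple[str, int]]:
        claimed = []
        with self._lock:  # short critical section: claim bookkeeping only
            for job_id in job_ids:
                if job_id in self.in_flight:
                    continue  # per-job non-overlap
                self._next_run += 1
                self.in_flight[job_id] = self._next_run
                claimed.append((job_id, self._next_run))
        return claimed

    def _run(self, job_id: str, run_id: int, fn: Callable[[], str]) -> None:
        output = fn()  # long-running work happens with no lock held
        with self._lock:  # short critical section: finalization only
            if self.in_flight.get(job_id) == run_id:  # still the owner?
                del self.in_flight[job_id]
                self.outputs[job_id] = output

    def tick(self, jobs: Dict[str, Callable[[], str]]) -> List[Future]:
        """Claim whatever is free and dispatch it to the pool."""
        return [
            self._pool.submit(self._run, job_id, run_id, jobs[job_id])
            for job_id, run_id in self._claim(list(jobs))
        ]
```

With this shape, a job that never returns keeps its own claim but does not hold the lock, so later ticks can still claim and run other jobs as long as the pool has a free worker.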

Small elegance suggestions, non-blocking:

  1. The current default cron.max_parallel_jobs: 1 is safe, but operationally it means a wedged worker can still consume the only dispatch slot. It no longer holds .tick.lock, which is a big improvement, but unrelated jobs may still starve in the same gateway process. Maybe document this explicitly in cron troubleshooting / config docs, or consider a default of 2 if the intent is to protect unrelated jobs from one stuck run.

  2. _current_owner_metadata() in scheduler.py currently reaches into platform-specific/private helpers from cron.jobs. Since claim_due_jobs() already knows how to fill platform fingerprints when metadata is None, an even cleaner boundary might be for the scheduler to pass only stable owner identity (owner_instance_id, owner_pid) and let cron.jobs own all platform fingerprinting. Not a correctness blocker, just a separation-of-concerns tweak.

  3. The comments around recovery semantics could call out that timeout/orphan recovery counts as a failed attempt and increments repeat accounting. That seems intentional and conservative, but it is an important operational behavior.

Overall: this would have prevented the global scheduler-lock blast radius we saw, and paired with the Codex transport fix in #12953 it looks like the right full fix path.

alt-glitch added the type/feature (New feature or request), P2 (Medium: degraded but workaround exists), and comp/cron (Cron scheduler and job management) labels on Apr 25, 2026
@alt-glitch

Collaborator

Related to #9965 — both implement parallel cron job execution. This PR appears more comprehensive with ownership metadata and orphan recovery.

Rubirub force-pushed the feat/cron-parallel-safe-non-overlap branch from fb37cc1 to d7de3bb on April 26, 2026 07:27
@Rubirub

Rubirub commented Apr 26, 2026

Author

@ratacat

Thanks — this is exactly the failure mode this PR is trying to contain.

On cron.max_parallel_jobs: agreed with the concern. The latest rewrite keeps upstream’s default None rather than forcing 1, so by default a wedged worker does not consume the only dispatch slot. Users can still explicitly set 1 if they want serial cron execution.
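The starvation concern can be reproduced in isolation. The sketch below is only an illustration: `unrelated_job_starves` is a hypothetical helper, and it simply passes the value through to `ThreadPoolExecutor(max_workers=...)`, where `None` means the executor's own default. How Hermes itself interprets `cron.max_parallel_jobs: None` is not stated in this thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Optional


def unrelated_job_starves(max_parallel_jobs: Optional[int], wait: float = 0.5) -> bool:
    """Return True if a quick job cannot finish while another job is wedged."""
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        pool.submit(release.wait)          # stands in for a wedged job
        quick = pool.submit(lambda: "ok")  # an unrelated, fast job
        try:
            quick.result(timeout=wait)
            return False
        except FutureTimeout:
            return True
        finally:
            release.set()  # unwedge so the pool can shut down cleanly
```

With a single worker the quick job waits behind the wedged one; with two workers, or the executor default, it completes.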

I also applied the separation-of-concerns suggestion: _current_owner_metadata() now only passes stable scheduler identity (owner_instance_id, owner_pid), and cron.jobs owns the platform-specific boot/process fingerprint enrichment when jobs are claimed.

Finally, I clarified stale/orphan recovery accounting in the job-store docs. Recovery is intentionally recorded as a failed run attempt: it clears the in-flight claim, stores an error status, and advances repeat accounting conservatively rather than silently rewinding the job as if the claimed run never happened.
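The accounting described here can be pictured as follows. This is a sketch of the stated behavior only: `Job`, `recover_stale_claim`, and the `repeat_times`/`repeat_completed` fields are hypothetical names, not the job-store schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    repeat_times: Optional[int] = None  # None is assumed to mean "no limit"
    repeat_completed: int = 0
    in_flight_run_id: Optional[str] = None
    last_status: Optional[str] = None
    last_error: Optional[str] = None


def recover_stale_claim(job: Job, reason: str) -> bool:
    """Treat a stale or orphaned claim as a failed run attempt."""
    if job.in_flight_run_id is None:
        return False  # nothing to recover
    job.in_flight_run_id = None  # clear the in-flight claim
    job.last_status = "error"    # store an error status
    job.last_error = reason
    job.repeat_completed += 1    # the attempt counts; do not rewind
    return True


def has_runs_left(job: Job) -> bool:
    return job.repeat_times is None or job.repeat_completed < job.repeat_times
```

Under this reading, a job limited to one run that is recovered as orphaned has used up that run and is not silently retried.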

@Rubirub

Rubirub commented Apr 26, 2026

Author

@alt-glitch I have rebased onto the latest upstream changes and retested; this is ready to merge again.

Rubirub force-pushed the feat/cron-parallel-safe-non-overlap branch from 2b32adb to b3ccabe on May 11, 2026 16:38

Labels

comp/cron (Cron scheduler and job management), P2 (Medium: degraded but workaround exists), type/feature (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Long-running cron jobs block the scheduler tick loop

4 participants