feat(cron): run independent cron jobs in parallel with safe non-overlap#7158
feat(cron): run independent cron jobs in parallel with safe non-overlap#7158Rubirub wants to merge 4 commits into
Conversation
|
This is a great PR — the orphan recovery mechanism is exactly what cron needs to handle gateway restarts cleanly. The per-job locking and in-flight tracking are solid. One suggestion to make it even better: The ProblemThe function gates on if not sys.platform.startswith("linux") or pid <= 0:
return NoneOn macOS ( Why This Matters
The only Linux-specific part is the zombie state check via The Fix def _linux_pid_is_alive(pid: int) -> Optional[bool]:
- if not sys.platform.startswith("linux") or pid <= 0:
+ if pid <= 0:
return None
try:
os.kill(pid, 0)
...
- if alive:
+ if alive and sys.platform.startswith("linux"):
state = _linux_process_state(pid)
if state in {"Z", "X", "x"}:
return False
return aliveTwo changes:
TestingVerified on macOS with a stuck cron job — orphan recovery now detects dead PIDs immediately instead of waiting for timeout. On Linux, behavior is unchanged. |
|
@MiraiChino I fixed that issue, and while touching the ownership path I also extended the owner fingerprinting so macOS now participates in the full claim/recovery flow instead of only getting dead-PID detection. What changed:
I also added tests covering:
|
3811290 to
fb37cc1
Compare
|
Strong support for this PR. This is exactly the scheduler-side fix our incident needed. We hit the production version of this failure mode on macOS/launchd: one Things I especially like:
Small elegance suggestions, non-blocking:
Overall: this would have prevented the global scheduler-lock blast radius we saw, and paired with the Codex transport fix in #12953 it looks like the right full fix path. |
|
Related to #9965 — both implement parallel cron job execution. This PR appears more comprehensive with ownership metadata and orphan recovery. |
fb37cc1 to
d7de3bb
Compare
|
Thanks — this is exactly the failure mode this PR is trying to contain. On cron.max_parallel_jobs: agreed with the concern. The latest rewrite keeps upstream’s default None rather than forcing 1, so by default a wedged worker does not consume the only dispatch slot. Users can still explicitly set 1 if they want serial cron execution. I also applied the separation-of-concerns suggestion: _current_owner_metadata() now only passes stable scheduler identity (owner_instance_id, owner_pid), and cron.jobs owns the platform-specific boot/process fingerprint enrichment when jobs are claimed. Finally, I clarified stale/orphan recovery accounting in the job-store docs. Recovery is intentionally recorded as a failed run attempt: it clears the in-flight claim, stores an error status, and advances repeat accounting conservatively rather than silently rewinding the job as if the claimed run never happened. |
|
@alt-glitch I have rebased with all the new changes from the upstream, retested, and it is ready for merging once again. |
2b32adb to
b3ccabe
Compare
Summary
Design
The scheduler still performs recovery before claim on each tick. Claimed work now carries owner metadata so different jobs can run concurrently while the same job remains protected from overlap. If a gateway instance dies after claiming work, the next instance can recover definitely orphaned claims after a short grace period. When owner liveness is unclear, behavior stays conservative and falls back to timeout-based recovery.
Safety properties
Test Plan
Issue links
hermes updateauto-restart kills in-process cron worker with no opt-out #6702