Problem
When any cron job runs a long-running task (e.g. an Opus delegation lasting 2–4 min), tick() holds an exclusive fcntl.LOCK_EX lock for the entire duration of all jobs in the batch — not just the scheduling decision. This causes every subsequent 60s ticker attempt to hit the lock, skip, and return 0.
This interacts badly with the grace-window logic in compute_next_run (half the period, clamped to between 2 min and 2 h): by the time the lock is released, the grace window has been exceeded, so all overdue jobs fast-forward to now + interval instead of catching up. Missed runs are silently dropped.
Observed impact
Production setup running 4 pdp-v1 epic crons (15m + 5m + 5m + 15m intervals). During an active interactive session with several Opus delegations:
- A 5m-interval cron had a 68-minute gap between runs (last: 16:22, next computed: 17:33)
- An autonomous coding epic made zero progress for 2+ hours while human oversight was live
- No error logged, no alert — completely silent failure
Root cause (code refs)
- cron/scheduler.py::tick(): fcntl.LOCK_EX is acquired at entry and held until ThreadPoolExecutor.__exit__ (i.e. all jobs complete). The lock is not needed during job execution, only during the get_due_jobs() + advance_next_run() critical section.
- cron/jobs.py::compute_next_run(): the grace window for interval-kind jobs is half the period (min 2m, max 2h). When grace is exceeded, it falls through to now + interval with no catch-up.
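To make the interaction concrete, here is a minimal model of the grace rule as described in this report (half the period, clamped to 2 min and 2 h). The names grace_window and next_run_after_delay are hypothetical and do not mirror the real compute_next_run signature; the date is arbitrary and only the rule itself comes from this issue.

```python
from datetime import datetime, timedelta

MIN_GRACE = timedelta(minutes=2)
MAX_GRACE = timedelta(hours=2)


def grace_window(interval: timedelta) -> timedelta:
    """Half the period, clamped to [2 min, 2 h] (rule as described in this issue)."""
    return max(MIN_GRACE, min(interval / 2, MAX_GRACE))


def next_run_after_delay(scheduled: datetime, interval: timedelta, now: datetime) -> datetime:
    """Hypothetical model: run late if within grace, else skip to now + interval."""
    if now - scheduled <= grace_window(interval):
        return scheduled  # still eligible: the overdue run fires on this tick
    return now + interval  # grace exceeded: the missed run is dropped


five_min = timedelta(minutes=5)
scheduled = datetime(2024, 1, 1, 16, 27)  # arbitrary illustrative timestamp
# The tick lock is held until 4 minutes past the scheduled time: 4 min > 2.5 min grace.
late = scheduled + timedelta(minutes=4)
print(grace_window(five_min))                            # 0:02:30
print(next_run_after_delay(scheduled, five_min, late))   # 2024-01-01 16:36:00
```

Under this model a 5m job tolerates only 2.5 minutes of delay, so a single 2–4 minute delegation holding the lock can be enough to drop a run.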
Proposed fixes
Fix 1 — Release lock after dispatch, not after completion (~30 LOC)
# Acquire lock
with tick_lock:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # at-most-once preserved
# Lock released here; jobs run outside the critical section
with ThreadPoolExecutor(max_workers=_max_workers) as pool:
    futures = [pool.submit(run_job, job) for job in due_jobs]
    ...
advance_next_run() already sets next_run_at before any job starts, so at-most-once semantics are preserved without holding the lock during execution.
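A runnable sketch of the same idea, for illustration only. It assumes a Unix host (fcntl is not available on Windows) and replaces the real job store with an in-memory dict; LOCK_PATH, INTERVAL, jobs, and claim_due_jobs are hypothetical stand-ins for the scheduler's lock file, get_due_jobs(), and advance_next_run().

```python
import fcntl
import os
import tempfile
import threading
import time
from concurrent.futures import ThreadPoolExecutor

LOCK_PATH = os.path.join(tempfile.gettempdir(), "cron_tick_demo.lock")  # hypothetical path
INTERVAL = 300.0  # seconds; illustrative

# Stand-in job store: job id -> next_run_at (epoch seconds).
jobs = {"a": 0.0, "b": 0.0}
jobs_guard = threading.Lock()


def claim_due_jobs(now: float) -> list[str]:
    """Critical section: pick due jobs and advance next_run_at under the file lock."""
    with open(LOCK_PATH, "w") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return []  # another tick is mid-decision; skip this round
        try:
            with jobs_guard:
                due = [jid for jid, at in jobs.items() if at <= now]
                for jid in due:
                    jobs[jid] = now + INTERVAL  # claimed before running: at-most-once
            return due
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def tick(run_job, max_workers: int = 2) -> int:
    """Claim due jobs under the lock, then execute them with the lock released."""
    due = claim_due_jobs(time.time())
    # The file lock is already released here, so slow jobs cannot block later ticks.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(run_job, due))
    return len(due)
```

One consequence worth checking before adopting this: once the lock no longer spans execution, a job that runs longer than its own interval could be dispatched again while the previous run is still active, so a per-job in-flight guard may be needed.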
Fix 2 — Better grace / catch-up for interval jobs (~10 LOC)
For kind=interval, advance to the smallest last_run + N×interval > now rather than now + interval. This preserves cadence without accumulating missed runs.
Alternatively: cap grace at 1×period instead of 0.5×period so a 15m job tolerates a 15m delay.
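A minimal sketch of the aligned catch-up rule, assuming last_run and interval are available as datetime and timedelta values. next_aligned_run is a hypothetical helper, not an existing function in cron/jobs.py, and the date in the example is arbitrary.

```python
from datetime import datetime, timedelta


def next_aligned_run(last_run: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the smallest last_run + N*interval that is strictly after now (N >= 1)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    elapsed = now - last_run
    # timedelta // timedelta yields an int: the number of whole periods already elapsed.
    n = max(1, elapsed // interval + 1)
    return last_run + n * interval


# A 5m job that last ran at 16:22 and is next evaluated at 17:30.
last = datetime(2024, 1, 1, 16, 22)
now = datetime(2024, 1, 1, 17, 30)
print(next_aligned_run(last, timedelta(minutes=5), now))  # 2024-01-01 17:32:00
```

With now + interval the same job would next run at 17:35; the aligned rule lands on the original 5-minute grid at 17:32 and schedules exactly one run, however many periods were missed.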
Fix 3 — Cap max_parallel_jobs default (~1 LOC)
max_parallel_jobs is currently unset (empty), which leaves parallelism unbounded. Default it to 2 or 4 so that N concurrent heavy jobs cannot hold the lock indefinitely.
Workaround
Set cron.max_parallel_jobs: 2 in config.yaml. This limits the blast radius but does not fix the root cause (lock held during execution).
Notes
- A separate scheduler process does not fix this — same lock semantics apply, and Discord delivery has no standalone-process adapter.
- The hermes cron run manual trigger resets next_run_at = now + interval, causing further schedule drift. Avoid using it to "unstick" a stalled scheduler.
- Confirmed on: Linux (6.8.0), gateway mode (Discord), 4 active interval crons, Anthropic/OpenRouter provider.