cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs #27485

@hermes-agent-hp

Problem

When any cron job runs a long-running task (e.g. an Opus delegation lasting 2–4 min), tick() holds an exclusive fcntl.LOCK_EX lock for the entire duration of all jobs in the batch — not just the scheduling decision. This causes every subsequent 60s ticker attempt to hit the lock, skip, and return 0.

This interacts badly with the grace-window logic in compute_next_run (half the period, clamped to between 2 min and 2 h). By the time the lock is finally released, overdue jobs are already past their grace window, so they fast-forward to now + interval instead of catching up. Missed runs are silently dropped.
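The skip-and-return-0 behaviour can be reproduced in isolation. This is a minimal sketch, not the scheduler's code: it assumes the lock is taken with fcntl.flock in non-blocking mode (the issue only names fcntl.LOCK_EX), and try_tick and the lock path are hypothetical names.

```python
import fcntl
import os
import tempfile


def try_tick(lock_path):
    """Hypothetical ticker attempt: return 0 if the lock is busy, 1 otherwise."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Non-blocking exclusive lock (assumed mode): fail fast if held elsewhere.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return 0  # lock held by a tick that is still running jobs
        return 1
    finally:
        os.close(fd)  # closing the descriptor releases the flock


with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "tick.lock")

    # Simulate a tick that keeps the lock while its long job runs.
    holder = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(holder, fcntl.LOCK_EX)

    # Three later ticker attempts while the lock is still held.
    skipped = [try_tick(lock_path) for _ in range(3)]

    os.close(holder)  # the long job finishes and the lock is released
    after_release = try_tick(lock_path)

print(skipped, after_release)
```

Every attempt made while the holder keeps the lock is expected to return 0 without raising or logging, which matches the silent failure described here.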

Observed impact

Production setup running 4 pdp-v1 epic crons (15m + 5m + 5m + 15m intervals). During an active interactive session with several Opus delegations:

  • A 5m-interval cron had a 68-minute gap between runs (last: 16:22, next computed: 17:33)
  • An autonomous coding epic made zero progress for 2+ hours while human oversight was live
  • No error logged, no alert — completely silent failure

Root cause (code refs)

  1. cron/scheduler.py::tick(): fcntl.LOCK_EX is acquired at entry and held until ThreadPoolExecutor.__exit__ returns (i.e. until all jobs complete). The lock is not needed during job execution, only during the get_due_jobs() + advance_next_run() critical section.

  2. cron/jobs.py::compute_next_run(): the grace window for interval-kind jobs is half the period (min 2m, max 2h). When the grace window is exceeded, the code falls through to now + interval with no catch-up.
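To make the grace arithmetic concrete, here is a small sketch of the rule as described in item 2. It is not the code in cron/jobs.py; grace_window and next_run_after_delay are hypothetical names, and what the real function returns inside the grace window is an assumption.

```python
from datetime import datetime, timedelta

MIN_GRACE = timedelta(minutes=2)
MAX_GRACE = timedelta(hours=2)


def grace_window(interval):
    """Half the period, clamped to the range [2 min, 2 h]."""
    return min(max(interval / 2, MIN_GRACE), MAX_GRACE)


def next_run_after_delay(due_at, now, interval):
    """Assumed behaviour: keep the due time inside grace, else skip ahead."""
    if now - due_at <= grace_window(interval):
        return due_at          # still considered due, runs now
    return now + interval      # grace exceeded: the missed run is dropped


five_min = timedelta(minutes=5)
due = datetime(2024, 1, 1, 16, 27)          # arbitrary illustrative date
released = datetime(2024, 1, 1, 16, 31)     # lock released 4 min after due

print(grace_window(five_min))                        # 0:02:30
print(next_run_after_delay(due, released, five_min)) # 2024-01-01 16:36:00
```

Under this rule a 5m job tolerates only 2.5 minutes of delay, so a single 2–4 min job holding the lock is enough to push it past grace.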

Proposed fixes

Fix 1 — Release lock after dispatch, not after completion (~30 LOC)

# Acquire lock
with tick_lock:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # at-most-once preserved
# Lock released here — jobs run outside the critical section

with ThreadPoolExecutor(max_workers=_max_workers) as pool:
    futures = [pool.submit(run_job, job) for job in due_jobs]
    ...

advance_next_run() already sets next_run_at before any job starts, so at-most-once semantics are preserved without holding the lock during execution.
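A self-contained version of this pattern is sketched below to show that a second tick can take the lock while the first tick's job is still running, and that the job is not dispatched twice. Everything here is a stand-in: threading.Lock replaces the file lock, the in-memory jobs table and the now parameters are hypothetical, and the pool is passed in rather than created inside tick().

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

tick_lock = threading.Lock()  # stand-in for the fcntl file lock
jobs = {"a": {"id": "a", "interval": 60.0, "next_run_at": 0.0}}
runs = []


def get_due_jobs(now):
    return [j for j in jobs.values() if j["next_run_at"] <= now]


def advance_next_run(job_id, now):
    jobs[job_id]["next_run_at"] = now + jobs[job_id]["interval"]


def run_job(job):
    time.sleep(0.2)  # simulated long-running work
    runs.append(job["id"])


def tick(now, pool):
    """Return the submitted futures, or None if the lock was busy."""
    if not tick_lock.acquire(blocking=False):
        return None
    try:
        due = get_due_jobs(now)
        for job in due:
            advance_next_run(job["id"], now)  # at-most-once preserved
    finally:
        tick_lock.release()  # released before any job starts
    return [pool.submit(run_job, job) for job in due]


with ThreadPoolExecutor(max_workers=2) as pool:
    first = tick(0.0, pool)                  # dispatches job "a"
    lock_free_during_job = not tick_lock.locked()
    second = tick(1.0, pool)                 # gets the lock, finds nothing due

print(len(first), lock_free_during_job, second, runs)
```

The second tick returns an empty list rather than None, meaning it acquired the lock and simply found nothing due, because next_run_at had already been advanced.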

Fix 2 — Better grace / catch-up for interval jobs (~10 LOC)

For kind=interval, advance to the smallest last_run + N×interval > now rather than now + interval. This preserves cadence without accumulating missed runs.
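The cadence-preserving advance amounts to N = floor((now - last_run) / interval) + 1. A minimal sketch, assuming last_run and now are datetimes and interval is a timedelta; next_on_cadence is a hypothetical name, and the date and the 17:28:30 release time are illustrative values, not taken from the logs.

```python
from datetime import datetime, timedelta


def next_on_cadence(last_run, interval, now):
    """Smallest last_run + N * interval that is strictly after now (N >= 1)."""
    n = max(1, (now - last_run) // interval + 1)
    return last_run + n * interval


last_run = datetime(2024, 1, 1, 16, 22)
now = datetime(2024, 1, 1, 17, 28, 30)

# 66.5 min elapsed -> 13 whole periods missed -> N = 14 -> 16:22 + 70 min
print(next_on_cadence(last_run, timedelta(minutes=5), now))  # 2024-01-01 17:32:00
```

The next run lands on the original 5-minute grid (17:32) rather than at now + interval, so the schedule does not drift and no backlog of missed runs is replayed.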

Alternatively: cap grace at 1×period instead of 0.5×period so a 15m job tolerates a 15m delay.

Fix 3 — Cap max_parallel_jobs default (~1 LOC)

max_parallel_jobs is currently unbounded (the default is empty). Default it to 2 or 4 so that a batch of N concurrent heavy jobs cannot hold the lock indefinitely.

Workaround

Set cron.max_parallel_jobs: 2 in config.yaml. This limits the blast radius but does not fix the root cause (lock held during execution).

Notes

  • A separate scheduler process does not fix this — same lock semantics apply, and Discord delivery has no standalone-process adapter.
  • The hermes cron run manual trigger resets next_run_at = now + interval, causing further schedule drift. Avoid using it to "unstick" a stalled scheduler.
  • Confirmed on: Linux (6.8.0), gateway mode (Discord), 4 active interval crons, Anthropic/OpenRouter provider.

Metadata

    Labels

    P1 (High: major feature broken, no workaround); comp/cron (Cron scheduler and job management); sweeper:implemented-on-main (Sweeper: behavior already present on current main); type/bug (Something isn't working)
