cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs #27485

## Problem

When any cron job runs a long task (e.g. an Opus delegation lasting 2–4 min), `tick()` holds an exclusive `fcntl.LOCK_EX` lock for the **entire duration** of all jobs in the batch, not just for the scheduling decision. Every 60s ticker attempt made during that time hits the lock, skips, and returns 0.

Combined with the grace-window logic in `compute_next_run` (half the period, clamped to 2min–2hr), this means that by the time the lock is released, overdue jobs have exceeded their grace window and are fast-forwarded to `now + interval` instead of catching up. **Missed runs are silently dropped.**

## Observed impact

Production setup running 4 `pdp-v1` epic crons (15m + 5m + 5m + 15m intervals). During an active interactive session with several Opus delegations:

- A 5m-interval cron had a **68-minute gap** between runs (last: 16:22, next computed: 17:33)
- An autonomous coding epic made **zero progress for 2+ hours** while human oversight was live
- No error logged, no alert — completely silent failure

## Root cause (code refs)

1. **`cron/scheduler.py::tick()`** — `fcntl.LOCK_EX` acquired at entry, held until `ThreadPoolExecutor.__exit__` (i.e. all jobs complete). The lock is not needed during job execution — only during the `get_due_jobs()` + `advance_next_run()` critical section.

2. **`cron/jobs.py::compute_next_run()`** — grace window for `interval`-kind jobs is half-period (min 2m, max 2h). When grace is exceeded, falls through to `now + interval` with no catch-up.
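To make the interaction between the two points above concrete, here is a minimal sketch of the grace rule as described in this issue (half the period, clamped to 2 minutes and 2 hours). The function names and the strict `>` comparison are assumptions for illustration, not the actual code in `cron/jobs.py`.

```python
MIN_GRACE_S = 2 * 60        # 2 minutes, per the description above
MAX_GRACE_S = 2 * 60 * 60   # 2 hours, per the description above


def grace_seconds(interval_s: float) -> float:
    """Half the interval, clamped to [2 min, 2 h] (hypothetical helper)."""
    return min(max(interval_s / 2, MIN_GRACE_S), MAX_GRACE_S)


def is_past_grace(scheduled_at: float, now: float, interval_s: float) -> bool:
    """True when a run is so late that it would be skipped (assumed strict >)."""
    return now - scheduled_at > grace_seconds(interval_s)


# A 5m job has a 150s grace window; a 15m job has 450s.
print(grace_seconds(5 * 60), grace_seconds(15 * 60))
# A 5m job that comes due just as a 3-minute batch takes the lock
# is already 180s late when the lock is released.
print(is_past_grace(scheduled_at=0, now=180, interval_s=5 * 60))
```

Under these assumptions, a single 3-minute job holding the lock is enough to push a 5m job past its grace window, which matches the silent skips reported above.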

## Proposed fixes

### Fix 1 — Release lock after dispatch, not after completion (~30 LOC)

```python
# Acquire lock
with tick_lock:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # at-most-once preserved
# Lock released here — jobs run outside the critical section

with ThreadPoolExecutor(max_workers=_max_workers) as pool:
    futures = [pool.submit(run_job, job) for job in due_jobs]
    ...
```

`advance_next_run()` already sets `next_run_at` before any job starts, so at-most-once semantics are preserved without holding the lock during execution.
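A self-contained sketch of the same idea follows, so the behaviour can be tried in isolation. It assumes the lock is a non-blocking `fcntl.flock` on a lock file; the lock path, the in-memory job dicts, and the `tick()` signature are all hypothetical stand-ins for the real scheduler state.

```python
import fcntl
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

# Hypothetical lock file location, used only for this sketch.
LOCK_PATH = os.path.join(tempfile.mkdtemp(), "tick.lock")


@contextmanager
def tick_lock(path=LOCK_PATH):
    """Yield True if the exclusive lock was taken, False if it is already held."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor also releases the flock


def tick(jobs, now, run_job, pool):
    """Select and advance due jobs under the lock, then dispatch outside it."""
    with tick_lock() as acquired:
        if not acquired:
            return 0
        due = [job for job in jobs if job["next_run_at"] <= now]
        for job in due:
            # Advance before dispatch so a concurrent tick cannot pick it up again.
            job["next_run_at"] = now + job["interval"]
    # Lock is released here; long jobs no longer block later ticks.
    for job in due:
        pool.submit(run_job, job)
    return len(due)
```

With this shape, a tick that fires while an earlier job is still running can still take the lock and dispatch whatever else has come due; only the selection and bookkeeping are serialized.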

### Fix 2 — Better grace / catch-up for interval jobs (~10 LOC)

For `kind=interval`, advance to the smallest `last_run + N×interval > now` rather than `now + interval`. This preserves cadence without accumulating missed runs.

Alternatively, widen the grace window to `1×period` instead of `0.5×period`, so that a 15m job tolerates a 15m delay.
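The first option (advance to the smallest `last_run + N×interval > now`) can be sketched as below. The function name and the use of plain numeric timestamps in seconds are assumptions for illustration; the real `compute_next_run` may represent times differently.

```python
def next_aligned_run(last_run: float, interval: float, now: float) -> float:
    """Return the smallest last_run + N*interval that is strictly after now (N >= 1)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Number of whole intervals already elapsed, plus one to land after `now`.
    n = (now - last_run) // interval + 1
    return last_run + max(n, 1) * interval


# 5m job, last run at t=0, scheduler unblocked 68 minutes (4080s) later:
print(next_aligned_run(0, 300, 4080))   # stays on the original 5m grid
print(4080 + 300)                       # what `now + interval` would give
```

The aligned result keeps the job on its original grid and skips the missed slots in one step, whereas `now + interval` shifts the whole schedule by however long the scheduler was blocked.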

### Fix 3 — Cap max_parallel_jobs default (~1 LOC)

`max_parallel_jobs:` is currently unbounded (empty). Default to `4` or `2` to prevent N concurrent heavy jobs from holding the lock indefinitely.
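A minimal sketch of how an empty setting could be resolved to a bounded default. The helper name and the choice of `4` as the fallback are assumptions; how the scheduler actually reads `max_parallel_jobs` is not shown in this issue.

```python
DEFAULT_MAX_PARALLEL_JOBS = 4  # assumed default, one of the two values suggested above


def resolve_max_workers(raw, default=DEFAULT_MAX_PARALLEL_JOBS):
    """Map an empty or missing config value to a bounded default (hypothetical helper)."""
    if raw is None or raw == "":
        return default
    value = int(raw)
    if value < 1:
        raise ValueError("max_parallel_jobs must be at least 1")
    return value


print(resolve_max_workers(None))  # empty config falls back to the default
print(resolve_max_workers("2"))   # explicit values are respected
```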

## Workaround

Set `cron.max_parallel_jobs: 2` in `config.yaml`. This limits the blast radius but does not fix the root cause (lock held during execution).

## Notes

- A separate scheduler process does **not** fix this — same lock semantics apply, and Discord delivery has no standalone-process adapter.
- The `hermes cron run` manual trigger resets `next_run_at = now + interval`, causing further schedule drift. Avoid using it to "unstick" a stalled scheduler.
- Confirmed on: Linux (6.8.0), gateway mode (Discord), 4 active interval crons, Anthropic/OpenRouter provider.
