fix(cron): release tick file lock before executing due jobs#27492
fix(cron): release tick file lock before executing due jobs#27492briandevans wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR fixes a scheduler regression where the tick file lock was held during job execution, causing subsequent ticks to be skipped and missed runs to be silently dropped.
Changes:
- Refactors
tick()to release the OS file lock immediately afteradvance_next_run()completes (before executing any jobs). - Adds a regression test asserting the lock is released before any
run_job()call.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
cron/scheduler.py |
Shrinks the file-lock critical section to only due-job selection + next-run advancement; runs jobs after unlocking. |
tests/cron/test_scheduler.py |
Adds regression coverage ensuring the lock is released before jobs execute. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Partition due jobs: those with a per-job workdir mutate | ||
| # os.environ["TERMINAL_CWD"] inside run_job, which is process-global — | ||
| # so they MUST run sequentially to avoid corrupting each other. Jobs | ||
| # without a workdir leave env untouched and stay parallel-safe. | ||
| workdir_jobs = [j for j in due_jobs if (j.get("workdir") or "").strip()] | ||
| parallel_jobs = [j for j in due_jobs if not (j.get("workdir") or "").strip()] | ||
|
|
||
| _results: list = [] | ||
|
|
||
| # Sequential pass for workdir jobs. | ||
| for job in workdir_jobs: | ||
| _ctx = contextvars.copy_context() | ||
| _results.append(_ctx.run(_process_job, job)) |
| # Resolve max parallel workers: env var > config.yaml > unbounded. | ||
| # Set HERMES_CRON_MAX_PARALLEL=1 to restore old serial behaviour. | ||
| _max_workers: Optional[int] = None | ||
| try: | ||
| _env_par = os.getenv("HERMES_CRON_MAX_PARALLEL", "").strip() | ||
| if _env_par: | ||
| _max_workers = int(_env_par) or None | ||
| except (ValueError, TypeError): | ||
| logger.warning("Invalid HERMES_CRON_MAX_PARALLEL value; defaulting to unbounded") | ||
| if _max_workers is None: | ||
| try: | ||
| _ucfg = load_config() or {} | ||
| _cfg_par = ( | ||
| _ucfg.get("cron", {}) if isinstance(_ucfg, dict) else {} | ||
| ).get("max_parallel_jobs") | ||
| if _cfg_par is not None: | ||
| _max_workers = int(_cfg_par) or None | ||
| except Exception: | ||
| pass |
| from cron import scheduler | ||
|
|
||
| if scheduler.fcntl is None and scheduler.msvcrt is None: | ||
| pytest.skip("no file-locking primitive on this platform") | ||
|
|
||
| events: list = [] |
|
CI audit — all 21 test failures are pre-existing baselines on clean Same set as the audit I posted on #27514 a few hours ago — identical 21 failures with identical error text (acp version bump, Happy to address any failure that's actually in scope here. |
62e2afb to
a839210
Compare
90a1310 to
16ceedf
Compare
fb819fe to
2056b31
Compare
|
I verified this lock-narrowing fix is correct and the critical section boundary is well-placed. Critical section scope: The try/finally now covers only Lock release before execution: Confirmed the file lock is released (in the Variable scoping: Concurrency semantics: The sequential/parallel job partitioning and Edge case: If |
b97e0cc to
c8ea168
Compare
Previously the fcntl/msvcrt LOCK_EX taken at the top of `tick()` was held until the `ThreadPoolExecutor` exited — i.e. until every due job had finished running. For a multi-minute Opus delegation that meant every 60s ticker attempt during the run hit the lock, returned 0, and combined with `compute_next_run`'s grace window (half-period, capped at 2h) silently dropped overdue interval jobs instead of catching up. The lock only needs to serialize the critical section that selects due jobs and advances their `next_run_at`. Once `advance_next_run()` has bumped each job's `next_run_at`, any concurrent tick that picks up the lock afterwards will see the same jobs as no-longer-due, so at-most-once semantics are preserved without holding the lock through the rest of the function. The body of `tick()` is restructured so the lock is released in a `finally` immediately after that critical section; job execution, MCP orphan cleanup, and the result accounting all run lock-free. Behavior of `get_due_jobs`, `advance_next_run`, `_process_job`, the workdir vs parallel partitioning, and the `HERMES_CRON_MAX_PARALLEL` / `cron.max_parallel_jobs` resolution is otherwise unchanged. Regression test in `TestParallelTick` patches `fcntl.flock` / `msvcrt.locking` to record `LOCK_UN`/`LK_UNLCK` events alongside `run_job` calls and asserts the lock is released before any job starts. The same test against the pre-fix scheduler shows the expected event order `run_job, run_job, lock_release`. Fixes NousResearch#27485. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c8ea168 to
48b9823
Compare
|
Thanks for the focused cron fix — the lock-boundary idea is still useful, but the surrounding scheduler has moved on current main. Problems
Suggested changes
Automated hermes-sweeper review. |
Summary
cron/scheduler.py::tick()currently holds thefcntl/msvcrtLOCK_EXfor the full duration of every due job, not just the scheduling decision. A long-running delegation (multi-minute Opus run) starves every 60s ticker for the entire job lifetime, and combined withcompute_next_run's grace window this silently drops overdue interval runs instead of catching up. This PR releases the lock as soon as the critical section (select due jobs + bumpnext_run_at) is done; job execution then runs lock-free.Fixes #27485.
The bug
Observed effect (reported on the issue): a 5-minute interval cron had a 68-minute gap between runs because a parallel Opus job was holding the lock. The repro is just "configure a few interval crons, fire a multi-minute delegation, watch subsequent ticks skip with
Tick skipped — another instance holds the lock."The lock only needs to serialize across tick instances long enough that
advance_next_run()can bump each job'snext_run_at. After that, any concurrent tick that picks up the lock will see those jobs as no-longer-due and exit cleanly — so at-most-once semantics are preserved without the lock spanning execution.The fix
Split
tick()so thefcntl/msvcrtlock wraps onlyget_due_jobs()and theadvance_next_run()loop. The lock is released in afinallyimmediately after that critical section; job execution, MCP orphan cleanup, and result accounting all run lock-free.Concretely, the inner block (worker resolution,
_process_job, workdir vs parallel partitioning, theThreadPoolExecutor, the MCP cleanup, andreturn sum(_results)) was lifted out of thetryso it sits after thefinallythat releases the lock. No behavior change toget_due_jobs,advance_next_run,_process_job, the workdir-vs-parallel split, theHERMES_CRON_MAX_PARALLEL/cron.max_parallel_jobsresolution, or the existingMagicMock-compatible early-return onverbose and not due_jobs.This is Fix 1 from the issue's proposal. Fix 2 (better grace / catch-up for interval jobs) and Fix 3 (cap
max_parallel_jobsdefault) are deliberately not in scope here — they address separate root causes and would be better as their own PRs.Test plan
TestParallelTick::test_tick_releases_file_lock_before_running_jobs: patchesfcntl.flock/msvcrt.lockingto logLOCK_UN/LK_UNLCKevents alongsiderun_jobcalls and asserts the lock is released before any job starts.[('run_job', 'long-1'), ('run_job', 'long-2'), ('lock_release',)]. With the fix applied the order flips and the assertion passes.uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/cron/test_scheduler.py -v—122 passed.tests/cron/—348 passed._process_jobbody, workdir vs parallel partitioning,as_completed(timeout=600)per-future handling (#5ed-style protection from7a7e78a36), and the MCP orphan cleanup are all preserved verbatim.Related
7a7e78a36(fix(cron): prevent parallel job result loss on exception) — that commit hardened result collection inside the pool; this PR pulls the pool itself out of the critical section.Positioning vs #21901 (
perf(cron): narrow file lock scope + heartbeat ticker)Both PRs move job execution outside the
fcntlcritical section. The differences are scope and execution semantics — picking one over the other is a real call for maintainers, not a pure dedup:as_completed(timeout=600))ThreadPoolExecutor.submit+shutdown(wait=False))TERMINAL_CWDenv mutation needs serial execution per the existing partitioning)as_completedresult accountingshutdown(wait=False)discards future results)gateway/run.pytick()onlyThis PR is intentionally the minimal change for the missed-runs symptom in #27485 — same fix-1 outcome, no change to execution semantics or workdir safety, plus the regression test that asserts the lock is released before any
run_job()call. Happy to defer to #21901 if the broader execution-model + observability changes are wanted in one go.