fix(cron): release tick file lock before executing due jobs by briandevans · Pull Request #27492 · NousResearch/hermes-agent

briandevans · 2026-05-17T16:16:33Z

Summary

cron/scheduler.py::tick() currently holds the fcntl/msvcrt LOCK_EX for the full duration of every due job, not just the scheduling decision. A long-running delegation (multi-minute Opus run) starves every 60s ticker for the entire job lifetime, and combined with compute_next_run's grace window this silently drops overdue interval runs instead of catching up. This PR releases the lock as soon as the critical section (select due jobs + bump next_run_at) is done; job execution then runs lock-free.

Fixes #27485.

The bug

# cron/scheduler.py::tick()
fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
try:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # critical section ends here
    ...
    with ThreadPoolExecutor(...) as pool:
        for job in due_jobs:
            pool.submit(_process_job, job)   # ← lock still held
        # ThreadPoolExecutor.__exit__ blocks until all futures complete
finally:
    fcntl.flock(lock_fd, fcntl.LOCK_UN)       # ← only now

Observed effect (reported on the issue): a 5-minute interval cron had a 68-minute gap between runs because a parallel Opus job was holding the lock. The repro is just "configure a few interval crons, fire a multi-minute delegation, watch subsequent ticks skip with Tick skipped — another instance holds the lock."

The lock only needs to serialize across tick instances long enough that advance_next_run() can bump each job's next_run_at. After that, any concurrent tick that picks up the lock will see those jobs as no-longer-due and exit cleanly — so at-most-once semantics are preserved without the lock spanning execution.

The fix

Split tick() so the fcntl/msvcrt lock wraps only get_due_jobs() and the advance_next_run() loop. The lock is released in a finally immediately after that critical section; job execution, MCP orphan cleanup, and result accounting all run lock-free.

Concretely, the inner block (worker resolution, _process_job, workdir vs parallel partitioning, the ThreadPoolExecutor, the MCP cleanup, and return sum(_results)) was lifted out of the try so it sits after the finally that releases the lock. No behavior change to get_due_jobs, advance_next_run, _process_job, the workdir-vs-parallel split, the HERMES_CRON_MAX_PARALLEL / cron.max_parallel_jobs resolution, or the existing MagicMock-compatible early-return on verbose and not due_jobs.

This is Fix 1 from the issue's proposal. Fix 2 (better grace / catch-up for interval jobs) and Fix 3 (cap max_parallel_jobs default) are deliberately not in scope here — they address separate root causes and would be better as their own PRs.

Test plan

New regression test TestParallelTick::test_tick_releases_file_lock_before_running_jobs: patches fcntl.flock/msvcrt.locking to log LOCK_UN/LK_UNLCK events alongside run_job calls and asserts the lock is released before any job starts.
Regression guard: against the pre-fix scheduler, the new test fails with the expected event order — [('run_job', 'long-1'), ('run_job', 'long-2'), ('lock_release',)]. With the fix applied the order flips and the assertion passes.
Focused suite: uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/cron/test_scheduler.py -v — 122 passed.
Adjacent suite: full tests/cron/ — 348 passed.
No production behaviour changes outside the lock-scope refactor: _process_job body, workdir vs parallel partitioning, as_completed(timeout=600) per-future handling (#5ed-style protection from 7a7e78a36), and the MCP orphan cleanup are all preserved verbatim.

Fixes cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs #27485.
Complementary to 7a7e78a36 (fix(cron): prevent parallel job result loss on exception) — that commit hardened result collection inside the pool; this PR pulls the pool itself out of the critical section.

Positioning vs #21901 (`perf(cron): narrow file lock scope + heartbeat ticker`)

Both PRs move job execution outside the fcntl critical section. The differences are scope and execution semantics — picking one over the other is a real call for maintainers, not a pure dedup:

	#27492 (this PR)	#21901
Lock-narrowing	yes	yes
Execution model	preserved (workdir-sequential + parallel pool, waits for completion via `as_completed(timeout=600)`)	replaced with fire-and-forget (`ThreadPoolExecutor.submit` + `shutdown(wait=False)`)
Workdir-sequential safety	preserved verbatim (`TERMINAL_CWD` env mutation needs serial execution per the existing partitioning)	partitioning removed; all jobs go through fire-and-forget
`as_completed` result accounting	preserved	dropped (`shutdown(wait=False)` discards future results)
MCP orphan cleanup at end of tick	preserved	removed
Extra observability	none	heartbeat file every 5 ticks + extra error logging in `gateway/run.py`
Diff size	~50 / ~5 around `tick()` only	~148 / ~123 across two files

This PR is intentionally the minimal change for the missed-runs symptom in #27485 — same fix-1 outcome, no change to execution semantics or workdir safety, plus the regression test that asserts the lock is released before any run_job() call. Happy to defer to #21901 if the broader execution-model + observability changes are wanted in one go.

Sibling code paths that may need the same fix: cron/jobs.py::compute_next_run grace window and the unbounded cron.max_parallel_jobs default are both called out in the issue as separate contributing factors to missed runs. Intentionally left out of this PR's scope to keep the diff small — happy to widen or follow up in a separate PR if preferred.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a scheduler regression where the tick file lock was held during job execution, causing subsequent ticks to be skipped and missed runs to be silently dropped.

Changes:

Refactors tick() to release the OS file lock immediately after advance_next_run() completes (before executing any jobs).
Adds a regression test asserting the lock is released before any run_job() call.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`cron/scheduler.py`	Shrinks the file-lock critical section to only due-job selection + next-run advancement; runs jobs after unlocking.
`tests/cron/test_scheduler.py`	Adds regression coverage ensuring the lock is released before jobs execute.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # Partition due jobs: those with a per-job workdir mutate
+    # os.environ["TERMINAL_CWD"] inside run_job, which is process-global —
+    # so they MUST run sequentially to avoid corrupting each other.  Jobs
+    # without a workdir leave env untouched and stay parallel-safe.
+    workdir_jobs = [j for j in due_jobs if (j.get("workdir") or "").strip()]
+    parallel_jobs = [j for j in due_jobs if not (j.get("workdir") or "").strip()]
+
+    _results: list = []
+
+    # Sequential pass for workdir jobs.
+    for job in workdir_jobs:
+        _ctx = contextvars.copy_context()
+        _results.append(_ctx.run(_process_job, job))


+    # Resolve max parallel workers: env var > config.yaml > unbounded.
+    # Set HERMES_CRON_MAX_PARALLEL=1 to restore old serial behaviour.
+    _max_workers: Optional[int] = None
+    try:
+        _env_par = os.getenv("HERMES_CRON_MAX_PARALLEL", "").strip()
+        if _env_par:
+            _max_workers = int(_env_par) or None
+    except (ValueError, TypeError):
+        logger.warning("Invalid HERMES_CRON_MAX_PARALLEL value; defaulting to unbounded")
+    if _max_workers is None:
+        try:
+            _ucfg = load_config() or {}
+            _cfg_par = (
+                _ucfg.get("cron", {}) if isinstance(_ucfg, dict) else {}
+            ).get("max_parallel_jobs")
+            if _cfg_par is not None:
+                _max_workers = int(_cfg_par) or None
+        except Exception:
+            pass


+        from cron import scheduler
+
+        if scheduler.fcntl is None and scheduler.msvcrt is None:
+            pytest.skip("no file-locking primitive on this platform")
+
+        events: list = []


alt-glitch · 2026-05-17T16:35:11Z

Related to open PR #21901 (perf(cron): narrow file lock scope + heartbeat ticker) which targets the same lock-narrowing fix in cron/scheduler.py. Both PRs move job execution outside the fcntl critical section. #21901 also adds a heartbeat ticker. Maintainers should pick one and close the other.

briandevans · 2026-05-17T19:09:12Z

CI audit — all 21 test failures are pre-existing baselines on clean origin/main (f36c89c). Zero failures are in touched code (cron/scheduler.py, tests/cron/test_parallel_tick.py).

Same set as the audit I posted on #27514 a few hours ago — identical 21 failures with identical error text (acp version bump, trust_env kwarg, direct_messages_topic_id metadata, _BUILTIN_SUBCOMMANDS missing send, etc.). Reproduces on main's push CI run 25989069834. Full per-test table at #27514.

Happy to address any failure that's actually in scope here.

liuhao1024 · 2026-06-03T10:26:31Z

I verified this lock-narrowing fix is correct and the critical section boundary is well-placed.

Critical section scope: The try/finally now covers only get_due_jobs() + advance_next_run() — the minimum code that must be serialized. Once advance_next_run() has updated each job's next_run_at, concurrent ticks see them as no-longer-due, preserving at-most-once semantics.

Lock release before execution: Confirmed the file lock is released (in the finally block) before _process_job runs any due job. This means a long-running delegation no longer blocks subsequent tick cycles, which was the root cause of missed runs in #27485.

Variable scoping: due_jobs is assigned inside the try block before the finally releases the lock, so it's safely available for the post-lock execution phase. If get_due_jobs() or advance_next_run() raises, the lock is still released and the exception propagates — due_jobs would be undefined but execution stops anyway.

Concurrency semantics: The sequential/parallel job partitioning and contextvars.copy_context() usage are unchanged — they were moved outside the lock but retain identical behavior. The test correctly verifies lock_release precedes the first run_job call by tracking events with indices.

Edge case: If advance_next_run() fails mid-loop (e.g., for job 3 of 5), jobs 1-2 are advanced but 3-5 are not. A concurrent tick would re-pick 3-5 and potentially double-execute them. This is a pre-existing issue (the old code had the same problem inside the lock) and not introduced by this PR — worth noting but not blocking.

Previously the fcntl/msvcrt LOCK_EX taken at the top of `tick()` was held until the `ThreadPoolExecutor` exited — i.e. until every due job had finished running. For a multi-minute Opus delegation that meant every 60s ticker attempt during the run hit the lock, returned 0, and combined with `compute_next_run`'s grace window (half-period, capped at 2h) silently dropped overdue interval jobs instead of catching up. The lock only needs to serialize the critical section that selects due jobs and advances their `next_run_at`. Once `advance_next_run()` has bumped each job's `next_run_at`, any concurrent tick that picks up the lock afterwards will see the same jobs as no-longer-due, so at-most-once semantics are preserved without holding the lock through the rest of the function. The body of `tick()` is restructured so the lock is released in a `finally` immediately after that critical section; job execution, MCP orphan cleanup, and the result accounting all run lock-free. Behavior of `get_due_jobs`, `advance_next_run`, `_process_job`, the workdir vs parallel partitioning, and the `HERMES_CRON_MAX_PARALLEL` / `cron.max_parallel_jobs` resolution is otherwise unchanged. Regression test in `TestParallelTick` patches `fcntl.flock` / `msvcrt.locking` to record `LOCK_UN`/`LK_UNLCK` events alongside `run_job` calls and asserts the lock is released before any job starts. The same test against the pre-fix scheduler shows the expected event order `run_job, run_job, lock_release`. Fixes NousResearch#27485. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

teknium1 · 2026-06-13T12:04:27Z

Thanks for the focused cron fix — the lock-boundary idea is still useful, but the surrounding scheduler has moved on current main.

Problems

The PR diff targets the old inline ThreadPoolExecutor/as_completed shape. Current cron/scheduler.py now has persistent pools and a running-job guard (_get_parallel_pool, _get_sequential_pool, _running_job_ids) at cron/scheduler.py:176-204 and dispatches through _submit_with_guard at cron/scheduler.py:2101-2123.
The production gateway ticker now calls cron_tick(..., sync=False) at gateway/run.py:15693, so the original “ticker waits for long-running jobs” path has already changed. Current main still unlocks in the final finally at cron/scheduler.py:2198-2209, after dispatch/sync waiting, so the exact release-before-run invariant is not implemented as-is.

Suggested changes

Salvage the lock narrowing onto the current tick() implementation: keep the lock around only get_due_jobs() + advance_next_run(), then release before _submit_with_guard/pool submission while preserving sync, persistent pools, _running_job_ids, and MCP cleanup callbacks.
Update the regression test to cover the current dispatch model, especially the production sync=False path.

Automated hermes-sweeper review.

Copilot AI review requested due to automatic review settings May 17, 2026 16:16

Copilot AI reviewed May 17, 2026

View reviewed changes

alt-glitch added type/bug Something isn't working comp/cron Cron scheduler and job management P1 High — major feature broken, no workaround labels May 17, 2026

alt-glitch mentioned this pull request May 20, 2026

Fix scheduler wedge: scope tick lock to dispatch, add LLM stream timeouts #29350

Closed

briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 3 times, most recently from 62e2afb to a839210 Compare May 24, 2026 18:16

Ordinath mentioned this pull request May 26, 2026

fix(cron): release tick lock before running jobs #32786

Closed

briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 18 times, most recently from 90a1310 to 16ceedf Compare June 2, 2026 14:15

briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 12 times, most recently from fb819fe to 2056b31 Compare June 3, 2026 10:15

briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 12 times, most recently from b97e0cc to c8ea168 Compare June 4, 2026 11:17

briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch from c8ea168 to 48b9823 Compare June 4, 2026 12:12

alt-glitch mentioned this pull request Jun 10, 2026

fix(cron): release tick lock before job execution #43652

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cron): release tick file lock before executing due jobs#27492

fix(cron): release tick file lock before executing due jobs#27492
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/cron-tick-release-lock-before-jobs-27485

briandevans commented May 17, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

alt-glitch commented May 17, 2026

Uh oh!

briandevans commented May 17, 2026

Uh oh!

liuhao1024 commented Jun 3, 2026

Uh oh!

teknium1 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

briandevans commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The bug

The fix

Test plan

Related

Positioning vs #21901 (perf(cron): narrow file lock scope + heartbeat ticker)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

alt-glitch commented May 17, 2026

Uh oh!

briandevans commented May 17, 2026

Uh oh!

liuhao1024 commented Jun 3, 2026

Uh oh!

teknium1 commented Jun 13, 2026

Problems

Suggested changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

briandevans commented May 17, 2026 •

edited

Loading

Positioning vs #21901 (`perf(cron): narrow file lock scope + heartbeat ticker`)