Skip to content

fix(cron): release tick file lock before executing due jobs#27492

Open
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/cron-tick-release-lock-before-jobs-27485
Open

fix(cron): release tick file lock before executing due jobs#27492
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/cron-tick-release-lock-before-jobs-27485

Conversation

@briandevans

@briandevans briandevans commented May 17, 2026

Copy link
Copy Markdown
Contributor

Summary

cron/scheduler.py::tick() currently holds the fcntl/msvcrt LOCK_EX for the full duration of every due job, not just the scheduling decision. A long-running delegation (multi-minute Opus run) starves every 60s ticker for the entire job lifetime, and combined with compute_next_run's grace window this silently drops overdue interval runs instead of catching up. This PR releases the lock as soon as the critical section (select due jobs + bump next_run_at) is done; job execution then runs lock-free.

Fixes #27485.

The bug

# cron/scheduler.py::tick()
fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
try:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # critical section ends here
    ...
    with ThreadPoolExecutor(...) as pool:
        for job in due_jobs:
            pool.submit(_process_job, job)   # ← lock still held
        # ThreadPoolExecutor.__exit__ blocks until all futures complete
finally:
    fcntl.flock(lock_fd, fcntl.LOCK_UN)       # ← only now

Observed effect (reported on the issue): a 5-minute interval cron had a 68-minute gap between runs because a parallel Opus job was holding the lock. The repro is just "configure a few interval crons, fire a multi-minute delegation, watch subsequent ticks skip with Tick skipped — another instance holds the lock."

The lock only needs to serialize across tick instances long enough that advance_next_run() can bump each job's next_run_at. After that, any concurrent tick that picks up the lock will see those jobs as no-longer-due and exit cleanly — so at-most-once semantics are preserved without the lock spanning execution.

The fix

Split tick() so the fcntl/msvcrt lock wraps only get_due_jobs() and the advance_next_run() loop. The lock is released in a finally immediately after that critical section; job execution, MCP orphan cleanup, and result accounting all run lock-free.

Concretely, the inner block (worker resolution, _process_job, workdir vs parallel partitioning, the ThreadPoolExecutor, the MCP cleanup, and return sum(_results)) was lifted out of the try so it sits after the finally that releases the lock. No behavior change to get_due_jobs, advance_next_run, _process_job, the workdir-vs-parallel split, the HERMES_CRON_MAX_PARALLEL / cron.max_parallel_jobs resolution, or the existing MagicMock-compatible early-return on verbose and not due_jobs.

This is Fix 1 from the issue's proposal. Fix 2 (better grace / catch-up for interval jobs) and Fix 3 (cap max_parallel_jobs default) are deliberately not in scope here — they address separate root causes and would be better as their own PRs.

Test plan

  • New regression test TestParallelTick::test_tick_releases_file_lock_before_running_jobs: patches fcntl.flock/msvcrt.locking to log LOCK_UN/LK_UNLCK events alongside run_job calls and asserts the lock is released before any job starts.
  • Regression guard: against the pre-fix scheduler, the new test fails with the expected event order — [('run_job', 'long-1'), ('run_job', 'long-2'), ('lock_release',)]. With the fix applied the order flips and the assertion passes.
  • Focused suite: uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/cron/test_scheduler.py -v122 passed.
  • Adjacent suite: full tests/cron/348 passed.
  • No production behaviour changes outside the lock-scope refactor: _process_job body, workdir vs parallel partitioning, as_completed(timeout=600) per-future handling (#5ed-style protection from 7a7e78a36), and the MCP orphan cleanup are all preserved verbatim.

Related

Positioning vs #21901 (perf(cron): narrow file lock scope + heartbeat ticker)

Both PRs move job execution outside the fcntl critical section. The differences are scope and execution semantics — picking one over the other is a real call for maintainers, not a pure dedup:

#27492 (this PR) #21901
Lock-narrowing yes yes
Execution model preserved (workdir-sequential + parallel pool, waits for completion via as_completed(timeout=600)) replaced with fire-and-forget (ThreadPoolExecutor.submit + shutdown(wait=False))
Workdir-sequential safety preserved verbatim (TERMINAL_CWD env mutation needs serial execution per the existing partitioning) partitioning removed; all jobs go through fire-and-forget
as_completed result accounting preserved dropped (shutdown(wait=False) discards future results)
MCP orphan cleanup at end of tick preserved removed
Extra observability none heartbeat file every 5 ticks + extra error logging in gateway/run.py
Diff size ~50 / ~5 around tick() only ~148 / ~123 across two files

This PR is intentionally the minimal change for the missed-runs symptom in #27485 — same fix-1 outcome, no change to execution semantics or workdir safety, plus the regression test that asserts the lock is released before any run_job() call. Happy to defer to #21901 if the broader execution-model + observability changes are wanted in one go.

Sibling code paths that may need the same fix: cron/jobs.py::compute_next_run grace window and the unbounded cron.max_parallel_jobs default are both called out in the issue as separate contributing factors to missed runs. Intentionally left out of this PR's scope to keep the diff small — happy to widen or follow up in a separate PR if preferred.

Copilot AI review requested due to automatic review settings May 17, 2026 16:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a scheduler regression where the tick file lock was held during job execution, causing subsequent ticks to be skipped and missed runs to be silently dropped.

Changes:

  • Refactors tick() to release the OS file lock immediately after advance_next_run() completes (before executing any jobs).
  • Adds a regression test asserting the lock is released before any run_job() call.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
cron/scheduler.py Shrinks the file-lock critical section to only due-job selection + next-run advancement; runs jobs after unlocking.
tests/cron/test_scheduler.py Adds regression coverage ensuring the lock is released before jobs execute.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cron/scheduler.py Outdated
Comment on lines +1803 to +1815
# Partition due jobs: those with a per-job workdir mutate
# os.environ["TERMINAL_CWD"] inside run_job, which is process-global —
# so they MUST run sequentially to avoid corrupting each other. Jobs
# without a workdir leave env untouched and stay parallel-safe.
workdir_jobs = [j for j in due_jobs if (j.get("workdir") or "").strip()]
parallel_jobs = [j for j in due_jobs if not (j.get("workdir") or "").strip()]

_results: list = []

# Sequential pass for workdir jobs.
for job in workdir_jobs:
_ctx = contextvars.copy_context()
_results.append(_ctx.run(_process_job, job))
Comment thread cron/scheduler.py
Comment on lines +1735 to +1753
# Resolve max parallel workers: env var > config.yaml > unbounded.
# Set HERMES_CRON_MAX_PARALLEL=1 to restore old serial behaviour.
_max_workers: Optional[int] = None
try:
_env_par = os.getenv("HERMES_CRON_MAX_PARALLEL", "").strip()
if _env_par:
_max_workers = int(_env_par) or None
except (ValueError, TypeError):
logger.warning("Invalid HERMES_CRON_MAX_PARALLEL value; defaulting to unbounded")
if _max_workers is None:
try:
_ucfg = load_config() or {}
_cfg_par = (
_ucfg.get("cron", {}) if isinstance(_ucfg, dict) else {}
).get("max_parallel_jobs")
if _cfg_par is not None:
_max_workers = int(_cfg_par) or None
except Exception:
pass
Comment on lines +2242 to +2247
from cron import scheduler

if scheduler.fcntl is None and scheduler.msvcrt is None:
pytest.skip("no file-locking primitive on this platform")

events: list = []
@alt-glitch alt-glitch added type/bug Something isn't working comp/cron Cron scheduler and job management P1 High — major feature broken, no workaround labels May 17, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to open PR #21901 (perf(cron): narrow file lock scope + heartbeat ticker) which targets the same lock-narrowing fix in cron/scheduler.py. Both PRs move job execution outside the fcntl critical section. #21901 also adds a heartbeat ticker. Maintainers should pick one and close the other.

@briandevans

Copy link
Copy Markdown
Contributor Author

CI audit — all 21 test failures are pre-existing baselines on clean origin/main (f36c89c). Zero failures are in touched code (cron/scheduler.py, tests/cron/test_parallel_tick.py).

Same set as the audit I posted on #27514 a few hours ago — identical 21 failures with identical error text (acp version bump, trust_env kwarg, direct_messages_topic_id metadata, _BUILTIN_SUBCOMMANDS missing send, etc.). Reproduces on main's push CI run 25989069834. Full per-test table at #27514.

Happy to address any failure that's actually in scope here.

@briandevans briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 3 times, most recently from 62e2afb to a839210 Compare May 24, 2026 18:16
@briandevans briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 18 times, most recently from 90a1310 to 16ceedf Compare June 2, 2026 14:15
@briandevans briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 12 times, most recently from fb819fe to 2056b31 Compare June 3, 2026 10:15
@liuhao1024

Copy link
Copy Markdown
Contributor

I verified this lock-narrowing fix is correct and the critical section boundary is well-placed.

Critical section scope: The try/finally now covers only get_due_jobs() + advance_next_run() — the minimum code that must be serialized. Once advance_next_run() has updated each job's next_run_at, concurrent ticks see them as no-longer-due, preserving at-most-once semantics.

Lock release before execution: Confirmed the file lock is released (in the finally block) before _process_job runs any due job. This means a long-running delegation no longer blocks subsequent tick cycles, which was the root cause of missed runs in #27485.

Variable scoping: due_jobs is assigned inside the try block before the finally releases the lock, so it's safely available for the post-lock execution phase. If get_due_jobs() or advance_next_run() raises, the lock is still released and the exception propagates — due_jobs would be undefined but execution stops anyway.

Concurrency semantics: The sequential/parallel job partitioning and contextvars.copy_context() usage are unchanged — they were moved outside the lock but retain identical behavior. The test correctly verifies lock_release precedes the first run_job call by tracking events with indices.

Edge case: If advance_next_run() fails mid-loop (e.g., for job 3 of 5), jobs 1-2 are advanced but 3-5 are not. A concurrent tick would re-pick 3-5 and potentially double-execute them. This is a pre-existing issue (the old code had the same problem inside the lock) and not introduced by this PR — worth noting but not blocking.

@briandevans briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch 12 times, most recently from b97e0cc to c8ea168 Compare June 4, 2026 11:17
Previously the fcntl/msvcrt LOCK_EX taken at the top of `tick()` was
held until the `ThreadPoolExecutor` exited — i.e. until every due job
had finished running. For a multi-minute Opus delegation that meant
every 60s ticker attempt during the run hit the lock, returned 0, and
combined with `compute_next_run`'s grace window (half-period, capped
at 2h) silently dropped overdue interval jobs instead of catching up.

The lock only needs to serialize the critical section that selects
due jobs and advances their `next_run_at`. Once `advance_next_run()`
has bumped each job's `next_run_at`, any concurrent tick that picks
up the lock afterwards will see the same jobs as no-longer-due, so
at-most-once semantics are preserved without holding the lock through
the rest of the function. The body of `tick()` is restructured so the
lock is released in a `finally` immediately after that critical
section; job execution, MCP orphan cleanup, and the result accounting
all run lock-free.

Behavior of `get_due_jobs`, `advance_next_run`, `_process_job`, the
workdir vs parallel partitioning, and the
`HERMES_CRON_MAX_PARALLEL` / `cron.max_parallel_jobs` resolution is
otherwise unchanged.

Regression test in `TestParallelTick` patches `fcntl.flock` /
`msvcrt.locking` to record `LOCK_UN`/`LK_UNLCK` events alongside
`run_job` calls and asserts the lock is released before any job
starts. The same test against the pre-fix scheduler shows the
expected event order `run_job, run_job, lock_release`.

Fixes NousResearch#27485.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans briandevans force-pushed the fix/cron-tick-release-lock-before-jobs-27485 branch from c8ea168 to 48b9823 Compare June 4, 2026 12:12
@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the focused cron fix — the lock-boundary idea is still useful, but the surrounding scheduler has moved on current main.

Problems

  • The PR diff targets the old inline ThreadPoolExecutor/as_completed shape. Current cron/scheduler.py now has persistent pools and a running-job guard (_get_parallel_pool, _get_sequential_pool, _running_job_ids) at cron/scheduler.py:176-204 and dispatches through _submit_with_guard at cron/scheduler.py:2101-2123.
  • The production gateway ticker now calls cron_tick(..., sync=False) at gateway/run.py:15693, so the original “ticker waits for long-running jobs” path has already changed. Current main still unlocks in the final finally at cron/scheduler.py:2198-2209, after dispatch/sync waiting, so the exact release-before-run invariant is not implemented as-is.

Suggested changes

  • Salvage the lock narrowing onto the current tick() implementation: keep the lock around only get_due_jobs() + advance_next_run(), then release before _submit_with_guard/pool submission while preserving sync, persistent pools, _running_job_ids, and MCP cleanup callbacks.
  • Update the regression test to cover the current dispatch model, especially the production sync=False path.

Automated hermes-sweeper review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs

5 participants