fix(cron): release tick lock before job execution to prevent scheduler starvation #38624

Open
liuhao1024 wants to merge 2 commits into NousResearch:main from liuhao1024:fix/cron-tick-lock-starvation

@liuhao1024

Contributor

What does this PR do?

Releases the tick file lock before executing cron jobs, preventing scheduler starvation when jobs run long. Previously, tick() held LOCK_EX for the entire duration of all job executions, not just the scheduling decision, so every 60s tick that fired while a job was still running found the lock held, skipped, and missed scheduled runs.

Related Issue

Fixes #27485

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • cron/scheduler.py: Restructured tick() to release the file lock immediately after advance_next_run() completes. The lock is now held only during the critical section (get_due_jobs + advance_next_run), not during job execution. At-most-once semantics are preserved because next_run_at is advanced before any job starts.
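
For reviewers, the restructured flow can be sketched roughly as follows. This is a minimal illustration, not the actual code in cron/scheduler.py: the `tick()` signature, the `lock_path` parameter, and passing `get_due_jobs`, `advance_next_run`, and `run_job` as callables are all assumptions made to keep the example self-contained. It uses `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import os
from typing import Callable, Iterable, List


def tick(
    lock_path: str,
    get_due_jobs: Callable[[], Iterable[dict]],
    advance_next_run: Callable[[dict], None],
    run_job: Callable[[dict], None],
) -> int:
    """Run one scheduler tick and return the number of jobs executed.

    Hypothetical signature: the real tick() may look different.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking: if another tick is inside its critical
            # section, skip this tick instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return 0
        try:
            # Critical section: decide what is due and claim it by
            # advancing next_run_at before anything executes.
            due: List[dict] = list(get_due_jobs())
            for job in due:
                advance_next_run(job)
        finally:
            # Release before any job runs so later ticks are not starved.
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)

    # Job execution happens outside the lock.
    for job in due:
        run_job(job)
    return len(due)
```

The key design point is that the lock protects only the scheduling decision. A job that runs for several minutes no longer blocks the next tick from acquiring the lock.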

How to Test

  1. Create a cron job with a short interval (e.g., 5 minutes) and a long-running prompt (e.g., one that triggers a multi-minute Opus delegation)
  2. Verify the scheduler continues ticking every 60 seconds during job execution (check logs for "job(s) due" entries)
  3. Verify no missed runs: the next scheduled time should advance correctly even when the previous job is still running
  4. Run pytest tests/cron/ -q. All existing tests should pass except one pre-existing failure in test_all_token_case_insensitive, which is unrelated to this change
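
To make the expected difference in step 2 concrete, here is a small deterministic simulation of which ticks get skipped. It is a toy model under stated assumptions (a single job that starts at t=0, a fixed 60-second tick interval, and a hypothetical `simulate()` helper); it does not touch the real scheduler.

```python
from typing import List


def simulate(
    job_seconds: int,
    horizon: int,
    hold_lock_during_job: bool,
    tick_every: int = 60,
) -> List[int]:
    """Return the tick times (seconds) that were skipped because the lock was held.

    Toy model: one job starts at the first tick and runs for job_seconds.
    """
    lock_free_at = 0      # simulated time at which the tick lock is free again
    job_started = False
    skipped: List[int] = []
    for now in range(0, horizon + 1, tick_every):
        if now < lock_free_at:
            # Lock still held by the earlier tick: this tick is skipped.
            skipped.append(now)
            continue
        if not job_started:
            job_started = True
            if hold_lock_during_job:
                # Old behavior: the lock stays held until the job finishes.
                lock_free_at = now + job_seconds
            # New behavior: the lock is released before the job runs,
            # so lock_free_at is left unchanged.
    return skipped


old = simulate(job_seconds=180, horizon=300, hold_lock_during_job=True)
new = simulate(job_seconds=180, horizon=300, hold_lock_during_job=False)
```

In this model, a 3-minute job under the old behavior causes the ticks at 60s and 120s to be skipped, while under the new behavior no tick is skipped.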

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/cron/ -q and all tests pass except 1 pre-existing failure unrelated to this change
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Code Intelligence

  • Analyzed: cron/scheduler.py::tick() (callers: gateway ticker thread, standalone daemon, manual hermes cron run)
  • Blast radius: LOW — lock release is moved earlier; no new code paths introduced
  • Related patterns: advance_next_run() already sets next_run_at before execution, preserving at-most-once semantics; mark_job_run() uses its own _jobs_file_lock for thread safety
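
The at-most-once argument can be illustrated with a short sketch: if the "is it due?" check and the advance of next_run_at happen together under one lock, only one of several concurrent ticks can claim a given run. The `JobStore` class and `run_concurrent_ticks()` helper below are hypothetical, and a `threading.Lock` stands in for the file lock.

```python
import threading
from typing import List


class JobStore:
    """Hypothetical single-job store; not the project's real data model."""

    def __init__(self, interval: float) -> None:
        self._lock = threading.Lock()  # stands in for the tick file lock
        self.next_run_at = 0.0
        self.interval = interval

    def claim_if_due(self, now: float) -> bool:
        """Atomically check whether the job is due and advance next_run_at."""
        with self._lock:
            if now < self.next_run_at:
                return False
            # Advance before execution: later ticks no longer see it as due.
            self.next_run_at = now + self.interval
            return True


def run_concurrent_ticks(store: JobStore, now: float, n: int) -> List[bool]:
    """Fire n ticks at the same simulated time and collect who claimed the job."""
    results: List[bool] = []
    results_lock = threading.Lock()

    def one_tick() -> None:
        claimed = store.claim_if_due(now)
        with results_lock:
            results.append(claimed)

    threads = [threading.Thread(target=one_tick) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


store = JobStore(interval=300.0)
results = run_concurrent_ticks(store, now=0.0, n=8)
```

Exactly one of the eight ticks claims the run. The other seven see the already-advanced next_run_at and do nothing, which is why releasing the lock before execution does not cause duplicate runs.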

…r starvation

The tick() function held a LOCK_EX file lock for the entire duration of
all job executions, not just the scheduling decision. When a cron job
ran a long-running task (e.g. an Opus delegation lasting 2-4 min),
every subsequent 60s tick attempt would hit the lock, skip, and return 0.

Combined with the grace-window logic in compute_next_run, missed runs
were silently dropped with no error logged.

Fix: release the file lock immediately after advance_next_run() completes.
The at-most-once semantics are preserved because next_run_at is already
advanced before any job starts executing.

Fixes NousResearch#27485
@alt-glitch added the type/bug (Something isn't working), comp/cron (Cron scheduler and job management), and P2 (Medium: degraded but workaround exists) labels on Jun 4, 2026
Development

Successfully merging this pull request may close these issues.

cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs

2 participants