Skip to content

fix(cron): release tick lock before job execution#43652

Open
lac1203 wants to merge 1 commit into
NousResearch:mainfrom
lac1203:fix/cron-tick-lock-release
Open

fix(cron): release tick lock before job execution#43652
lac1203 wants to merge 1 commit into
NousResearch:mainfrom
lac1203:fix/cron-tick-lock-release

Conversation

@lac1203

@lac1203 lac1203 commented Jun 10, 2026

Copy link
Copy Markdown

Summary\n- release the cross-process cron tick lock after due-job selection and pre-run advancement instead of holding it through job execution\n- keep the existing at-most-once recurring-job advancement semantics\n- add a regression test proving a second tick can run a newly-due job while a previous long-running cron job is still executing\n\n## Test plan\n- python -m pytest tests/cron -q -o 'addopts='\n\n## Context\nObserved in gateway: a smoke-test cron job stayed scheduled with next_run_at in the past and last_run_at=null while another long-running cron job held ~/.hermes/cron/.tick.lock.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management duplicate This issue or pull request already exists labels Jun 10, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #21901 -- both narrow the cron tick() file lock to the scheduling critical section (get_due_jobs + advance_next_run) and release it before job execution, fixing tick starvation while a long job holds the lock. Saturated cluster; also overlaps open #38624 and #27492. #21901 is the earliest open canonical.

@liuhao1024

Copy link
Copy Markdown
Contributor

Verification: Lock release before job execution — clean concurrency fix

The change correctly moves fcntl.flock(LOCK_UN) + lock_fd.close() from the post-execution finally block to before the job execution loop. The _release_tick_lock() helper with its lock_released guard ensures idempotent release even if called from both the early-release path and the finally block.

Key correctness points:

  1. Lock scope is preserved — the lock still guards get_due_jobs() + advance_next_run() (preventing duplicate scheduling), but releases before run_job() starts.
  2. No double-unlock — the nonlocal lock_released flag prevents the finally block from calling unlock again.
  3. Thread-based test — the test proves that a second tick can proceed while the first tick's job is still running. Without this fix, the second tick would skip due to the held lock.

The fix directly addresses the scenario where a long-running cron job (e.g., a contribution cron with 10-minute git operations) blocks all subsequent gateway ticks, causing newly-due jobs to stay stuck in the past.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management duplicate This issue or pull request already exists P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants