fix(cron): retry delivery from cached output instead of re-running the agent (#8846)#16645
Open
prasadus92 wants to merge 1 commit into
Open
Conversation
…e agent (NousResearch#8846) When a cron job's task succeeds but delivery to the messaging platform fails (Telegram 502 / schema validation, Slack rate-limit, transient network), `mark_job_run` already records `last_status=ok` and tracks `last_delivery_error` separately. What was missing was the retry: the delivery would be lost on the floor until the next scheduled run, which for daily/weekly jobs means the user never sees the result. Adds a delivery-only retry path keyed off the cached output file under `~/.hermes/cron/output/<job_id>/<timestamp>.md`: cron/jobs.py - last_delivery_retry envelope on the job: output_file, attempts, next_attempt_at, last_error - DELIVERY_RETRY_BACKOFF_SECONDS = [60, 300, 1800], 3 attempts max - schedule_delivery_retry / mark_delivery_retry_attempt / clear_delivery_retry / get_due_delivery_retries cron/scheduler.py - tick() drains due retries before the regular due-jobs scan, via _drain_due_delivery_retries -> _deliver_result with the cached payload. The agent is never re-executed. - _process_job calls schedule_delivery_retry on the *first* delivery failure for a successful task. The retry pass runs first because it is cheap (no LLM, no tool calls) and a single bad envelope cannot break the rest of the tick. If the cached output file is missing (user pruned it), we surrender immediately rather than spinning - the file is the source of truth. After the budget is exhausted the envelope is cleared but `last_delivery_error` stays populated so `hermes cron list` continues to surface the failure to the user.
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a cron job's task succeeds but delivery to the messaging platform fails (Telegram 502, schema validation, Slack rate-limit, transient network), the agent's output is already saved to
~/.hermes/cron/output/<job_id>/<timestamp>.mdandmark_job_runalready setslast_status=okwithlast_delivery_errortracked separately. What was missing was the retry: the delivery would sit on the floor until the next scheduled run, which for daily / weekly jobs means the user never sees the result.This PR adds a delivery-only retry path keyed off the cached output file. The agent is never re-executed - the saved
.mdis the source of truth.Fixes #8846.
Design
Per-job retry envelope persisted in
jobs.json:cron/jobs.py::DELIVERY_RETRY_BACKOFF_SECONDSso the policy is one edit away.tick()drains due retries before the regular due-jobs scan, via a new_drain_due_delivery_retrieshelper that reads the cached file and calls_deliver_result. Cheap path (no LLM, no tool calls, norun_job).last_delivery_errorstays populated, sohermes cron listcontinues to surface the failure.~/.hermes/cron/output/), we surrender immediately rather than spinning - the file is the source of truth and we can't reconstruct it.Files changed
cron/jobs.pylast_delivery_retryfield on the job dict (initialised toNoneincreate_job).DELIVERY_RETRY_BACKOFF_SECONDS = [60, 300, 1800],DELIVERY_RETRY_MAX_ATTEMPTS = 3.schedule_delivery_retry,mark_delivery_retry_attempt,clear_delivery_retry,get_due_delivery_retries- all under the existing_jobs_file_lock, same pattern asmark_job_run.cron/scheduler.py_process_jobcallsschedule_delivery_retryon the first delivery failure for a successful task.mark_job_runis unchanged -task_status=okwas already correct for this case.tick()runs_drain_due_delivery_retriesfirst. Returnssum(_results) + retry_count.tests/cron/test_delivery_retry.py(new, 14 cases): backoff schedule, envelope CRUD, due-job selection, drain replays from the cached file, drain advances backoff through to exhaustion, missing-file surrender path.Scope narrowing vs #13946 and #10443
Both adjacent PRs target cron retry / reliability with broader scope:
feat(cron): add bounded retry path for transient delivery failures, 503 +/16 -) introduces a transient-vs-terminal classification surface and threads it throughcron/jobs.pyandcron/scheduler.py. Useful direction but bigger - and it operates on the in-memory delivery result, not the cached output file, so a process restart between failure and retry loses the message.Cron scheduler reliability: observability, retry, audit trail, 441 +/84 -) adds a JSONL event log, a stale one-shot detector, retry policy, and a separate audit trail. Multi-concern PR.This PR is intentionally minimal: a single retry envelope on the existing job record, drained from the existing tick loop, against the existing cached output file. ~190 production lines, additive only, no schema migration needed (existing jobs have
last_delivery_retrydefault toNone). Should land cleanly alongside or before either of the two larger ones.Test plan
pytest tests/cron/test_jobs.py tests/cron/test_delivery_retry.py -v -p no:xdist -o addopts=''- 78 passed (64 pre-existing + 14 new). Output inTEST_OUTPUT.txt.task_status=oksemantics unchanged:mark_job_runonly seessuccess, the retry envelope is set aftermark_job_runreturns, and the budget-exhausted path leaveslast_status=okand only updateslast_delivery_error.tests/cron/test_scheduler.pyfailures are present without this change too - they relate to test-only deps (toolset registry / wake gate) and are not introduced here.Out of scope
hermes cron list. The data is on the job record, but the renderer change belongs inhermes_cli/cron_cli.pyand would push this PR past its narrow scope.hermes updatereboots beyond whatjobs.jsonalready gives us.