fix(cron): retry delivery from cached output instead of re-running the agent (#8846) by prasadus92 · Pull Request #16645 · NousResearch/hermes-agent

prasadus92 · 2026-04-27T16:50:50Z

Summary

When a cron job's task succeeds but delivery to the messaging platform fails (Telegram 502, schema validation, Slack rate-limit, transient network), the agent's output is already saved to ~/.hermes/cron/output/<job_id>/<timestamp>.md and mark_job_run already sets last_status=ok with last_delivery_error tracked separately. What was missing was the retry: the delivery would sit on the floor until the next scheduled run, which for daily / weekly jobs means the user never sees the result.

This PR adds a delivery-only retry path keyed off the cached output file. The agent is never re-executed - the saved .md is the source of truth.

Fixes #8846.

Design

Per-job retry envelope persisted in jobs.json:

"last_delivery_retry": {
  "output_file": "/Users/.../hermes/cron/output/abc123/2026-04-26_18-30-00.md",
  "attempts": 1,
  "next_attempt_at": "2026-04-26T18:35:00+00:00",
  "last_error": "telegram returned 502"
}

3 attempts max, exponential backoff at 60 s, 300 s, 1800 s (5 min, 30 min). Constants live in cron/jobs.py::DELIVERY_RETRY_BACKOFF_SECONDS so the policy is one edit away.
tick() drains due retries before the regular due-jobs scan, via a new _drain_due_delivery_retries helper that reads the cached file and calls _deliver_result. Cheap path (no LLM, no tool calls, no run_job).
After the budget is exhausted the envelope is cleared but last_delivery_error stays populated, so hermes cron list continues to surface the failure.
If the cached output file is missing (user pruned ~/.hermes/cron/output/), we surrender immediately rather than spinning - the file is the source of truth and we can't reconstruct it.

Files changed

cron/jobs.py
- New last_delivery_retry field on the job dict (initialised to None in create_job).
- DELIVERY_RETRY_BACKOFF_SECONDS = [60, 300, 1800], DELIVERY_RETRY_MAX_ATTEMPTS = 3.
- schedule_delivery_retry, mark_delivery_retry_attempt, clear_delivery_retry, get_due_delivery_retries - all under the existing _jobs_file_lock, same pattern as mark_job_run.
cron/scheduler.py
- _process_job calls schedule_delivery_retry on the first delivery failure for a successful task. mark_job_run is unchanged - task_status=ok was already correct for this case.
- tick() runs _drain_due_delivery_retries first. Returns sum(_results) + retry_count.
tests/cron/test_delivery_retry.py (new, 14 cases): backoff schedule, envelope CRUD, due-job selection, drain replays from the cached file, drain advances backoff through to exhaustion, missing-file surrender path.

Scope narrowing vs #13946 and #10443

Both adjacent PRs target cron retry / reliability with broader scope:

feat(cron): add bounded retry path for transient delivery failures #13946 (feat(cron): add bounded retry path for transient delivery failures, 503 +/16 -) introduces a transient-vs-terminal classification surface and threads it through cron/jobs.py and cron/scheduler.py. Useful direction but bigger - and it operates on the in-memory delivery result, not the cached output file, so a process restart between failure and retry loses the message.
Cron scheduler reliability: observability, retry, audit trail #10443 (Cron scheduler reliability: observability, retry, audit trail, 441 +/84 -) adds a JSONL event log, a stale one-shot detector, retry policy, and a separate audit trail. Multi-concern PR.

This PR is intentionally minimal: a single retry envelope on the existing job record, drained from the existing tick loop, against the existing cached output file. ~190 production lines, additive only, no schema migration needed (existing jobs have last_delivery_retry default to None). Should land cleanly alongside or before either of the two larger ones.

Test plan

pytest tests/cron/test_jobs.py tests/cron/test_delivery_retry.py -v -p no:xdist -o addopts='' - 78 passed (64 pre-existing + 14 new). Output in TEST_OUTPUT.txt.
Verified task_status=ok semantics unchanged: mark_job_run only sees success, the retry envelope is set after mark_job_run returns, and the budget-exhausted path leaves last_status=ok and only updates last_delivery_error.
Verified 17 pre-existing tests/cron/test_scheduler.py failures are present without this change too - they relate to test-only deps (toolset registry / wake gate) and are not introduced here.
Manual delivery-failure simulation against a live Telegram bot - I would like to do this in CI / staging if a maintainer can point me at the right lane.

Out of scope

Per-platform classification of "is this error worth retrying" (5xx / connection-reset vs 4xx / unknown chat). Currently every failure is retried; the 3-attempt cap bounds the cost. feat(cron): add bounded retry path for transient delivery failures #13946 is the right place for that classifier.
Surfacing retry state in hermes cron list. The data is on the job record, but the renderer change belongs in hermes_cli/cron_cli.py and would push this PR past its narrow scope.
Persisting retry counters across hermes update reboots beyond what jobs.json already gives us.

…e agent (NousResearch#8846) When a cron job's task succeeds but delivery to the messaging platform fails (Telegram 502 / schema validation, Slack rate-limit, transient network), `mark_job_run` already records `last_status=ok` and tracks `last_delivery_error` separately. What was missing was the retry: the delivery would be lost on the floor until the next scheduled run, which for daily/weekly jobs means the user never sees the result. Adds a delivery-only retry path keyed off the cached output file under `~/.hermes/cron/output/<job_id>/<timestamp>.md`: cron/jobs.py - last_delivery_retry envelope on the job: output_file, attempts, next_attempt_at, last_error - DELIVERY_RETRY_BACKOFF_SECONDS = [60, 300, 1800], 3 attempts max - schedule_delivery_retry / mark_delivery_retry_attempt / clear_delivery_retry / get_due_delivery_retries cron/scheduler.py - tick() drains due retries before the regular due-jobs scan, via _drain_due_delivery_retries -> _deliver_result with the cached payload. The agent is never re-executed. - _process_job calls schedule_delivery_retry on the *first* delivery failure for a successful task. The retry pass runs first because it is cheap (no LLM, no tool calls) and a single bad envelope cannot break the rest of the tick. If the cached output file is missing (user pruned it), we surrender immediately rather than spinning - the file is the source of truth. After the budget is exhausted the envelope is cleared but `last_delivery_error` stays populated so `hermes cron list` continues to surface the failure to the user.

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management labels Apr 27, 2026

alt-glitch mentioned this pull request Apr 28, 2026

[Feature]: Cron job stage persistence + partial retry mechanism - Real world case: 2 million tokens wasted due to push failures #17071

Open

1 task

alt-glitch mentioned this pull request May 15, 2026

fix(cron): retry failed delivery sends #26268

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cron): retry delivery from cached output instead of re-running the agent (#8846)#16645

fix(cron): retry delivery from cached output instead of re-running the agent (#8846)#16645
prasadus92 wants to merge 1 commit into
NousResearch:mainfrom
prasadus92:prasadus92/fix-8846-cron-delivery-retry

prasadus92 commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

prasadus92 commented Apr 27, 2026

Summary

Design

Files changed

Scope narrowing vs #13946 and #10443

Test plan

Out of scope

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants