Skip to content

fix(cron): retry delivery from cached output instead of re-running the agent (#8846)#16645

Open
prasadus92 wants to merge 1 commit into
NousResearch:mainfrom
prasadus92:prasadus92/fix-8846-cron-delivery-retry
Open

fix(cron): retry delivery from cached output instead of re-running the agent (#8846)#16645
prasadus92 wants to merge 1 commit into
NousResearch:mainfrom
prasadus92:prasadus92/fix-8846-cron-delivery-retry

Conversation

@prasadus92

Copy link
Copy Markdown
Contributor

Summary

When a cron job's task succeeds but delivery to the messaging platform fails (Telegram 502, schema validation, Slack rate-limit, transient network), the agent's output is already saved to ~/.hermes/cron/output/<job_id>/<timestamp>.md and mark_job_run already sets last_status=ok with last_delivery_error tracked separately. What was missing was the retry: the delivery would sit on the floor until the next scheduled run, which for daily / weekly jobs means the user never sees the result.

This PR adds a delivery-only retry path keyed off the cached output file. The agent is never re-executed - the saved .md is the source of truth.

Fixes #8846.

Design

Per-job retry envelope persisted in jobs.json:

"last_delivery_retry": {
  "output_file": "/Users/.../hermes/cron/output/abc123/2026-04-26_18-30-00.md",
  "attempts": 1,
  "next_attempt_at": "2026-04-26T18:35:00+00:00",
  "last_error": "telegram returned 502"
}
  • 3 attempts max, exponential backoff at 60 s, 300 s, 1800 s (5 min, 30 min). Constants live in cron/jobs.py::DELIVERY_RETRY_BACKOFF_SECONDS so the policy is one edit away.
  • tick() drains due retries before the regular due-jobs scan, via a new _drain_due_delivery_retries helper that reads the cached file and calls _deliver_result. Cheap path (no LLM, no tool calls, no run_job).
  • After the budget is exhausted the envelope is cleared but last_delivery_error stays populated, so hermes cron list continues to surface the failure.
  • If the cached output file is missing (user pruned ~/.hermes/cron/output/), we surrender immediately rather than spinning - the file is the source of truth and we can't reconstruct it.

Files changed

  • cron/jobs.py

    • New last_delivery_retry field on the job dict (initialised to None in create_job).
    • DELIVERY_RETRY_BACKOFF_SECONDS = [60, 300, 1800], DELIVERY_RETRY_MAX_ATTEMPTS = 3.
    • schedule_delivery_retry, mark_delivery_retry_attempt, clear_delivery_retry, get_due_delivery_retries - all under the existing _jobs_file_lock, same pattern as mark_job_run.
  • cron/scheduler.py

    • _process_job calls schedule_delivery_retry on the first delivery failure for a successful task. mark_job_run is unchanged - task_status=ok was already correct for this case.
    • tick() runs _drain_due_delivery_retries first. Returns sum(_results) + retry_count.
  • tests/cron/test_delivery_retry.py (new, 14 cases): backoff schedule, envelope CRUD, due-job selection, drain replays from the cached file, drain advances backoff through to exhaustion, missing-file surrender path.

Scope narrowing vs #13946 and #10443

Both adjacent PRs target cron retry / reliability with broader scope:

This PR is intentionally minimal: a single retry envelope on the existing job record, drained from the existing tick loop, against the existing cached output file. ~190 production lines, additive only, no schema migration needed (existing jobs have last_delivery_retry default to None). Should land cleanly alongside or before either of the two larger ones.

Test plan

  • pytest tests/cron/test_jobs.py tests/cron/test_delivery_retry.py -v -p no:xdist -o addopts='' - 78 passed (64 pre-existing + 14 new). Output in TEST_OUTPUT.txt.
  • Verified task_status=ok semantics unchanged: mark_job_run only sees success, the retry envelope is set after mark_job_run returns, and the budget-exhausted path leaves last_status=ok and only updates last_delivery_error.
  • Verified 17 pre-existing tests/cron/test_scheduler.py failures are present without this change too - they relate to test-only deps (toolset registry / wake gate) and are not introduced here.
  • Manual delivery-failure simulation against a live Telegram bot - I would like to do this in CI / staging if a maintainer can point me at the right lane.

Out of scope

  • Per-platform classification of "is this error worth retrying" (5xx / connection-reset vs 4xx / unknown chat). Currently every failure is retried; the 3-attempt cap bounds the cost. feat(cron): add bounded retry path for transient delivery failures #13946 is the right place for that classifier.
  • Surfacing retry state in hermes cron list. The data is on the job record, but the renderer change belongs in hermes_cli/cron_cli.py and would push this PR past its narrow scope.
  • Persisting retry counters across hermes update reboots beyond what jobs.json already gives us.

…e agent (NousResearch#8846)

When a cron job's task succeeds but delivery to the messaging platform
fails (Telegram 502 / schema validation, Slack rate-limit, transient
network), `mark_job_run` already records `last_status=ok` and tracks
`last_delivery_error` separately. What was missing was the retry: the
delivery would be lost on the floor until the next scheduled run, which
for daily/weekly jobs means the user never sees the result.

Adds a delivery-only retry path keyed off the cached output file under
`~/.hermes/cron/output/<job_id>/<timestamp>.md`:

  cron/jobs.py
    - last_delivery_retry envelope on the job: output_file, attempts,
      next_attempt_at, last_error
    - DELIVERY_RETRY_BACKOFF_SECONDS = [60, 300, 1800], 3 attempts max
    - schedule_delivery_retry / mark_delivery_retry_attempt /
      clear_delivery_retry / get_due_delivery_retries

  cron/scheduler.py
    - tick() drains due retries before the regular due-jobs scan, via
      _drain_due_delivery_retries -> _deliver_result with the cached
      payload. The agent is never re-executed.
    - _process_job calls schedule_delivery_retry on the *first* delivery
      failure for a successful task.

The retry pass runs first because it is cheap (no LLM, no tool calls)
and a single bad envelope cannot break the rest of the tick. If the
cached output file is missing (user pruned it), we surrender immediately
rather than spinning - the file is the source of truth.

After the budget is exhausted the envelope is cleared but
`last_delivery_error` stays populated so `hermes cron list` continues
to surface the failure to the user.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management labels Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron: delivery failure should not block or fail the main task

2 participants