[Feature]: Cron delivery retry mechanism for transient network failures #13566

@VTRiot

Summary

The cron scheduler (cron/scheduler.py) currently has no retry mechanism for delivery failures. When a scheduled job's result cannot be delivered (e.g. transient Telegram API timeout, mobile network instability), the error is recorded in last_delivery_error and the result is silently dropped: it is never delivered, and the user hears nothing until the next scheduled run.

For users running Hermes on laptops or other intermittently-connected devices, this can result in multiple missed notifications in a row without the user being aware until they manually check ~/.hermes/cron/output/.

This issue proposes adding a retry mechanism. I'd like upstream input on which of the three approaches below best fits the project direction before preparing a PR.

Current behavior

From cron/scheduler.py:938-977 and cron/scheduler.py:295-363 (approximate positions; line numbers shift with upstream churn):

1. advance_next_run()       — next run time is advanced first (crash-safe)
2. run_job()                — agent executes
3. save_job_output()        — always saved to ~/.hermes/cron/output/{job_id}/
4. [SILENT] check           — if [SILENT], delivery is skipped
5. _deliver_result()        — Telegram delivery
   - live adapter path      — future.result(timeout=60)
   - standalone fallback    — asyncio.run(), timeout=30
6. mark_job_run()           — last_delivery_error is recorded

Key observations:

  • Output is always persisted on disk regardless of delivery success, so the information itself is not lost.
  • last_delivery_error is set on failure and cleared on success, but nothing consumes it — the next tick proceeds with fresh jobs and the failed delivery is never retried.
  • There is no delivery queue, no backoff, no retry loop anywhere in cron/scheduler.py.

Reproduction scenario

  1. Configure a cron job with Telegram delivery on a mobile laptop.
  2. Put the laptop on an unstable network (train commute, moving between Wi-Fi/cellular, etc.).
  3. A scheduled run completes; _deliver_result()'s live-adapter future.result(timeout=60) times out; the standalone fallback also times out.
  4. last_delivery_error = "timed out" is set; the output remains in ~/.hermes/cron/output/{job_id}/ but is never delivered.
  5. The network recovers 2 minutes later, but the user sees nothing until the next scheduled run (N minutes/hours later), and the earlier output is never surfaced.

The user cannot tell whether "no notification" means "job found nothing" or "job found something but delivery failed".

Proposed approaches

I see three approaches, presented in order of my preference. I'm aware option C is the most invasive; I'm raising this issue to learn which direction fits the project best before committing to an implementation.

Option C (preferred): in-scheduler retry queue

Add a persistent retry queue to the scheduler itself, covering all cron jobs regardless of the skill that produced them.

Sketch:

  • Persist a queue at ~/.hermes/cron/delivery_queue.json with entries {job_id, output_path, retry_count, next_retry_at, last_error}.
  • On delivery failure in _deliver_result(), enqueue an entry instead of just recording last_delivery_error.
  • On each _tick(), check the queue first and retry entries whose next_retry_at has passed.
  • Exponential backoff (e.g. 2m → 5m → 15m → 30m, max 4 attempts), then give up and record the final error in last_delivery_error.
  • Unique entry IDs to prevent duplicate sends if a retry succeeds concurrently with a new tick.
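The queue entry and backoff policy in the sketch above could look roughly like the following. This is a minimal in-memory sketch, not existing Hermes code: the names QueueEntry, enqueue_failure, due_entries, and record_retry_failure are hypothetical, and persistence to delivery_queue.json is left out. It assumes "max 4 attempts" means four retries at the listed delays.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Delays before each retry, taken from the proposal: 2m, 5m, 15m, 30m.
# ASSUMPTION: "max 4 attempts" is read as four retries, one per delay.
BACKOFF_MINUTES = (2, 5, 15, 30)


@dataclass
class QueueEntry:
    """One undelivered result. Field names follow the proposed queue schema;
    entry_id is the unique ID meant to prevent duplicate sends."""
    job_id: str
    output_path: str
    last_error: str
    next_retry_at: datetime
    retry_count: int = 0
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def enqueue_failure(queue: list[QueueEntry], job_id: str, output_path: str,
                    error: str, now: datetime) -> QueueEntry:
    """Record a failed delivery and schedule the first retry."""
    entry = QueueEntry(job_id=job_id, output_path=output_path, last_error=error,
                       next_retry_at=now + timedelta(minutes=BACKOFF_MINUTES[0]))
    queue.append(entry)
    return entry


def due_entries(queue: list[QueueEntry], now: datetime) -> list[QueueEntry]:
    """Entries whose next_retry_at has passed; a tick would retry these first."""
    return [e for e in queue if e.next_retry_at <= now]


def record_retry_failure(queue: list[QueueEntry], entry: QueueEntry,
                         error: str, now: datetime) -> bool:
    """Reschedule after a failed retry. Returns False once retries are exhausted,
    at which point the caller would write the final error to last_delivery_error."""
    entry.retry_count += 1
    entry.last_error = error
    if entry.retry_count >= len(BACKOFF_MINUTES):
        queue.remove(entry)
        return False
    entry.next_retry_at = now + timedelta(minutes=BACKOFF_MINUTES[entry.retry_count])
    return True
```

On a successful retry the caller would simply remove the entry from the queue and clear last_delivery_error. The backoff tuple is kept as a single constant so that the curve and attempt count stay easy to change once upstream settles the policy.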

Trade-offs:

  • ✅ Single root fix for all cron jobs and all skills.
  • ✅ Deterministic, LLM-independent behavior.
  • ✅ Consistent UX: "if the job fires, you eventually hear about it."
  • ⚠️ Introduces a new persisted state file that needs schema design, versioning, and compatibility with concurrent ticks (relevant after fix(cron): run due jobs in parallel to prevent serial tick starvation #13021 parallelization).
  • ⚠️ Adds retry-policy parameters that require upstream decisions (backoff curve, max attempts, how to surface "gave up" to the user).
  • ⚠️ Testing cost is higher (state machine).

Option A (fallback): per-skill "check last delivery status" step

Extend skills/monitoring/github-release-watch/scripts/check_releases.py (or any skill that cares about delivery reliability) with a --check-delivery-status subcommand that reads ~/.hermes/cron/jobs.json, detects a non-null last_delivery_error for its own job_id, and prepends the previous output to the current run's report.

The SKILL.md adds a Step 0 that calls this subcommand before running the main procedure.
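As a rough illustration of what the subcommand's core could do, here is a sketch under stated assumptions. The layout of jobs.json is not described in this issue, so the shape used here (a top-level "jobs" list whose items carry "id" and "last_delivery_error") is an assumption, and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def last_delivery_error(jobs_path: str, job_id: str) -> Optional[str]:
    """Return the recorded delivery error for job_id, or None if there is none.

    ASSUMPTION: jobs.json looks like
    {"jobs": [{"id": "...", "last_delivery_error": "..." or null}, ...]}.
    The real file may be structured differently.
    """
    data = json.loads(Path(jobs_path).read_text(encoding="utf-8"))
    for job in data.get("jobs", []):
        if job.get("id") == job_id:
            return job.get("last_delivery_error")
    return None


def prepend_missed_output(current_report: str, previous_output: str,
                          error: Optional[str]) -> str:
    """Put the undelivered previous output in front of the current report."""
    if not error:
        return current_report
    header = f"[Previous run was not delivered: {error}]"
    return f"{header}\n{previous_output}\n\n{current_report}"
```

Keeping the check in a plain script function means the only LLM-dependent part is whether Step 0 gets invoked at all, which is the weakness noted in the trade-offs below.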

Trade-offs:

  • ✅ No Hermes core changes. Scoped to one skill.
  • ✅ Safe to ship as a skill-level improvement first; can be promoted to core later if the pattern proves useful.
  • ⚠️ Relies on the LLM reliably executing the added Step 0 every time. If the model skips it (fatigue, hallucination, modified prompt), the failure is silently missed again.
  • ⚠️ Only helps skills that have been updated to use this pattern. Other cron jobs (agent-driven, other skills) remain exposed.

Option B (not recommended): external recovery daemon

A separate delivery_recovery.py launched via launchd that polls jobs.json for non-null last_delivery_error, reads the most recent output file, and sends it directly through python-telegram-bot — bypassing Hermes entirely.

Trade-offs:

  • ⚠️ Requires duplicating the Telegram bot token outside Hermes config.
  • ⚠️ Independent state file (delivery_recovery_state.json) that must stay consistent with Hermes's view.
  • ⚠️ Feels foreign to Hermes's in-tree philosophy.

I mention this for completeness but do not propose it as an upstream contribution.

Design-alignment concern: interaction with #13021

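PR #13021 recently parallelized tick() via ThreadPoolExecutor. Option C needs to be designed with this in mind: the persisted queue has to stay consistent under concurrent ticks, and a retry must not race a new tick into sending the same result twice.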
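One possible way to keep a persisted queue consistent when several worker threads touch it is to funnel every read-modify-write through a single lock and replace the file atomically. The sketch below is an assumption-laden illustration, not existing Hermes code: update_queue and claim_due are hypothetical names, entries are plain dicts, and next_retry_at is stored as a numeric timestamp for simplicity.

```python
import json
import os
import tempfile
import threading
from pathlib import Path

# Guards the whole read-modify-write cycle. This only protects threads within
# one process; multiple processes would need a file lock instead.
_queue_lock = threading.Lock()


def update_queue(path, mutate):
    """Load the queue, apply mutate(entries) -> entries, and persist atomically."""
    path = Path(path)
    with _queue_lock:
        entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
        entries = mutate(entries)
        # Write to a temp file in the same directory, then rename over the
        # target so readers never observe a half-written file.
        fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(entries, fh)
        os.replace(tmp, path)
        return entries


def claim_due(path, now_ts):
    """Remove and return entries that are due, so each is retried by one thread only."""
    claimed = []

    def mutate(entries):
        remaining = []
        for entry in entries:
            (claimed if entry["next_retry_at"] <= now_ts else remaining).append(entry)
        return remaining

    update_queue(path, mutate)
    return claimed
```

Claiming by removal keeps the example short, but a crash between claiming and delivering would lose the entry. A real design might instead mark entries as in-flight with a lease time, which is one of the schema questions Option C would need to settle.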

Happy to prototype either Option A or Option C depending on upstream direction.

Related

These two PRs close the "same message twice" failure mode. This issue addresses the inverse failure mode: "message never arrives".

Question for maintainers

Is Option C (in-scheduler retry queue) a direction the project would accept, given the added complexity and the recent churn in the cron subsystem from #13021? If not, would Option A (per-skill delivery-status step) be a welcome contribution as a narrower, lower-risk stepping stone?

Happy to prepare the corresponding PR once the direction is clear.

Labels: comp/cron (Cron scheduler and job management), type/feature (New feature or request)
