Summary
The cron scheduler (cron/scheduler.py) currently has no retry mechanism for delivery failures. When a scheduled job's result cannot be delivered (e.g. transient Telegram API timeout, mobile network instability), the error is recorded in last_delivery_error and the result is never sent. The user hears nothing until the next scheduled run.
For users running Hermes on laptops or other intermittently-connected devices, this can result in multiple missed notifications in a row without the user being aware until they manually check ~/.hermes/cron/output/.
This issue proposes adding a retry mechanism. I'd like upstream input on which of the three approaches below best fits the project direction before preparing a PR.
Current behavior
From cron/scheduler.py:938-977 and cron/scheduler.py:295-363 (approximate positions; line numbers shift with upstream churn):
1. advance_next_run() — next run time is advanced first (crash-safe)
2. run_job() — agent executes
3. save_job_output() — always saved to ~/.hermes/cron/output/{job_id}/
4. [SILENT] check — if [SILENT], delivery is skipped
5. _deliver_result() — Telegram delivery
- live adapter path — future.result(timeout=60)
- standalone fallback — asyncio.run(), timeout=30
6. mark_job_run() — last_delivery_error is recorded
Key observations:
- Output is always persisted on disk regardless of delivery success, so the information itself is not lost.
- last_delivery_error is set on failure and cleared on success, but nothing consumes it: the next tick proceeds with fresh jobs and the failed delivery is never retried.
- There is no delivery queue, no backoff, and no retry loop anywhere in cron/scheduler.py.
Reproduction scenario
1. Configure a cron job with Telegram delivery on a mobile laptop.
2. Put the laptop on an unstable network (train commute, moving between Wi-Fi/cellular, etc.).
3. A scheduled run completes; _deliver_result()'s live-adapter future.result(timeout=60) times out; the standalone fallback also times out.
4. last_delivery_error = "timed out" is set; the output remains in ~/.hermes/cron/output/{job_id}/ but is never delivered.
5. The network recovers 2 minutes later. The user sees nothing until the next scheduled run (N minutes/hours later), and the earlier output is never surfaced.
6. The user cannot tell whether "no notification" means "job found nothing" or "job found something but delivery failed".
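The failure mode above can be illustrated with a small stand-alone simulation. This is only a sketch: slow_send, deliver_once and run_tick are hypothetical names, and a thread pool stands in for the live adapter; the real _deliver_result() is not reproduced here. It shows a future.result(timeout=...) call timing out and the error being recorded with nothing left to consume it.

```python
import concurrent.futures
import time
from typing import Dict, Optional


def slow_send(delay_s: float) -> str:
    """Hypothetical stand-in for a network send that takes delay_s seconds."""
    time.sleep(delay_s)
    return "sent"


def deliver_once(delay_s: float, timeout_s: float) -> Optional[str]:
    """Return None on success, or an error string on timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_send, delay_s)
    try:
        # Same shape as the live-adapter wait described above, with a short timeout.
        future.result(timeout=timeout_s)
        return None
    except concurrent.futures.TimeoutError:
        return "timed out"
    finally:
        # Do not block on the still-running send; the caller has already given up.
        pool.shutdown(wait=False)


def run_tick(job: Dict[str, Optional[str]], delay_s: float, timeout_s: float) -> None:
    """Record the delivery outcome on the job; no retry is scheduled."""
    job["last_delivery_error"] = deliver_once(delay_s, timeout_s)


if __name__ == "__main__":
    job: Dict[str, Optional[str]] = {"last_delivery_error": None}
    run_tick(job, delay_s=0.3, timeout_s=0.05)  # send is slower than the timeout
    print(job)
```

After the failing tick the job carries an error string, but the output that failed to send is only on disk. A later successful tick clears the error without ever delivering the earlier result.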
Proposed approaches
I see three approaches, presented in order of my preference. I'm aware option C is the most invasive; I'm raising this issue to learn which direction fits the project best before committing to an implementation.
Option C (preferred): in-scheduler retry queue
Add a persistent retry queue to the scheduler itself, covering all cron jobs regardless of the skill that produced them.
Sketch:
- Persist a queue at ~/.hermes/cron/delivery_queue.json with entries {job_id, output_path, retry_count, next_retry_at, last_error}.
- On delivery failure in _deliver_result(), enqueue an entry instead of just recording last_delivery_error.
- On each _tick(), check the queue first and retry entries whose next_retry_at has passed.
- Exponential backoff (e.g. 2m → 5m → 15m → 30m, max 4 attempts), then give up and record the final error in last_delivery_error.
- Unique entry IDs to prevent duplicate sends if a retry succeeds concurrently with a new tick.
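To make the sketch above concrete, here is one possible shape for the queue entry and the backoff bookkeeping. It is illustrative only: QueueEntry, schedule_next, due_entries, the entry_id field and the BACKOFF_S table are assumptions of this example, and the delays are simply the example values from the list, not a proposed final policy.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List

# Delay before retry 1..4 in seconds (2m, 5m, 15m, 30m), copied from the example schedule.
BACKOFF_S = [120, 300, 900, 1800]


@dataclass
class QueueEntry:
    job_id: str
    output_path: str
    last_error: str
    retry_count: int = 0
    next_retry_at: float = 0.0
    # Hypothetical unique ID so a retry can be de-duplicated against a concurrent tick.
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def schedule_next(entry: QueueEntry, now: float) -> bool:
    """Schedule the next retry; return False once all attempts are used up."""
    if entry.retry_count >= len(BACKOFF_S):
        return False
    entry.next_retry_at = now + BACKOFF_S[entry.retry_count]
    entry.retry_count += 1
    return True


def due_entries(queue: List[QueueEntry], now: float) -> List[QueueEntry]:
    """Entries whose next_retry_at has passed, to be retried at the start of a tick."""
    return [e for e in queue if e.next_retry_at <= now]


def save_queue(path: Path, queue: List[QueueEntry]) -> None:
    """Persist the queue as a JSON list (no locking; this sketch is single-threaded)."""
    path.write_text(json.dumps([asdict(e) for e in queue], indent=2), encoding="utf-8")


def load_queue(path: Path) -> List[QueueEntry]:
    """Load the queue, treating a missing file as empty."""
    if not path.exists():
        return []
    return [QueueEntry(**raw) for raw in json.loads(path.read_text(encoding="utf-8"))]
```

In this shape, a delivery failure would create an entry and call schedule_next once. Each failed retry calls it again, and a False return is the "give up" signal at which the final error would be written to last_delivery_error.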
Trade-offs:
✅ Single root fix for all cron jobs and all skills.
✅ Deterministic, LLM-independent behavior.
✅ Consistent UX: "if the job fires, you eventually hear about it."
⚠️ Adds retry-policy parameters that require upstream decisions (backoff curve, max attempts, how to surface "gave up" to the user).
⚠️ Testing cost is higher (state machine).
Option A (fallback): per-skill "check last delivery status" step
Extend skills/monitoring/github-release-watch/scripts/check_releases.py (or any skill that cares about delivery reliability) with a --check-delivery-status subcommand that reads ~/.hermes/cron/jobs.json, detects a non-null last_delivery_error for its own job_id, and prepends the previous output to the current run's report.
The SKILL.md adds a Step 0 that calls this subcommand before running the main procedure.
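A rough sketch of what the subcommand's core lookup could look like follows. The jobs.json schema is an assumption here (a JSON list of objects with "id" and "last_delivery_error" keys), as is the function name find_undelivered_output; the real file layout may differ and would need to be checked against the scheduler.

```python
import json
from pathlib import Path
from typing import Optional


def find_undelivered_output(jobs_file: Path, output_root: Path, job_id: str) -> Optional[str]:
    """Return the newest saved output for job_id if its last delivery failed, else None."""
    # ASSUMPTION: jobs.json is a list of job objects keyed by "id"; not confirmed upstream.
    jobs = json.loads(jobs_file.read_text(encoding="utf-8"))
    job = next((j for j in jobs if j.get("id") == job_id), None)
    if job is None or not job.get("last_delivery_error"):
        return None  # unknown job, or last delivery succeeded

    job_dir = output_root / job_id
    if not job_dir.is_dir():
        return None
    files = [p for p in job_dir.iterdir() if p.is_file()]
    if not files:
        return None

    # Pick the most recently modified output file for this job.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    return newest.read_text(encoding="utf-8")


if __name__ == "__main__":
    cron_dir = Path.home() / ".hermes" / "cron"
    jobs_path = cron_dir / "jobs.json"
    if jobs_path.exists():
        # "example-job-id" is a placeholder, not a real job.
        previous = find_undelivered_output(jobs_path, cron_dir / "output", "example-job-id")
        if previous is not None:
            print("Previous run was not delivered:\n" + previous)
```

Because the current run's output is saved only after the agent executes, the newest file seen by Step 0 should be the previous run's output, which is the one to prepend.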
Trade-offs:
✅ No Hermes core changes. Scoped to one skill.
✅ Safe to ship as a skill-level improvement first; can be promoted to core later if the pattern proves useful.
⚠️ Relies on the LLM reliably executing the added Step 0 every time. If the model skips it (fatigue, hallucination, modified prompt), the failure is silently missed again.
⚠️ Only helps skills that have been updated to use this pattern. Other cron jobs (agent-driven, other skills) remain exposed.
Option B (not recommended): external recovery daemon
A separate delivery_recovery.py launched via launchd that polls jobs.json for non-null last_delivery_error, reads the most recent output file, and sends it directly through python-telegram-bot — bypassing Hermes entirely.
Trade-offs:
⚠️ Requires duplicating the Telegram bot token outside Hermes config.
⚠️ Independent state file (delivery_recovery_state.json) that must stay consistent with Hermes's view.
⚠️ Feels foreign to Hermes's in-tree philosophy.
I mention this for completeness but do not propose it as an upstream contribution.
Design-alignment concern: interaction with #13021
PR #13021 (fix(cron): run due jobs in parallel to prevent serial tick starvation) recently parallelized tick() via ThreadPoolExecutor. Option C needs to be designed with this in mind:
- Reads and writes of the retry queue would need to follow the threading.Lock pattern used for jobs.json in #13021 (see advance_next_run, mark_job_run).
- ContextVars-based session/delivery state (gateway/session_context.py, introduced in #13021) must be re-established when a retry is dispatched from a later tick. The original job's delivery target must be preserved with the queue entry, not re-resolved at retry time.
Happy to prototype either Option A or Option C depending on upstream direction.
Related
- fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback (addresses duplicate delivery on the cron path).
- fix(gateway): prevent duplicate final send when only cosmetic edit failed (addresses duplicate delivery on the gateway/Telegram path).
These two PRs close the "same message twice" failure mode. This issue addresses the inverse failure mode: "message never arrives".
Question for maintainers
Is Option C (in-scheduler retry queue) a direction the project would accept, given the added complexity and the recent churn in the cron subsystem from #13021? If not, would Option A (per-skill delivery-status step) be a welcome contribution as a narrower, lower-risk stepping stone?
Happy to prepare the corresponding PR once the direction is clear.