[Feature]: Cron delivery retry mechanism for transient network failures

## Summary

The cron scheduler (`cron/scheduler.py`) currently has **no retry mechanism** for delivery failures. When a scheduled job's result cannot be delivered (e.g. a transient Telegram API timeout or mobile network instability), the error is recorded in `last_delivery_error` and the result is never re-sent. The user hears nothing until the next scheduled run, which delivers only its own output.

For users running Hermes on laptops or other intermittently-connected devices, this can result in multiple missed notifications in a row without the user being aware until they manually check `~/.hermes/cron/output/`.

This issue proposes adding a retry mechanism. I'd like upstream input on which of the three approaches below best fits the project direction before preparing a PR.

## Current behavior

From `cron/scheduler.py:938-977` and `cron/scheduler.py:295-363` (approximate positions; line numbers shift with upstream churn):

```
1. advance_next_run()       — next run time is advanced first (crash-safe)
2. run_job()                — agent executes
3. save_job_output()        — always saved to ~/.hermes/cron/output/{job_id}/
4. [SILENT] check           — if [SILENT], delivery is skipped
5. _deliver_result()        — Telegram delivery
   - live adapter path      — future.result(timeout=60)
   - standalone fallback    — asyncio.run(), timeout=30
6. mark_job_run()           — last_delivery_error is recorded
```

Key observations:

- **Output is always persisted on disk** regardless of delivery success, so the information itself is not lost.
- **`last_delivery_error`** is set on failure and cleared on success, but **nothing consumes it** — the next tick proceeds with fresh jobs and the failed delivery is never retried.
- There is no delivery queue, no backoff, no retry loop anywhere in `cron/scheduler.py`.

## Reproduction scenario

1. Configure a cron job with Telegram delivery on a mobile laptop.
2. Put the laptop on an unstable network (train commute, moving between Wi-Fi/cellular, etc.).
3. A scheduled run completes; `_deliver_result()`'s live-adapter `future.result(timeout=60)` times out; the standalone fallback also times out.
4. `last_delivery_error = "timed out"` is set; the output remains in `~/.hermes/cron/output/{job_id}/` but is never delivered.
5. Network recovers 2 minutes later. The user sees nothing until the next scheduled run (N minutes/hours later), and the earlier output is never surfaced.

The user cannot tell whether "no notification" means "job found nothing" or "job found something but delivery failed".

## Proposed approaches

I see three approaches, presented in order of my preference. I'm aware Option C is the most invasive; I'm raising this issue to learn which direction fits the project best before committing to an implementation.

### Option C (preferred): in-scheduler retry queue

Add a persistent retry queue to the scheduler itself, covering all cron jobs regardless of the skill that produced them.

**Sketch**:
- Persist a queue at `~/.hermes/cron/delivery_queue.json` with entries `{job_id, output_path, retry_count, next_retry_at, last_error}`.
- On delivery failure in `_deliver_result()`, enqueue an entry instead of just recording `last_delivery_error`.
- On each `_tick()`, check the queue first and retry entries whose `next_retry_at` has passed.
- Backoff with increasing delays (e.g. 2m → 5m → 15m → 30m, at most 4 retry attempts), then give up and record the final error in `last_delivery_error`.
- Unique entry IDs to prevent duplicate sends if a retry succeeds concurrently with a new tick.
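
To make the sketch above concrete, here is a minimal, in-memory illustration of the queue entry and the backoff bookkeeping. It is not taken from `cron/scheduler.py`: `QueueEntry`, `schedule_retry`, `due_entries`, and the `entry_id` field are hypothetical names, and the delays are the example values from the list, not a proposed final policy.

```python
import uuid
from dataclasses import dataclass, field
from typing import List

# Delay before each retry, in seconds: 2m, 5m, 15m, 30m (example values only).
BACKOFF_SECONDS = (120, 300, 900, 1800)


@dataclass
class QueueEntry:
    """One undelivered result. Field names mirror the sketch; entry_id is an assumption."""
    job_id: str
    output_path: str
    last_error: str = ""
    retry_count: int = 0
    next_retry_at: float = 0.0
    # Hypothetical unique ID so a later tick can detect an already-sent entry.
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def schedule_retry(entry: QueueEntry, error: str, now: float) -> bool:
    """Record a failed attempt and plan the next one.

    Returns False once all retries are used up, at which point the caller
    would drop the entry and write the final error to last_delivery_error.
    """
    entry.last_error = error
    if entry.retry_count >= len(BACKOFF_SECONDS):
        return False
    entry.next_retry_at = now + BACKOFF_SECONDS[entry.retry_count]
    entry.retry_count += 1
    return True


def due_entries(entries: List[QueueEntry], now: float) -> List[QueueEntry]:
    """Entries whose next_retry_at has passed; a tick would retry these first."""
    return [e for e in entries if e.next_retry_at <= now]
```

With this bookkeeping, the initial failure schedules the first retry 2 minutes out, and the fourth failed retry is the last one before the entry is abandoned. Persistence to `delivery_queue.json` and locking are deliberately left out here.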

**Trade-offs**:
- ✅ Single root fix for all cron jobs and all skills.
- ✅ Deterministic, LLM-independent behavior.
- ✅ Consistent UX: "if the job fires, you eventually hear about it."
- ⚠️ Introduces a new persisted state file that needs schema design, versioning, and compatibility with concurrent ticks (relevant after #13021 parallelization).
- ⚠️ Adds retry-policy parameters that require upstream decisions (backoff curve, max attempts, how to surface "gave up" to the user).
- ⚠️ Testing cost is higher (state machine).

### Option A (fallback): per-skill "check last delivery status" step

Extend `skills/monitoring/github-release-watch/scripts/check_releases.py` (or any skill that cares about delivery reliability) with a `--check-delivery-status` subcommand that reads `~/.hermes/cron/jobs.json`, detects a non-null `last_delivery_error` for its own `job_id`, and prepends the previous output to the current run's report.

The SKILL.md adds a Step 0 that calls this subcommand before running the main procedure.
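
A rough sketch of what the subcommand's core logic could look like follows. The layout of `jobs.json` is an assumption on my part (a list of job objects, or an object with a `jobs` list, each carrying `id` and `last_delivery_error`), and `find_delivery_error` and `build_report` are hypothetical helper names, so treat this as an illustration of the idea and not the actual file format.

```python
import json
from pathlib import Path
from typing import Optional


def find_delivery_error(jobs_path: Path, job_id: str) -> Optional[str]:
    """Return the recorded delivery error for job_id, or None if there is none."""
    try:
        data = json.loads(jobs_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    # ASSUMPTION: jobs.json is a list of job dicts, or {"jobs": [...]}.
    jobs = data.get("jobs", []) if isinstance(data, dict) else data
    for job in jobs:
        # ASSUMPTION: each job stores its identifier under the key "id".
        if isinstance(job, dict) and job.get("id") == job_id:
            return job.get("last_delivery_error") or None
    return None


def build_report(current: str, error: Optional[str], previous: Optional[str]) -> str:
    """Prepend the undelivered previous output to the current report, if any."""
    if not error or previous is None:
        return current
    header = f"Previous run was not delivered ({error}); its output follows.\n\n"
    return header + previous + "\n\n---\n\n" + current
```

Step 0 would call `find_delivery_error` for the skill's own job and, on a non-null result, read the most recent file under `~/.hermes/cron/output/{job_id}/` and pass its text as `previous`.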

**Trade-offs**:
- ✅ No Hermes core changes. Scoped to one skill.
- ✅ Safe to ship as a skill-level improvement first; can be promoted to core later if the pattern proves useful.
- ⚠️ Relies on the LLM reliably executing the added Step 0 every time. If the model skips it (fatigue, hallucination, modified prompt), the failure is silently missed again.
- ⚠️ Only helps skills that have been updated to use this pattern. Other cron jobs (agent-driven, other skills) remain exposed.

### Option B (not recommended): external recovery daemon

A separate `delivery_recovery.py` launched via `launchd` that polls `jobs.json` for non-null `last_delivery_error`, reads the most recent output file, and sends it directly through `python-telegram-bot` — bypassing Hermes entirely.

**Trade-offs**:
- ⚠️ Requires duplicating the Telegram bot token outside Hermes config.
- ⚠️ Independent state file (`delivery_recovery_state.json`) that must stay consistent with Hermes's view.
- ⚠️ Feels foreign to Hermes's in-tree philosophy.

I mention this for completeness but do **not** propose it as an upstream contribution.

## Design-alignment concern: interaction with #13021

PR #13021 recently parallelized `tick()` via `ThreadPoolExecutor`. Option C needs to be designed with this in mind:

- The retry queue's read-modify-write cycle must be protected by the same `threading.Lock` pattern used for `jobs.json` in #13021 (see `advance_next_run`, `mark_job_run`).
- ContextVars-based session/delivery state (`gateway/session_context.py`, introduced in #13021) must be re-established when a retry is dispatched from a later tick — the original job's delivery target must be preserved with the queue entry, not re-resolved at retry time.
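
As a rough illustration of both points, the sketch below serializes the queue's read-modify-write cycle behind a lock and stores the delivery target inside the entry. I have not copied this from #13021: `_queue_lock`, `update_queue`, `enqueue`, and the shape of `delivery_target` are all hypothetical, and a `threading.Lock` only protects threads within one process.

```python
import json
import os
import threading
from pathlib import Path
from typing import Callable, List

# Hypothetical module-level lock guarding every access to the queue file.
_queue_lock = threading.Lock()


def update_queue(path: Path, mutate: Callable[[List[dict]], List[dict]]) -> List[dict]:
    """Load the queue, apply mutate, and write it back while holding the lock."""
    with _queue_lock:
        try:
            entries = json.loads(path.read_text())
        except FileNotFoundError:
            entries = []
        entries = mutate(entries)
        # Write to a sibling temp file, then rename over the original, so a
        # crash mid-write never leaves a truncated queue file behind.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(entries, indent=2))
        os.replace(tmp, path)
        return entries


def enqueue(path: Path, job_id: str, output_path: str,
            delivery_target: dict, error: str) -> List[dict]:
    """Append a failed delivery, capturing the target resolved at failure time."""
    entry = {
        "job_id": job_id,
        "output_path": output_path,
        # Stored with the entry so a later tick does not re-resolve it.
        "delivery_target": delivery_target,
        "retry_count": 0,
        "last_error": error,
    }
    return update_queue(path, lambda entries: entries + [entry])
```

A retry dispatched from a later tick would then rebuild its session/delivery context from `entry["delivery_target"]` before sending.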

Happy to prototype either Option A or Option C depending on upstream direction.

## Related

- PR #13495 (this author): `fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback` — addresses duplicate delivery on the cron path.
- PR #13542 (this author): `fix(gateway): prevent duplicate final send when only cosmetic edit failed` — addresses duplicate delivery on the gateway/Telegram path.

These two PRs close the "same message twice" failure mode. This issue addresses the inverse failure mode: "message never arrives".

## Question for maintainers

Is Option C (in-scheduler retry queue) a direction the project would accept, given the added complexity and the recent churn in the cron subsystem from #13021? If not, would Option A (per-skill delivery-status step) be a welcome contribution as a narrower, lower-risk stepping stone?

Happy to prepare the corresponding PR once the direction is clear.

