Summary
When a cron job's LLM API call fails (e.g. timeout, retries exhausted), the job's last_status is incorrectly set to "ok" and no error notification is delivered to the user. The job appears to have succeeded even though the agent produced no useful output.
Root Cause
Two issues in cron/scheduler.py:
1. run_job() ignores the agent's failed flag (primary bug)
agent.run_conversation() returns a result dict with "failed": True, "completed": False, "error": "..." when the LLM API call fails internally (e.g. all retries exhausted). However, run_job() only reads result.get("final_response") and ignores the failed / completed / error fields:
cron/scheduler.py lines ~1099-1129:
final_response = result.get("final_response", "") or ""
# ... only uses final_response, never checks result.get("failed")
logger.info("Job '%s' completed successfully", job_name)
return True, output, final_response, None # Always returns success=True
The agent's run_agent.py line ~11507 generates:
_final_response = f"API call failed after {max_retries} retries: {_final_summary}"
return {
    "final_response": _final_response,
    "failed": True,      # <-- ignored by run_job
    "completed": False,  # <-- ignored by run_job
    "error": _final_summary,
}
Since final_response is non-empty (contains the error text), _process_job's empty-response check at line ~1277 also doesn't trigger. Result: last_status="ok" with no error notification.
2. _process_job(): empty-response check runs AFTER delivery logic
The soft-failure detection for empty responses (line ~1277) runs after the delivery attempt (line ~1269). When success=True and final_response="", delivery is skipped because should_deliver = bool("") evaluates to False. Only afterwards is success corrected to False, and by then the delivery step has already passed, so no error notification is sent:
# Delivery happens here (line ~1269), already decided based on the original success value
if should_deliver:
    delivery_error = _deliver_result(...)
# Empty response check happens AFTER delivery (line ~1277)
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response..."
Reproduction
- Configure a cron job with a custom provider that is slow/unreachable (e.g. a self-hosted endpoint)
- Let the API call time out after all retries
- Observe last_status: "ok", last_error: null, last_delivery_error: null
- No notification is delivered to the user
- The output file contains the error text but is treated as a successful run
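The same misclassification can be shown without a slow endpoint by stubbing the agent. The sketch below is self-contained and only models the behavior described in this issue: `StubAgent` and `buggy_run_job` are hypothetical stand-ins, not the real code in cron/scheduler.py, and the real `run_job()` signature may differ.

```python
class StubAgent:
    """Hypothetical stand-in that returns the failure dict described above."""

    def run_conversation(self, prompt):
        summary = "Request timed out."
        return {
            "final_response": f"API call failed after 3 retries: {summary}",
            "failed": True,
            "completed": False,
            "error": summary,
        }


def buggy_run_job(agent, prompt):
    """Simplified model of the reported behavior: only final_response is read."""
    result = agent.run_conversation(prompt)
    final_response = result.get("final_response", "") or ""
    output = final_response  # the real code builds a fuller output document
    # The failed / completed / error fields are never consulted.
    return True, output, final_response, None


success, output, final_response, error = buggy_run_job(StubAgent(), "daily report")
print(success, error)  # True None
```

Because `final_response` carries the error text, it is non-empty, so an empty-response check downstream would not catch this case either.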
Observed in Production
Job: "搞钱路子调研 - 早间 9:00" (ID: 5e91f26431f2)
Provider: custom (hrs.kstu.vip:10070)
Model: qwen3.6-27b
Last Run: 2026-04-30T01:05:55
Session: only 1 message (user prompt), no assistant reply
Output: "API call failed after 3 retries: Request timed out."
last_status: "ok" ← wrong
last_error: null ← should contain the timeout error
last_delivery_error: null
Suggested Fix
In run_job() — check the agent's failed flag before returning success:
_agent_failed = result.get("failed", False)
_agent_error = result.get("error") or ""
_agent_completed = result.get("completed", True)
# ... build output doc ...
if _agent_failed or not _agent_completed:
    error = _agent_error or final_response or "Agent failed to produce a response"
    return False, output, "", error
return True, output, final_response, None
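One way to keep this check testable is to pull it into a small pure function. The following is a minimal sketch under that assumption; `classify_agent_result` is a hypothetical helper name, and it deliberately leaves out the output document that `run_job()` builds.

```python
from typing import Optional, Tuple


def classify_agent_result(result: dict) -> Tuple[bool, str, Optional[str]]:
    """Return (success, final_response, error) for an agent result dict."""
    final_response = result.get("final_response", "") or ""
    agent_failed = bool(result.get("failed", False))
    # A missing "completed" key is treated as completed, as in the snippet above.
    agent_completed = bool(result.get("completed", True))

    if agent_failed or not agent_completed:
        error = (
            result.get("error")
            or final_response
            or "Agent failed to produce a response"
        )
        # Blank the response so the error text is not delivered as a normal reply.
        return False, "", error
    return True, final_response, None


failed = classify_agent_result({
    "final_response": "API call failed after 3 retries: Request timed out.",
    "failed": True,
    "completed": False,
    "error": "Request timed out.",
})
ok = classify_agent_result({"final_response": "done"})
```

With these inputs, `failed` is `(False, "", "Request timed out.")` and `ok` is `(True, "done", None)`.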
In _process_job() — move failure detection before delivery so error notifications are sent:
# 1. Detect failures FIRST
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response ..."
# 2. Then deliver (failed jobs get error notification)
deliver_content = final_response if success else f"⚠️ Cron job failed:\n{error}"
Optionally, also attempt delivery in the outer except Exception block of _process_job so unexpected crashes also notify the user.
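The reordering and the optional crash notification can be sketched together as follows. This is an illustration only: `process_job_sketch`, `run`, and `deliver` are hypothetical names, the returned dict keys are placeholders rather than the scheduler's real job fields, and the actual `_process_job()` does more than shown here.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)

# (success, output, final_response, error), matching run_job()'s return shape.
RunResult = Tuple[bool, str, str, Optional[str]]


def process_job_sketch(
    run: Callable[[], RunResult],
    deliver: Callable[[str], Optional[str]],
) -> dict:
    try:
        success, _output, final_response, error = run()
    except Exception as exc:
        # Unexpected crash: record it as a failure so the user is still notified.
        logger.exception("Cron job crashed")
        success, final_response, error = False, "", str(exc)

    # 1. Detect soft failures FIRST.
    if success and not final_response:
        success = False
        error = "Agent completed but produced empty response"

    # 2. Then deliver; failed jobs get an error notification.
    content = final_response if success else f"⚠️ Cron job failed:\n{error}"
    try:
        delivery_error = deliver(content)
    except Exception as exc:
        delivery_error = str(exc)

    return {"success": success, "error": error, "delivery_error": delivery_error}
```

Keeping delivery outside the first `try` block means a delivery failure is reported as a delivery error instead of being mistaken for a job crash.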
Impact
- Users are not notified when cron jobs fail silently
- Failed jobs appear as successful in hermes cron list, making debugging difficult
- No way to detect the failure without manually checking output files
🤖 Generated with Claude Code