Skip to content

Cron: API failure incorrectly reported as last_status=ok, no error notification delivered #17855

@Kim-Huang-JunKai

Description

@Kim-Huang-JunKai

Summary

When a cron job's LLM API call fails (e.g. timeout, retries exhausted), the job's last_status is incorrectly set to "ok" and no error notification is delivered to the user. The job appears to have succeeded even though the agent produced no useful output.

Root Cause

Two issues in cron/scheduler.py:

1. run_job() ignores the agent's failed flag (primary bug)

agent.run_conversation() returns a result dict with "failed": True, "completed": False, "error": "..." when the LLM API call fails internally (e.g. all retries exhausted). However, run_job() only reads result.get("final_response") and ignores the failed / completed / error fields:

cron/scheduler.py lines ~1099-1129:

final_response = result.get("final_response", "") or ""
# ... only uses final_response, never checks result.get("failed")
logger.info("Job '%s' completed successfully", job_name)
return True, output, final_response, None  # Always returns success=True

The agent's run_agent.py line ~11507 generates:

_final_response = f"API call failed after {max_retries} retries: {_final_summary}"
return {
    "final_response": _final_response,
    "failed": True,        # <-- ignored by run_job
    "completed": False,    # <-- ignored by run_job
    "error": _final_summary,
}

Since final_response is non-empty (contains the error text), _process_job's empty-response check at line ~1277 also doesn't trigger. Result: last_status="ok" with no error notification.

2. _process_job(): empty-response check runs AFTER delivery logic

The soft-failure detection for empty responses (line ~1277) happens after the delivery attempt (line ~1269). When success=True and final_response="", delivery is skipped because should_deliver = bool("") == False, then success is corrected to False — but by then the delivery window has passed:

# Delivery happens here (line ~1269) — already decided based on original success
if should_deliver:
    delivery_error = _deliver_result(...)

# Empty response check happens AFTER delivery (line ~1277)
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response..."

Reproduction

  1. Configure a cron job with a custom provider that is slow/unreachable (e.g. a self-hosted endpoint)
  2. Let the API call time out after all retries
  3. Observe last_status: "ok", last_error: null, last_delivery_error: null
  4. No notification is delivered to the user
  5. The output file contains the error text but is treated as a successful run

Observed in Production

Job: "搞钱路子调研 - 早间 9:00" (ID: 5e91f26431f2)
Provider: custom (hrs.kstu.vip:10070)
Model: qwen3.6-27b
Last Run: 2026-04-30T01:05:55
Session: only 1 message (user prompt), no assistant reply
Output: "API call failed after 3 retries: Request timed out."
last_status: "ok"   ← wrong
last_error: null     ← should contain the timeout error
last_delivery_error: null

Suggested Fix

In run_job() — check the agent's failed flag before returning success:

_agent_failed = result.get("failed", False)
_agent_error = result.get("error") or ""
_agent_completed = result.get("completed", True)

# ... build output doc ...

if _agent_failed or not _agent_completed:
    error = _agent_error or final_response or "Agent failed to produce a response"
    return False, output, "", error

return True, output, final_response, None

In _process_job() — move failure detection before delivery so error notifications are sent:

# 1. Detect failures FIRST
if success and not final_response:
    success = False
    error = "Agent completed but produced empty response ..."

# 2. Then deliver (failed jobs get error notification)
deliver_content = final_response if success else f"⚠️ Cron job failed:\n{error}"

Optionally, also attempt delivery in the outer except Exception block of _process_job so unexpected crashes also notify the user.

Impact

  • Users are not notified when cron jobs fail silently
  • Failed jobs appear as successful in hermes cron list, making debugging difficult
  • No way to detect the failure without manually checking output files

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/cronCron scheduler and job managementtype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions