Cron ticker silently dies when prompt-injection scanner blocks a job — entire cron scheduler stops until gateway restart #36854

@temalo

Summary

When the cron prompt-injection scanner blocks a job (the "exfil_curl_url" / "Cron prompts must not contain injection or exfiltration payloads" path in tools/cronjob_tools.py), the resulting exception silently kills the gateway's cron ticker thread. The gateway process stays up and systemd shows the unit as active (running), but no further cron job fires until the gateway is restarted. The heartbeat file ~/.hermes/cron/.tick.lock stops updating at the moment of the scanner block.

This turns a single bad-job event into a complete cron outage for every job in jobs.json. The failure is invisible: no errors are written to gateway.log after the scanner WARNING, the systemd unit looks healthy, and hermes gateway list reports the gateway as running.
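The failure mode described above can be reproduced in plain Python, independent of Hermes. The sketch below is an assumption about the shape of the problem, not the actual scheduler code: ScannerBlocked, run_job, and ticker are hypothetical names. It shows that an exception escaping a worker thread ends only that thread, while the rest of the process carries on.

```python
import threading


class ScannerBlocked(Exception):
    """Hypothetical stand-in for whatever the scanner actually raises."""


def run_job(job):
    # Assumption: a blocked job surfaces as an exception from the per-job step.
    if job["blocked"]:
        raise ScannerBlocked(f"job {job['id']} blocked by scanner")


def ticker(jobs, fired):
    # No per-job exception handling: the first raise unwinds the loop
    # and ends this thread, while the rest of the process keeps running.
    for job in jobs:
        run_job(job)
        fired.append(job["id"])


def demo():
    jobs = [
        {"id": "a", "blocked": False},
        {"id": "b", "blocked": True},
        {"id": "c", "blocked": False},
    ]
    fired = []
    thread = threading.Thread(target=ticker, args=(jobs, fired), daemon=True)
    thread.start()
    thread.join(timeout=5)
    # The main thread is still here; only the ticker thread has ended.
    return fired, thread.is_alive()
```

Calling demo() returns (["a"], False): job "a" fires, job "b" raises, and the healthy job "c" never runs. In plain Python the default threading.excepthook prints a traceback to stderr; whether that output ever reaches gateway.log depends on how the gateway wires its logging, which I have not checked.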

Repro

  1. Create or edit a cron job whose assembled prompt contains a payload that triggers _CRON_EXFIL_COMMAND_PATTERNS — easiest is a delivery-from-prompt block of the shape:
    curl -sS -X POST "https://api.telegram.org/bot$TELEGRAM_BOT_TOKEN/sendMessage" -d "chat_id=..." --data-urlencode "text=$REPORT"
    
    (token variable substituted into the URL — exfil_curl_url).
  2. Trigger that job: hermes cron run <job_id>.
  3. On the next tick the scheduler logs:
    WARNING cron.scheduler: Job '...' (ID: ...): blocked by prompt-injection scanner — Blocked: prompt matches threat pattern 'exfil_curl_url'.
    
  4. The job is correctly marked last_status: error, next_run_at advances. So far so good.
  5. From this point on, no further cron job fires. .tick.lock stops updating. Other jobs whose next_run_at passes are skipped silently: no error, no warning, no last_run_at advance.
  6. systemctl is-active hermes-gateway.service reports active. The gateway process is still alive (Telegram polling, kanban dispatcher, etc. continue working). Only the cron ticker thread is dead.
  7. Restarting the gateway (sudo systemctl restart hermes-gateway.service) restores cron ticking immediately.
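To illustrate why the command in step 1 trips the scanner, here is a hedged sketch of what an "exfil_curl_url"-style check might look like. The actual entries in _CRON_EXFIL_COMMAND_PATTERNS are not quoted in this report, so the regex and the function name below are guesses for illustration only.

```python
import re

# Illustrative guess only: the real _CRON_EXFIL_COMMAND_PATTERNS entries are
# not quoted in this report. This flags a curl command whose URL embeds a
# shell variable such as $TELEGRAM_BOT_TOKEN.
HYPOTHETICAL_EXFIL_CURL_URL = re.compile(
    r"curl\b[^\n]*https?://[^\s\"']*\$\{?[A-Za-z_][A-Za-z0-9_]*"
)


def looks_like_exfil_curl_url(prompt: str) -> bool:
    # True when a curl invocation has a variable expansion inside its URL.
    return HYPOTHETICAL_EXFIL_CURL_URL.search(prompt) is not None
```

Under this guess, the Telegram delivery command from step 1 matches because the token variable sits inside the URL, while a curl call that only uses a variable in a -d argument does not.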

Observed

  • Gateway: hermes-gateway.service running default profile, version installed today via hermes -p default gateway install --system.
  • One healthy tick at 14:35:08 fired two cron jobs successfully.
  • Scanner block triggered at 14:41:11 on a corpusiq-agent-profile job invoked from the default-profile scheduler.
  • .tick.lock Modify timestamp froze at 14:41:11 and stayed there until a manual systemctl restart at 14:49.
  • Between 14:41 and 14:49, four other cron jobs had next_run_at slots come and go with no fire.
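Since the frozen .tick.lock timestamp was the only visible symptom, an external staleness check on that file would have caught the outage. The sketch below assumes the ticker touches the file on every tick (consistent with what I observed, but not confirmed from the code). The function names and the 180-second threshold are hypothetical; the right threshold depends on the tick interval.

```python
import os
import time


def tick_lock_age_seconds(path, now=None):
    # Age of the heartbeat file, based on its last-modified time.
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def ticker_looks_stalled(path, max_age_seconds=180.0, now=None):
    # Assumption: a healthy ticker updates the file's mtime on every tick,
    # so an old mtime (or a missing file) suggests the ticker is dead.
    try:
        return tick_lock_age_seconds(path, now) > max_age_seconds
    except FileNotFoundError:
        return True
```

For example, ticker_looks_stalled(os.path.expanduser("~/.hermes/cron/.tick.lock")) could run from a separate timer or monitoring job, since the gateway's own health signals stayed green during the outage.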

Expected

The scanner block should mark the offending job as errored (already happens) and the ticker should continue ticking. One job's bad prompt should not silently take out the entire scheduler.

Probable cause (without reading the patch path)

The exception raised by the scanner appears to bubble out of the tick loop instead of being caught per-job. Wrapping the per-job processing block, including the scanner call and any other prompt-assembly step, in a try/except so that the ticker logs the failure and moves on to the next job would likely fix it.
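A minimal sketch of the per-job isolation suggested here, assuming the tick loop iterates over due jobs and calls some per-job function. tick_once, run_job, and mark_error are hypothetical names, not the actual Hermes scheduler API; only the cron.scheduler logger name is taken from the log line above.

```python
import logging

logger = logging.getLogger("cron.scheduler")


def tick_once(due_jobs, run_job, mark_error):
    """Run every due job; a failure in one job must not stop the others."""
    fired = []
    for job in due_jobs:
        try:
            # Assumption: run_job covers prompt assembly, the scanner call,
            # and execution, so any of them may raise.
            run_job(job)
        except Exception as exc:
            # Log, record the error on this job only, and keep ticking.
            logger.warning("Job %r failed during tick: %s", job.get("id"), exc)
            mark_error(job, exc)
            continue
        fired.append(job["id"])
    return fired
```

With this shape, a scanner block on one job is recorded as that job's error and the remaining due jobs still fire in the same tick. If mark_error can itself raise, it would need its own guard.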

A secondary improvement: when the cron ticker thread dies, the gateway should either (a) restart it, or (b) exit non-zero so systemd's Restart=always brings the gateway back. Right now the gateway process survives in a half-functional state, which makes the outage invisible to all health checks.
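Both options can be combined in a small supervisor: restart the ticker a bounded number of times, then exit non-zero so systemd takes over. This is a sketch under assumptions, not a proposal for the exact gateway code: supervise, make_ticker_thread, and the restart limit are hypothetical. It uses os._exit(1) as the default give-up action because sys.exit() called from a non-main thread only ends that thread.

```python
import os
import threading
from typing import Callable, Optional


def supervise(
    make_ticker_thread: Callable[[], threading.Thread],
    stop_event: threading.Event,
    poll_seconds: float = 5.0,
    max_restarts: int = 3,
    on_give_up: Optional[Callable[[], None]] = None,
) -> int:
    """Restart a dead ticker thread up to max_restarts times, then give up."""
    restarts = 0
    thread = make_ticker_thread()
    thread.start()
    while not stop_event.is_set():
        # Wake up when the thread ends or when the poll interval elapses.
        thread.join(timeout=poll_seconds)
        if thread.is_alive() or stop_event.is_set():
            continue
        if restarts >= max_restarts:
            if on_give_up is None:
                # Exit the whole process non-zero so the service manager
                # (e.g. systemd with Restart=always) can restart it.
                os._exit(1)
            on_give_up()
            return restarts
        restarts += 1
        thread = make_ticker_thread()
        thread.start()
    return restarts
```

The bounded restart count is a design choice: it avoids a tight crash loop inside the process while still guaranteeing that a persistently failing ticker becomes visible to systemd instead of leaving the gateway half-functional.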

Environment

  • Hermes: install at ~/.hermes/hermes-agent/venv, current main as of 2026-05-31.
  • OS: Ubuntu under WSL2-style userspace, systemd-managed gateway via hermes -p <profile> gateway install --system.
  • Two gateway units active (default + named profile), neither raced; the failure was in the default-profile gateway only.

Happy to provide the exact systemd journal excerpt and the corresponding jobs.json snapshot (sanitized) if useful.

Metadata

Labels

  • P1 (High: major feature broken, no workaround)
  • comp/cron (Cron scheduler and job management)
  • type/bug (Something isn't working)
