batch_runner: surface per-prompt failures in statistics.json by PowerCreek · Pull Request #29 · TechDevGroup/hermes-agent

PowerCreek · 2026-05-22T23:38:11Z

Summary

Closes #28. While auditing the batch_runner output surface (next in the gateway/cron/batch_runner sweep), I found a real diagnostic gap: per-prompt failures were only printed to stdout during a run, then dropped.

- batch_*.jsonl persists successes only (line 486: the failure branch just prints ❌ and continues).
- statistics.json reported tool-usage stats but no failure count.
- A 1000-prompt run where 200 prompts hit auth/rate-limit/schema errors silently produced an 800-row trajectories.jsonl, and operators training on it would work with a biased dataset (non-failing prompts over-represented).

Three additive changes

1. _process_batch_worker now tracks failed_in_batch + an error_samples list (capped at 3 per batch to bound memory in the worker return).
2. The worker result dict carries failed + error_samples alongside the existing processed/skipped/tool_stats/etc.
3. BatchRunner.run aggregates total_failed + a cross-batch capped samples list (10 max) and writes them to statistics.json as prompts_processed_attempts / prompts_failed / failure_rate / error_samples. Additive: never invalidates a consumer that only reads tool_statistics.
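A minimal sketch of the worker-side bookkeeping described above, not the actual `_process_batch_worker` code. The function name `summarize_batch` and the per-prompt result shape (`prompt_index`, `success`, `error`) are assumptions made for illustration, as is the choice that `processed` counts successes only. The caps (3 samples per batch, 200-character errors) and the "unknown" fallback come from this PR.

```python
from typing import Any

MAX_SAMPLES_PER_BATCH = 3  # per-batch sample cap stated in this PR
MAX_ERROR_CHARS = 200      # truncation length from the test plan


def summarize_batch(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Count failures in one batch and keep a few truncated error samples.

    Assumed (hypothetical) result shape:
    {"prompt_index": int, "success": bool, "error": str}  # "error" optional
    """
    processed = 0
    failed_in_batch = 0
    error_samples: list[dict[str, Any]] = []

    for result in results:
        if result.get("success"):
            processed += 1
            continue

        failed_in_batch += 1
        # Every failure is counted, but only the first few are sampled.
        if len(error_samples) < MAX_SAMPLES_PER_BATCH:
            message = str(result.get("error") or "unknown")[:MAX_ERROR_CHARS]
            error_samples.append(
                {"prompt_index": result.get("prompt_index"), "error": message}
            )

    return {
        "processed": processed,
        "failed": failed_in_batch,
        "error_samples": error_samples,
    }
```

Counting and sampling are kept separate, so the failure count stays exact even when the sample list is capped.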

Plus a printed summary line during the run:

❌ Prompts failed: 200 (20.0% failure rate)
   Sample errors:
     - prompt 17: authentication failed — set DEVAGENTIC_API_KEY
     - prompt 41: rate limit exceeded (retry-after 60s)
     - prompt 102: schema mismatch on tool call
     ... and 7 more in statistics.json
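The run-level aggregation and the summary above could be produced roughly as follows. This is a sketch under stated assumptions, not the code in `BatchRunner.run`: the helper names are hypothetical, `prompts_processed_attempts` is assumed to be successes plus failures, and `failure_rate` is assumed to be stored as a fraction (0.2 for 20%). The statistics.json key names and the 10-sample cap come from this PR.

```python
from typing import Any

MAX_RUN_SAMPLES = 10  # cross-batch sample cap stated in this PR
SAMPLES_PRINTED = 3   # number of samples shown on stdout


def aggregate_failures(batch_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold per-batch worker results into the new statistics.json keys."""
    total_processed = sum(b.get("processed", 0) for b in batch_results)
    total_failed = sum(b.get("failed", 0) for b in batch_results)

    samples: list[dict[str, Any]] = []
    for batch in batch_results:
        samples.extend(batch.get("error_samples", []))
    samples = samples[:MAX_RUN_SAMPLES]  # keep the first 10 across batches

    attempts = total_processed + total_failed  # assumption: successes + failures
    return {
        "prompts_processed_attempts": attempts,
        "prompts_failed": total_failed,
        "failure_rate": total_failed / attempts if attempts else 0.0,
        "error_samples": samples,
    }


def format_failure_summary(stats: dict[str, Any]) -> list[str]:
    """Build the stdout summary lines; empty when nothing failed."""
    if not stats["prompts_failed"]:
        return []
    lines = [
        f"❌ Prompts failed: {stats['prompts_failed']} "
        f"({stats['failure_rate']:.1%} failure rate)"
    ]
    samples = stats["error_samples"]
    if samples:
        lines.append("   Sample errors:")
        for sample in samples[:SAMPLES_PRINTED]:
            lines.append(f"     - prompt {sample['prompt_index']}: {sample['error']}")
        remaining = len(samples) - SAMPLES_PRINTED
        if remaining > 0:
            lines.append(f"     ... and {remaining} more in statistics.json")
    return lines
```

Guarding the division keeps `failure_rate` defined for an empty run, and returning no lines when nothing failed keeps a clean run's stdout unchanged.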

Test plan

7 new tests in tests/test_batch_runner_failure_stats.py:
- worker counts failures correctly
- sample cap (3 per batch) honored
- error string truncated to 200 chars
- "unknown" fallback when the underlying result has no error key
- mixed success/failure prompts counted correctly
- no-failure batch returns empty samples
- run-level aggregation produces correct prompts_failed / failure_rate / cross-batch sample list
- end-to-end: stub the worker and assert the new keys land in statistics.json
pytest tests/test_batch_runner_failure_stats.py tests/test_batch_runner_checkpoint.py → 23 passed (16 existing + 7 new). No regression.

Filed by hermes-maintainer (PowerCreek).

An audit of the batch_runner output surface found that per-prompt failures were only printed to stdout during a run, then dropped:

- batch_*.jsonl persists successes only
- statistics.json had tool-usage stats but no failure count
- operators training on trajectories.jsonl could silently get a biased dataset (non-failing prompts over-represented)

Three additive changes:

1. _process_batch_worker tracks failed_in_batch + an error_samples list (capped at 3 per batch to bound memory in the worker return).
2. The worker result dict carries `failed` + `error_samples` alongside the existing keys.
3. BatchRunner.run aggregates total_failed + a cross-batch capped samples list (10 max) and writes them to statistics.json as prompts_processed_attempts / prompts_failed / failure_rate / error_samples (additive; never invalidates existing consumers of tool_statistics).

Plus a printed "❌ Prompts failed: N (rate%)" summary line and up to 3 sample errors, so operators see the signal during the run too, not just in the JSON afterwards.

Closes #28.

Completing the gateway/cron/batch_runner audit sweep:

- cron probe shipped in #27 (gateway PID + recent failures)
- batch_runner failure stats shipped in #29 (per-prompt failures)
- this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was enabled. The gateway already maintains a rich runtime status file at gateway/status.py:read_runtime_status(), keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded}, with exit_reason, start_time, active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:

- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single check_info line; any fatal platform becomes check_fail with error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never started; byte-stable default).

Closes #30.
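The state-to-severity mapping listed for `_check_gateway_runtime()` can be sketched as a pure function. This is an illustration only: `classify_gateway_state` is a hypothetical name, the status dict is assumed to expose `gateway_state`, `pid`, `active_agents`, `exit_reason` and `updated_at` as plain keys, and the returned level strings stand in for the check_ok / check_warn / check_fail / check_info helpers. Uptime and per-platform health are left out.

```python
from typing import Any, Optional


def classify_gateway_state(
    status: Optional[dict[str, Any]],
) -> Optional[tuple[str, str]]:
    """Map a runtime status dict to (level, message); None means print nothing."""
    if not status:
        return None  # no runtime status file: stay inert

    state = status.get("gateway_state")
    pid = status.get("pid")  # assumed key name

    if state == "running":
        agents = status.get("active_agents")
        return ("ok", f"gateway running (pid {pid}, active_agents={agents})")
    if state == "degraded":
        return ("warn", "gateway degraded; see platform errors below")
    if state == "startup_failed":
        return (
            "fail",
            f"gateway startup failed: {status.get('exit_reason')} "
            f"(updated_at {status.get('updated_at')})",
        )
    if state == "stopped":
        return ("info", "gateway stopped")  # intentional stop, not an alert
    if state in ("starting", "draining"):
        return ("info", f"gateway {state} (transient)")
    if state is None and pid is not None:
        return ("warn", f"pid {pid} recorded but no gateway_state (stale status?)")
    # Anything else is not covered by the PR text; this sketch stays silent.
    return None
```

Returning a value instead of printing keeps the mapping testable without capturing doctor output.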

PowerCreek merged commit c6d4b30 into main May 22, 2026

PowerCreek deleted the batch-runner-failure-stats branch May 22, 2026 23:38

This was referenced May 22, 2026

hermes doctor surfaces gateway linger but not runtime state (state / uptime / active_agents / per-platform health) #30

Closed

doctor: probe gateway runtime state (state / uptime / per-platform) #31

Merged

PowerCreek merged 1 commit into main from batch-runner-failure-stats
