batch_runner: surface per-prompt failures in statistics.json #29
Merged
Auditing the batch_runner output surface and found per-prompt failures were only printed to stdout during a run, then dropped: batch_*.jsonl persists successes only; statistics.json had tool-usage stats but no failure count; operators training on trajectories.jsonl could silently get a biased dataset (the non-failing prompts over-represented).

Three additive changes:
1. _process_batch_worker tracks failed_in_batch + an error_samples list (capped at 3 per batch to bound memory in the worker return).
2. The worker result dict carries `failed` + `error_samples` alongside the existing keys.
3. BatchRunner.run aggregates total_failed + a cross-batch capped samples list (10 max) and writes them to statistics.json as prompts_processed_attempts / prompts_failed / failure_rate / error_samples. This is additive and never invalidates existing consumers of tool_statistics.

Plus a printed "❌ Prompts failed: N (rate%)" summary line + up to 3 sample errors, so operators see the signal during the run too, not just in the JSON after. Closes #28.
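The worker-side bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_process_batch_worker`: the function name `process_batch`, the `run_one` callable, the `prompt_index` field, and the convention that a failed prompt returns a dict with an `error` key are all assumptions made for the example. Only the cap of 3 samples per batch and the `failed` / `error_samples` keys come from the description.

```python
from typing import Callable, Dict, List

MAX_SAMPLES_PER_BATCH = 3  # cap quoted in the PR description


def process_batch(prompts: List[str], run_one: Callable[[str], dict]) -> Dict[str, object]:
    """Hypothetical stand-in for the worker; run_one is an assumed per-prompt callable."""
    processed = 0          # assumption: counts successful prompts only
    failed_in_batch = 0
    error_samples: List[dict] = []

    for index, prompt in enumerate(prompts):
        result = run_one(prompt)
        if "error" in result:  # assumption: failures are signalled by an "error" key
            failed_in_batch += 1
            # Keep only the first few errors so the worker return stays small.
            if len(error_samples) < MAX_SAMPLES_PER_BATCH:
                error_samples.append({"prompt_index": index, "error": str(result["error"])})
        else:
            processed += 1

    return {"processed": processed, "failed": failed_in_batch, "error_samples": error_samples}
```

Every failure is still counted once the sample list is full; only the stored messages are bounded.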
This was referenced May 22, 2026
PowerCreek added a commit that referenced this pull request May 22, 2026
Completing the gateway/cron/batch_runner audit sweep:
- cron probe shipped in #27 (gateway PID + recent failures)
- batch_runner failure stats shipped in #29 (per-prompt failures)
- this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was enabled. The gateway already maintains a rich runtime status file at gateway/status.py:read_runtime_status(), keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded}, with exit_reason, start_time, active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:
- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single check_info line; any fatal platform becomes check_fail with error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never started; byte-stable default). Closes #30.
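The state-to-severity mapping listed in that commit message can be sketched as a small pure function. This is an illustrative sketch only: `classify_gateway_state`, the `pid` key, and the returned `(level, message)` tuples are hypothetical, and the real `_check_gateway_runtime()` presumably calls the doctor's own check_ok / check_warn / check_fail / check_info helpers instead. Returning `None` when neither a state nor a PID is present is also an assumption, since the source does not cover that case.

```python
from typing import Optional, Tuple


def classify_gateway_state(status: Optional[dict]) -> Optional[Tuple[str, str]]:
    """Map a runtime status dict to a (level, message) pair; None means 'emit nothing'."""
    if status is None:
        return None  # status file absent: the check stays inert

    state = status.get("gateway_state")
    pid = status.get("pid")  # hypothetical key name

    if state == "running":
        agents = status.get("active_agents")
        return ("ok", f"gateway running (pid={pid}, active_agents={agents})")
    if state == "degraded":
        return ("warn", "gateway degraded, see platform errors")
    if state == "startup_failed":
        return ("fail", f"gateway startup failed: {status.get('exit_reason')}")
    if state == "stopped":
        return ("info", "gateway stopped (intentional stop, not an alert)")
    if state in ("starting", "draining"):
        return ("info", f"gateway {state} (transient)")
    if pid is not None:
        return ("warn", "PID present but no state (old build or stale status)")
    return None  # assumption: nothing to report without a state or PID
```

Keeping the classification separate from the printing makes each branch easy to test without a running gateway.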
Summary
Closes #28. Audited the batch_runner output surface (next in the gateway/cron/batch_runner sweep) and found a real diagnostic gap: per-prompt failures were only printed to stdout during a run, then dropped.
- batch_*.jsonl persists successes only (line 486: the failure branch just prints ❌ and continues).
- statistics.json reported tool-usage stats but no failure count.
- Operators training on trajectories.jsonl would silently work with a biased dataset (non-failing prompts over-represented).

Three additive changes
1. _process_batch_worker now tracks failed_in_batch + an error_samples list (capped at 3 per batch to bound memory in the worker return).
2. The worker result dict carries failed + error_samples alongside the existing processed / skipped / tool_stats / etc.
3. BatchRunner.run aggregates total_failed + a cross-batch capped samples list (10 max) and writes them to statistics.json as prompts_processed_attempts / prompts_failed / failure_rate / error_samples. Additive: never invalidates a consumer that only reads tool_statistics.

Plus a printed summary line during the run: "❌ Prompts failed: N (rate%)", followed by up to 3 sample errors.
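The cross-batch aggregation described above might look roughly like this. It is a sketch under stated assumptions, not the code in BatchRunner.run: the helper name `aggregate_failures` is hypothetical, `prompts_processed_attempts` is assumed to be successes plus failures, and `failure_rate` is assumed to be a fraction between 0 and 1 (the real field may be a percentage). The four output keys and the 10-sample cap are taken from the description.

```python
from typing import Dict, List

MAX_SAMPLES_TOTAL = 10  # cross-batch cap quoted in the PR description


def aggregate_failures(batch_results: List[dict]) -> Dict[str, object]:
    """Fold per-batch worker results into the failure fields for statistics.json."""
    total_processed = 0
    total_failed = 0
    samples: List[dict] = []

    for result in batch_results:
        total_processed += result.get("processed", 0)
        total_failed += result.get("failed", 0)
        # Take only as many samples as still fit under the global cap.
        room = MAX_SAMPLES_TOTAL - len(samples)
        samples.extend(result.get("error_samples", [])[:room])

    attempts = total_processed + total_failed  # assumption: attempts = successes + failures
    failure_rate = total_failed / attempts if attempts else 0.0

    return {
        "prompts_processed_attempts": attempts,
        "prompts_failed": total_failed,
        "failure_rate": round(failure_rate, 4),
        "error_samples": samples,
    }


# Merging into an existing statistics dict only adds keys, so a consumer
# that reads just tool_statistics sees the same data as before.
statistics = {"tool_statistics": {"search": {"count": 12}}}
statistics.update(aggregate_failures([
    {"processed": 8, "failed": 2, "error_samples": [{"error": "timeout"}, {"error": "rate limit"}]},
    {"processed": 10, "failed": 0, "error_samples": []},
]))
```

Guarding the division keeps an empty run from raising ZeroDivisionError while still writing a well-formed record.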
Test plan
tests/test_batch_runner_failure_stats.py: covers the error key, prompts_failed / failure_rate / cross-batch sample list, and statistics.json.
pytest tests/test_batch_runner_failure_stats.py tests/test_batch_runner_checkpoint.py → 23 passed (16 existing + 7 new). No regression.
Filed by hermes-maintainer (PowerCreek).