batch_runner: surface per-prompt failures in statistics.json #29

Merged
PowerCreek merged 1 commit into main from batch-runner-failure-stats on May 22, 2026
@PowerCreek

Summary

Closes #28. While auditing the batch_runner output surface (the next step in the gateway/cron/batch_runner sweep), I found a real diagnostic gap: per-prompt failures were only printed to stdout during a run, then dropped.

  • batch_*.jsonl persists successes only (line 486: the failure branch just prints and continues).
  • statistics.json reported tool-usage stats but no failure count.
  • A 1000-prompt run where 200 prompts hit auth/rate-limit/schema errors silently produced an 800-row trajectories.jsonl, and operators training on it would get a biased dataset in which non-failing prompts are over-represented.

Three additive changes

  1. _process_batch_worker now tracks failed_in_batch + an error_samples list (capped at 3 per batch to bound memory in the worker return).
  2. The worker result dict carries failed + error_samples alongside the existing processed/skipped/tool_stats/etc.
  3. BatchRunner.run aggregates total_failed + a cross-batch capped samples list (10 max) and writes them to statistics.json as prompts_processed_attempts / prompts_failed / failure_rate / error_samples. Additive — never invalidates a consumer that only reads tool_statistics.
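
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `process_batch` and `aggregate` are hypothetical stand-ins for `_process_batch_worker` and the aggregation in `BatchRunner.run`, and the shape of a per-prompt result (`success` / `error` keys), the `prompt_index` field, the reading of `prompts_processed_attempts` as successes plus failures, and `failure_rate` as a fraction are all assumptions.

```python
from __future__ import annotations

from typing import Any

MAX_SAMPLES_PER_BATCH = 3   # per-batch cap described above
MAX_SAMPLES_PER_RUN = 10    # cross-batch cap described above
MAX_ERROR_CHARS = 200       # truncation length listed in the test plan


def process_batch(batch: list[tuple[int, dict[str, Any]]]) -> dict[str, Any]:
    """Hypothetical stand-in for _process_batch_worker.

    Each item is (prompt_index, result). A result is assumed to carry a
    truthy "success" flag and, on failure, an optional "error" string.
    """
    processed = 0
    failed = 0
    error_samples: list[dict[str, Any]] = []
    for prompt_index, result in batch:
        if result.get("success"):
            processed += 1
            continue
        failed += 1
        # Count every failure, but keep only the first few messages.
        if len(error_samples) < MAX_SAMPLES_PER_BATCH:
            message = str(result.get("error") or "unknown")[:MAX_ERROR_CHARS]
            error_samples.append({"prompt_index": prompt_index, "error": message})
    return {"processed": processed, "failed": failed, "error_samples": error_samples}


def aggregate(batch_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical stand-in for the run-level aggregation."""
    total_processed = sum(r["processed"] for r in batch_results)
    total_failed = sum(r["failed"] for r in batch_results)
    samples: list[dict[str, Any]] = []
    for r in batch_results:
        room = MAX_SAMPLES_PER_RUN - len(samples)
        samples.extend(r["error_samples"][:room])
    attempts = total_processed + total_failed  # assumed meaning of "attempts"
    return {
        "prompts_processed_attempts": attempts,
        "prompts_failed": total_failed,
        "failure_rate": total_failed / attempts if attempts else 0.0,
        "error_samples": samples,
    }
```

Capping at both levels keeps the worker return value small and the final JSON readable, while the counters still reflect every failure.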

Plus a printed summary line during the run:

❌ Prompts failed: 200 (20.0% failure rate)
   Sample errors:
     - prompt 17: authentication failed — set DEVAGENTIC_API_KEY
     - prompt 41: rate limit exceeded (retry-after 60s)
     - prompt 102: schema mismatch on tool call
     ... and 7 more in statistics.json
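
A summary like the one above could be produced along these lines. This is only a sketch: `format_failure_summary` is a hypothetical helper, and the sample dicts with `prompt_index` and `error` keys are an assumed shape rather than the PR's actual structure.

```python
from __future__ import annotations

from typing import Any


def format_failure_summary(
    total_failed: int,
    attempts: int,
    samples: list[dict[str, Any]],
    max_printed: int = 3,
) -> list[str]:
    """Build the printed failure summary as a list of lines.

    Returns an empty list when nothing failed, so a clean run prints nothing.
    """
    if total_failed == 0:
        return []
    rate = 100.0 * total_failed / attempts if attempts else 0.0
    lines = [f"❌ Prompts failed: {total_failed} ({rate:.1f}% failure rate)"]
    shown = samples[:max_printed]
    if shown:
        lines.append("   Sample errors:")
        for sample in shown:
            lines.append(f"     - prompt {sample['prompt_index']}: {sample['error']}")
        # Point at the JSON for the samples that were kept but not printed.
        remaining = len(samples) - len(shown)
        if remaining > 0:
            lines.append(f"     ... and {remaining} more in statistics.json")
    return lines
```

With 200 failures out of 1000 attempts and 10 stored samples, the helper prints three samples and reports the other seven as available in statistics.json.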

Test plan

  • 7 new tests in tests/test_batch_runner_failure_stats.py:
    • worker counts failures correctly
    • sample cap (3 per batch) honored
    • error string truncated to 200 chars
    • "unknown" fallback when the underlying result has no error key
    • mixed success/failure prompts counted correctly
    • no-failure batch returns empty samples
    • run-level aggregation produces correct prompts_failed / failure_rate / cross-batch sample list
    • end-to-end: stub the worker and assert the new keys land in statistics.json
  • pytest tests/test_batch_runner_failure_stats.py tests/test_batch_runner_checkpoint.py → 23 passed (16 existing + 7 new). No regression.
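
For readers unfamiliar with the end-to-end case, a test of that shape might look like the sketch below. It does not reproduce the tests in tests/test_batch_runner_failure_stats.py: `stub_worker` and `write_statistics` are hypothetical stand-ins, and the result and sample shapes are assumptions.

```python
import json
import tempfile
from pathlib import Path


def stub_worker(batch):
    """Hypothetical stub worker: every prompt fails, at most 3 samples kept."""
    samples = [{"prompt_index": i, "error": "unknown"} for i in batch[:3]]
    return {"processed": 0, "failed": len(batch), "error_samples": samples}


def write_statistics(batches, worker, out_dir):
    """Hypothetical stand-in for the part of BatchRunner.run that writes stats."""
    results = [worker(b) for b in batches]
    failed = sum(r["failed"] for r in results)
    attempts = failed + sum(r["processed"] for r in results)
    stats = {
        "tool_statistics": {},  # existing key, kept so old consumers still work
        "prompts_processed_attempts": attempts,
        "prompts_failed": failed,
        "failure_rate": failed / attempts if attempts else 0.0,
        "error_samples": [s for r in results for s in r["error_samples"]][:10],
    }
    path = Path(out_dir) / "statistics.json"
    path.write_text(json.dumps(stats, indent=2), encoding="utf-8")
    return path


def test_new_keys_land_in_statistics_json():
    with tempfile.TemporaryDirectory() as tmp:
        path = write_statistics([[0, 1, 2, 3], [4, 5]], stub_worker, tmp)
        stats = json.loads(path.read_text(encoding="utf-8"))
    assert stats["prompts_failed"] == 6
    assert stats["prompts_processed_attempts"] == 6
    assert stats["failure_rate"] == 1.0
    # 3 samples from the first batch (capped) plus 2 from the second.
    assert len(stats["error_samples"]) == 5
    # The new keys are additive: the pre-existing key is still present.
    assert "tool_statistics" in stats
```

Stubbing the worker keeps the test independent of any model or network call, so it only exercises the aggregation and the file write.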

Filed by hermes-maintainer (PowerCreek).

Audited the batch_runner output surface and found that per-prompt
failures were only printed to stdout during a run, then dropped:
batch_*.jsonl persists successes only; statistics.json had
tool-usage stats but no failure count; operators training on
trajectories.jsonl could silently get a biased dataset (the
non-failing prompts over-represented).

Three additive changes:

1. _process_batch_worker tracks failed_in_batch + an error_samples
   list (capped at 3 per batch to bound memory in the worker
   return).
2. The worker result dict carries `failed` + `error_samples`
   alongside the existing keys.
3. BatchRunner.run aggregates total_failed + a cross-batch
   capped samples list (10 max) and writes them to statistics.json
   as prompts_processed_attempts / prompts_failed / failure_rate /
   error_samples — additive, never invalidates existing consumers
   of tool_statistics.

Plus a printed "❌ Prompts failed: N (rate%)" summary line + up to
3 sample errors, so operators see the signal during the run too,
not just in the JSON after.

Closes #28.
PowerCreek merged commit c6d4b30 into main on May 22, 2026
PowerCreek deleted the batch-runner-failure-stats branch on May 22, 2026 23:38
PowerCreek added a commit that referenced this pull request May 22, 2026

Completing the gateway/cron/batch_runner audit sweep:
  - cron probe shipped in #27 (gateway PID + recent failures)
  - batch_runner failure stats shipped in #29 (per-prompt failures)
  - this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was
enabled. The gateway already maintains a rich runtime status file
at gateway/status.py:read_runtime_status() — keyed by
gateway_state ∈ {starting, running, draining, stopped,
startup_failed, degraded}, with exit_reason, start_time,
active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:

- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single
check_info line; any fatal platform becomes check_fail with
error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never
started — byte-stable default).

Closes #30.
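
The state-to-severity mapping described in the commit message above can be sketched as below. This is an illustration under assumptions, not the code behind `_check_gateway_runtime()`: the severities are plain strings standing in for check_ok / check_warn / check_fail / check_info, the status dict is assumed to expose `pid`, and each platform entry is assumed to expose a `state` field alongside `error_message`. Uptime is omitted because the format of `start_time` is not given.

```python
from __future__ import annotations

from typing import Any, Optional


def classify_gateway_state(status: Optional[dict[str, Any]]) -> Optional[tuple[str, str]]:
    """Map a runtime status dict to (severity, message).

    Returns None when there is no status at all, so the check stays inert
    if the gateway was never started.
    """
    if status is None:
        return None
    state = status.get("gateway_state")
    pid = status.get("pid")  # assumed key name
    if state is None:
        if pid is not None:
            return ("warn", "PID present but no gateway_state (old build or stale status)")
        return None
    if state == "running":
        agents = status.get("active_agents", 0)
        return ("ok", f"gateway running (pid {pid}, {agents} active agents)")
    if state == "degraded":
        return ("warn", "gateway degraded; see platform errors below")
    if state == "startup_failed":
        reason = status.get("exit_reason", "unknown")
        return ("fail", f"gateway startup failed: {reason} (updated {status.get('updated_at')})")
    if state in ("stopped", "starting", "draining"):
        return ("info", f"gateway {state}")
    # Unlisted states are reported as info here; that is this sketch's choice.
    return ("info", f"gateway state unrecognized: {state!r}")


def classify_platforms(platforms: dict[str, dict[str, Any]]) -> list[tuple[str, str]]:
    """Per-platform health: one info line for connected, then fail/warn lines."""
    checks: list[tuple[str, str]] = []
    connected = sorted(n for n, p in platforms.items() if p.get("state") == "connected")
    if connected:
        checks.append(("info", "connected platforms: " + ", ".join(connected)))
    for name, platform in sorted(platforms.items()):
        state = platform.get("state")
        if state == "fatal":
            checks.append(("fail", f"{name}: {platform.get('error_message', 'unknown error')}"))
        elif state in ("paused", "retrying"):
            checks.append(("warn", f"{name}: {state}"))
    return checks
```

Treating `stopped` as informational rather than a failure matches the intent stated above: an intentional stop should not raise an alert.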
Development

Successfully merging this pull request may close these issues.

batch_runner: per-prompt failures are invisible after a run — statistics.json has tool-usage but no failure count
