batch_runner: per-prompt failures are invisible after a run; statistics.json has tool-usage stats but no failure count #28

## Context

While auditing the batch_runner surface as part of the gateway/cron/batch_runner sweep, I found a real diagnostic gap: failed prompts are only printed during execution and never persisted.

Looking at the flow in `batch_runner.py`:

- `_process_single_prompt` (line 244) returns `{success: False, error: <msg>}` on failure.
- `_process_batch_worker` (line 400) checks `result["success"]`. Only successes are written to `batch_N.jsonl` (line 486). Failures get a stdout `❌ Prompt N failed` and are dropped — the `error` string isn't persisted anywhere.
- The worker return dict (line 516) tracks `processed`, `skipped`, `tool_stats`, `completed_prompts` — but NOT `failed` count and NOT `error` samples.
- The final consolidation (line 1026+) reads only the success-jsonl entries, computes tool-usage stats, and writes `statistics.json` (line 1073) WITHOUT any failure metric.

## Impact

Consider a 1000-prompt batch in which 200 prompts fail (expired auth, rate limiting, schema mismatch, or a network blip). After the run:

- `statistics.json` shows tool-usage stats only — no failure count, no failure rate.
- `trajectories.jsonl` has 800 entries — operator has to know the input size to discover 200 are missing.
- Per-prompt errors exist nowhere on disk. The only signal is scrolled-past `❌` lines in the run output.

Operators training on `trajectories.jsonl` would silently work with a biased dataset: it contains only the 800 prompts that did not fail, so any kind of prompt that is prone to failure is under-represented.
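Until a failure count is persisted, the only way to notice the gap is to compare the input size against the output. A minimal sketch of that check, assuming both the input dataset and `trajectories.jsonl` are JSONL files with one record per line; the input format and the helper names `count_jsonl_records` and `missing_prompts` are assumptions for illustration and are not part of `batch_runner.py`:

```python
import json
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count the non-empty lines of a JSONL file, validating each as JSON."""
    count = 0
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raise on a corrupt line rather than miscount
            count += 1
    return count


def missing_prompts(input_path: Path, trajectories_path: Path) -> int:
    """Return how many input prompts have no persisted trajectory."""
    return count_jsonl_records(input_path) - count_jsonl_records(trajectories_path)
```

This only reports how many prompts are missing, not why, which is the information the issue asks to persist.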

## Fix

Three small additions:

1. `_process_batch_worker` accumulates a `failed_in_batch` count and an `error_samples` list (capped at 3 per batch).
2. Both are returned in the worker result dict alongside the existing `processed`, `skipped`, and other keys.
3. `BatchRunner.run()` aggregates `total_failed` and the first ~10 `error_samples` into `final_stats` as `prompts_failed`, `failure_rate`, and `error_samples`, and prints a summary line below the existing "Prompts processed this run" line.
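The three steps can be sketched roughly as follows. This is a simplified stand-in, not the actual `batch_runner.py` code: `process_batch` and `aggregate_failures` are hypothetical names, the `failed` result key, the shape of each error sample, and the `failure_rate` denominator (successes plus failures) are assumptions, and `processed` here counts successes only.

```python
from typing import Any, Callable, Dict, List

MAX_SAMPLES_PER_BATCH = 3   # per-batch cap from step 1
MAX_SAMPLES_IN_STATS = 10   # "first ~10" from step 3


def process_batch(
    prompts: List[Any],
    process_one: Callable[[Any], Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical stand-in for _process_batch_worker."""
    processed = 0
    failed_in_batch = 0
    error_samples: List[Dict[str, Any]] = []
    for index, prompt in enumerate(prompts):
        result = process_one(prompt)
        if result["success"]:
            processed += 1  # the real worker would write the trajectory here
            continue
        failed_in_batch += 1
        if len(error_samples) < MAX_SAMPLES_PER_BATCH:
            error_samples.append(
                {"prompt_index": index, "error": str(result.get("error"))}
            )
    return {
        "processed": processed,
        "failed": failed_in_batch,        # assumed key name
        "error_samples": error_samples,
    }


def aggregate_failures(worker_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold per-batch failure data into the keys proposed for final_stats."""
    total_processed = sum(r["processed"] for r in worker_results)
    total_failed = sum(r.get("failed", 0) for r in worker_results)
    samples: List[Dict[str, Any]] = []
    for r in worker_results:
        samples.extend(r.get("error_samples", []))
    attempted = total_processed + total_failed
    return {
        "prompts_failed": total_failed,
        "failure_rate": total_failed / attempted if attempted else 0.0,
        "error_samples": samples[:MAX_SAMPLES_IN_STATS],
    }
```

Capping samples at both levels keeps `statistics.json` small even when most of a run fails, at the cost of only showing the earliest errors.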

Backward-compatible: existing keys in `statistics.json` are untouched; new keys are additive. Existing checkpoint format is unchanged.
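Because the new keys are additive, a consumer can read them defensively and still handle a `statistics.json` written before the change. A minimal sketch, assuming the three key names proposed above; `read_failure_summary` is a hypothetical helper, not an existing function:

```python
import json
from pathlib import Path
from typing import Any, Dict


def read_failure_summary(stats_path: Path) -> Dict[str, Any]:
    """Read the proposed failure keys, tolerating files that predate them."""
    stats = json.loads(stats_path.read_text(encoding="utf-8"))
    return {
        # None means "not recorded by this run", which is different from zero.
        "prompts_failed": stats.get("prompts_failed"),
        "failure_rate": stats.get("failure_rate"),
        "error_samples": stats.get("error_samples", []),
    }
```

Returning `None` rather than `0` for a missing count avoids reporting an older run as failure-free.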

## Out of scope

- Per-prompt error logging to disk beyond the sample. Operators who need full failure detail should add their own logging hook to `_process_single_prompt`; a 100MB error log per run is a different design.
- Retry semantics. `--resume` already retries failed prompts on the next invocation; that's the existing contract.

Filed by hermes-maintainer (PowerCreek). PR incoming.
