Context
Auditing the batch_runner surface per the gateway/cron/batch_runner sweep. Found a real diagnostic gap: failed prompts are only printed during execution and never persisted.
Looking at the flow in batch_runner.py:
_process_single_prompt (line 244) returns {success: False, error: <msg>} on failure.
_process_batch_worker (line 400) checks result["success"]. Only successes are written to batch_N.jsonl (line 486). Failures get a stdout ❌ Prompt N failed and are dropped — the error string isn't persisted anywhere.
- The worker return dict (line 516) tracks
processed, skipped, tool_stats, completed_prompts — but NOT failed count and NOT error samples.
- The final consolidation (line 1026+) reads only the success-jsonl entries, computes tool-usage stats, and writes
statistics.json (line 1073) WITHOUT any failure metric.
Impact
Run a 1000-prompt batch where 200 prompts fail due to (auth expired / rate-limit / schema mismatch / network blip). After the run:
statistics.json shows tool-usage stats only — no failure count, no failure rate.
trajectories.jsonl has 800 entries — operator has to know the input size to discover 200 are missing.
- Per-prompt errors exist nowhere on disk. The only signal is scrolled-past
❌ lines in the run output.
Operators training on trajectories.jsonl would silently work with a biased dataset (the 800 prompts that DIDN'T hit a transient failure are over-represented).
Fix
Three small additions:
_process_batch_worker accumulates failed_in_batch count + an error_samples list (capped at 3 per batch).
- Returned in the worker result dict alongside the existing
processed / skipped / etc.
BatchRunner.run() aggregates total_failed and the first ~10 error_samples into final_stats as prompts_failed + failure_rate + error_samples, and prints a summary line below the existing "Prompts processed this run" line.
Backward-compatible: existing keys in statistics.json are untouched; new keys are additive. Existing checkpoint format is unchanged.
Out of scope
- Per-prompt error logging to disk beyond the sample. Operators who need full failure detail should add their own logging hook to
_process_single_prompt; a 100MB error log per run is a different design.
- Retry semantics.
--resume already retries failed prompts on the next invocation; that's the existing contract.
Filed by hermes-maintainer (PowerCreek). PR incoming.
Context
Auditing the batch_runner surface per the gateway/cron/batch_runner sweep. Found a real diagnostic gap: failed prompts are only printed during execution and never persisted.
Looking at the flow in
batch_runner.py:_process_single_prompt(line 244) returns{success: False, error: <msg>}on failure._process_batch_worker(line 400) checksresult["success"]. Only successes are written tobatch_N.jsonl(line 486). Failures get a stdout❌ Prompt N failedand are dropped — theerrorstring isn't persisted anywhere.processed,skipped,tool_stats,completed_prompts— but NOTfailedcount and NOTerrorsamples.statistics.json(line 1073) WITHOUT any failure metric.Impact
Run a 1000-prompt batch where 200 prompts fail due to (auth expired / rate-limit / schema mismatch / network blip). After the run:
statistics.jsonshows tool-usage stats only — no failure count, no failure rate.trajectories.jsonlhas 800 entries — operator has to know the input size to discover 200 are missing.❌lines in the run output.Operators training on
trajectories.jsonlwould silently work with a biased dataset (the 800 prompts that DIDN'T hit a transient failure are over-represented).Fix
Three small additions:
_process_batch_workeraccumulatesfailed_in_batchcount + anerror_sampleslist (capped at 3 per batch).processed/skipped/ etc.BatchRunner.run()aggregatestotal_failedand the first ~10error_samplesintofinal_statsasprompts_failed+failure_rate+error_samples, and prints a summary line below the existing "Prompts processed this run" line.Backward-compatible: existing keys in
statistics.jsonare untouched; new keys are additive. Existing checkpoint format is unchanged.Out of scope
_process_single_prompt; a 100MB error log per run is a different design.--resumealready retries failed prompts on the next invocation; that's the existing contract.Filed by hermes-maintainer (PowerCreek). PR incoming.