batch_runner: per-prompt failures are invisible after a run — statistics.json has tool-usage but no failure count #28

@PowerCreek

Context

Auditing the batch_runner surface per the gateway/cron/batch_runner sweep. Found a real diagnostic gap: failed prompts are only printed during execution and never persisted.

Looking at the flow in batch_runner.py:

  • _process_single_prompt (line 244) returns {success: False, error: <msg>} on failure.
  • _process_batch_worker (line 400) checks result["success"]. Only successes are written to batch_N.jsonl (line 486). Failures print "❌ Prompt N failed" to stdout and are then dropped; the error string isn't persisted anywhere.
  • The worker return dict (line 516) tracks processed, skipped, tool_stats, completed_prompts, but neither a failed count nor any error samples.
  • The final consolidation (line 1026+) reads only the success-jsonl entries, computes tool-usage stats, and writes statistics.json (line 1073) WITHOUT any failure metric.

Impact

Consider a 1000-prompt batch in which 200 prompts fail (expired auth, rate limiting, schema mismatch, or a network blip). After the run:

  • statistics.json shows tool-usage stats only — no failure count, no failure rate.
  • trajectories.jsonl has 800 entries — operator has to know the input size to discover 200 are missing.
  • Per-prompt errors exist nowhere on disk. The only signal is scrolled-past lines in the run output.

Operators training on trajectories.jsonl would silently work with a biased dataset: only the 800 prompts that didn't hit a failure are present, so any kind of prompt that tends to trigger failures is under-represented.
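For reference, this is roughly the manual check an operator has to do today to notice the gap. It is a minimal sketch that assumes trajectories.jsonl holds one JSON object per line, one line per successful prompt; `count_missing` is a hypothetical helper, not something in batch_runner.py.

```python
from pathlib import Path


def count_missing(trajectories_path: str, expected_prompts: int) -> int:
    """Return how many input prompts have no trajectory on disk.

    Assumes one non-empty line per successful prompt. The caller must
    already know expected_prompts, which is exactly the problem: nothing
    in the run output records it next to the failure count.
    """
    with Path(trajectories_path).open(encoding="utf-8") as fh:
        written = sum(1 for line in fh if line.strip())
    return expected_prompts - written
```

Even when this check is run, it only yields a count. It cannot say why the prompts are missing, and intentionally skipped prompts would be indistinguishable from failed ones.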

Fix

Three small additions:

  1. _process_batch_worker accumulates a failed_in_batch count and an error_samples list (capped at 3 per batch).
  2. Both are returned in the worker result dict alongside the existing processed, skipped, and other keys.
  3. BatchRunner.run() aggregates total_failed and the first ~10 error_samples into final_stats as prompts_failed + failure_rate + error_samples, and prints a summary line below the existing "Prompts processed this run" line.
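The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `process_batch` and `aggregate` are hypothetical stand-ins for _process_batch_worker and the aggregation in BatchRunner.run(), `processed` is assumed to count successes only, and failure_rate is assumed to mean failed / (succeeded + failed).

```python
from typing import Any, Callable, Dict, List

MAX_SAMPLES_PER_BATCH = 3   # cap from step 1
MAX_SAMPLES_TOTAL = 10      # "first ~10" from step 3


def process_batch(prompts: List[str],
                  process_one: Callable[[str], Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical stand-in for _process_batch_worker (steps 1 and 2)."""
    processed = 0
    failed_in_batch = 0
    error_samples: List[Dict[str, Any]] = []
    for index, prompt in enumerate(prompts):
        result = process_one(prompt)
        if result.get("success"):
            processed += 1
            continue
        failed_in_batch += 1
        # Keep only the first few errors so the result dict stays small.
        if len(error_samples) < MAX_SAMPLES_PER_BATCH:
            error_samples.append({"prompt_index": index,
                                  "error": str(result.get("error"))})
    return {"processed": processed,
            "failed_in_batch": failed_in_batch,
            "error_samples": error_samples}


def aggregate(batch_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical stand-in for the aggregation in BatchRunner.run() (step 3)."""
    total_processed = sum(r["processed"] for r in batch_results)
    total_failed = sum(r["failed_in_batch"] for r in batch_results)
    samples = [s for r in batch_results for s in r["error_samples"]]
    attempted = total_processed + total_failed
    return {
        "prompts_failed": total_failed,
        # Assumed definition; guard against an empty run.
        "failure_rate": total_failed / attempted if attempted else 0.0,
        "error_samples": samples[:MAX_SAMPLES_TOTAL],
    }
```

Capping samples in the worker, not only in the aggregator, keeps the data passed back from each worker bounded regardless of how many prompts in a batch fail.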

Backward-compatible: existing keys in statistics.json are untouched; new keys are additive. Existing checkpoint format is unchanged.
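Because the new keys are additive, a consumer can read statistics.json from both old and new runs without branching on a version. A minimal sketch, assuming the key names proposed above; `read_failure_summary` is a hypothetical helper on the consumer side.

```python
import json
from typing import Any, Dict


def read_failure_summary(stats_json: str) -> Dict[str, Any]:
    """Extract the proposed failure keys from a statistics.json payload.

    Files written before this change simply lack the keys, so None is
    returned for the counters to mean "unknown", not "zero failures".
    """
    stats = json.loads(stats_json)
    return {
        "prompts_failed": stats.get("prompts_failed"),
        "failure_rate": stats.get("failure_rate"),
        "error_samples": stats.get("error_samples", []),
    }
```

Returning None instead of 0 for older files avoids reporting a clean run when the failure count was never recorded.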

Out of scope

  • Per-prompt error logging to disk beyond the sample. Operators who need full failure detail should add their own logging hook to _process_single_prompt; a 100MB error log per run is a different design.
  • Retry semantics. --resume already retries failed prompts on the next invocation; that's the existing contract.

Filed by hermes-maintainer (PowerCreek). PR incoming.
