fix: guard print() calls in run_conversation() against OSError when stdout is unavailable (systemd/headless) #845

@GeorgL0ngGamma

Problem

When hermes-agent runs as a systemd service (StandardOutput=journal) and the journal pipe becomes unavailable (idle timeout, buffer exhaustion, socket reset), any print() call inside run_conversation() raises OSError: [Errno 5] Input/output error. This is a realistic production condition for any headless daemon deployment (systemd, Docker, nohup).

Two calls in run_agent.py sit in the critical failure path for cron jobs running with quiet_mode=True:

Line ~4062 — inside quiet_mode branch

if self.quiet_mode:
    clean = self._strip_think_blocks(turn_content).strip()
    if clean:
        print(f"  ┊ 💬 {clean}")  # raises OSError when stdout pipe is broken

This fires during any tool-calling turn when the model produces intermediate commentary. The OSError becomes the exception e caught by the outer except Exception handler.

Line ~4228 — in except Exception error handler

except Exception as e:
    error_msg = f"Error during OpenAI-compatible API call #{api_call_count}: {str(e)}"
    print(f"❌ {error_msg}")  # also raises OSError — now propagates out of run_conversation()

When the OSError from line ~4062 arrives here as e, this second print() raises OSError as well. Because the second error is raised inside the handler itself, nothing catches it: it propagates out of run_conversation() entirely, the cron scheduler marks the job as status: "error", and the agent's completed work is never delivered.
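The double failure can be reproduced without systemd. The following is a minimal sketch, not the real run_conversation(): BrokenStdout, run_conversation_sketch() and reproduce() are hypothetical names used only for illustration, and the broken journal pipe is simulated by a stdout replacement whose write() raises EIO.

```python
import errno
import sys


class BrokenStdout:
    """Hypothetical stand-in for a stdout whose underlying pipe has gone away."""

    def write(self, text):
        raise OSError(errno.EIO, "Input/output error")

    def flush(self):
        pass


def run_conversation_sketch():
    """Simplified shape of the failure path; not the real run_conversation()."""
    try:
        # First print(): raises OSError because stdout is broken.
        print("  ┊ 💬 intermediate commentary")
    except Exception as e:
        error_msg = f"Error during API call: {e}"
        # Second print(): raises OSError again, this time inside the handler,
        # so nothing catches it and it escapes the function.
        print(f"❌ {error_msg}")
        return error_msg


def reproduce():
    """Run the sketch with a broken stdout and report the escaping errno."""
    original = sys.stdout
    sys.stdout = BrokenStdout()
    try:
        run_conversation_sketch()
    except OSError as exc:
        return exc.errno
    finally:
        sys.stdout = original
    return None
```

Calling reproduce() returns errno.EIO, which shows that the except Exception block does not contain the failure when the handler itself prints.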

Additional unguarded print() calls in the same hot loop

The same pattern exists at several points not gated by quiet_mode, reachable during any cron job run:

| Approx. line | Triggered by |
| --- | --- |
| ~4064 | Model context length discovery (first run per model) |
| ~4108 | Interrupt received during API call |
| ~4153–4161 | Any API retry (rate limit, timeout, network error); most likely in production |
| ~4166 | Interrupt detected during retry error handling |
| ~4458 | All API retries exhausted |

Lines ~4153–4161 are the highest-risk: they fire on every transient API error (rate limits, network timeouts), which are common in production.

Observed failure

Confirmed on a deployment running as a systemd user service (StandardOutput=journal). Cron jobs scheduled at 06:00 and 13:00 UTC (when the system is idle and the journal pipe is stale) fail consistently with this traceback in the output file:

File "run_agent.py", line 4062, in run_conversation
    print(f"  ┊ 💬 {clean}")
OSError: [Errno 5] Input/output error

During handling of the above exception, another exception occurred:

File "run_agent.py", line 4228, in run_conversation
    print(f"❌ {error_msg}")
OSError: [Errno 5] Input/output error

The same jobs run successfully at 22:00 UTC when the system has active user sessions and the journal pipe is healthy — confirming this is an environmental stdout availability issue, not a logic bug.

Fix

Wrap each affected print() in try/except OSError, falling back to logger.error() for calls inside error handlers (where losing the message would hide the root cause):

# Cosmetic lines (quiet_mode display, status messages) — silent drop is fine:
try:
    print(f"  ┊ 💬 {clean}")
except OSError:
    pass

# Error handler lines — must not lose the message:
try:
    print(f"❌ {error_msg}")
except OSError:
    logger.error(error_msg)

# API retry error block (~4153–4161) — same pattern:
try:
    print(f"{self.log_prefix}⚠️  API call failed ...")
except OSError:
    logger.warning(...)
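Since the same try/except pattern repeats at every call site, it could also be factored into one helper. This is only a sketch of that idea: safe_print() and its fallback_level parameter are hypothetical and do not exist in run_agent.py, and the module-level logger is assumed to be a standard logging.Logger.

```python
import logging

logger = logging.getLogger(__name__)


def safe_print(message, *, fallback_level=None, **print_kwargs):
    """Print message, tolerating an unavailable stdout.

    If print() raises OSError and fallback_level is given (for example
    logging.ERROR), the message is sent to the logger instead of being lost.
    Returns True if the message was printed, False otherwise.
    """
    try:
        print(message, **print_kwargs)
        return True
    except OSError:
        if fallback_level is not None:
            logger.log(fallback_level, message)
        return False
```

Cosmetic call sites would call safe_print(text) and accept the silent drop, while error handlers would pass fallback_level=logging.ERROR (or logging.WARNING for the retry block) so the root cause still reaches the log.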

Related issues

This is more severe than those issues because it actively crashes the job rather than just losing a log line.
