fix(cli): graceful fallback when no Win32 console buffer (headless services)#25340

Closed
zw11591-sketch wants to merge 4 commits into NousResearch:main from zw11591-sketch:fix/cli-noconsole-headless-windows-service

Conversation

@zw11591-sketch

Summary

Make cli.py survive on Windows when there is no Win32 console screen buffer attached to the process — the situation every hermes child of a Windows service / scheduled task / DETACHED_PROCESS parent finds itself in. Previously these processes died at the very first colored line, before AIAgent was even constructed.

The fix is a small wrapper, _safe_pt_print(), around _pt_print(_PT_ANSI(text)). It catches prompt_toolkit.output.win32.NoConsoleScreenBufferError and falls back to plain sys.stdout.write. Interactive-terminal behavior is byte-identical to before.

Repro (before this PR)

In any context that has no Win32 console buffer (e.g. an NSSM-wrapped Windows service spawning a child, a Task Scheduler task with no UI session, or subprocess.Popen(..., creationflags=DETACHED_PROCESS|CREATE_NO_WINDOW)):

hermes -p ops chat -q "hi"

…immediately exits with:

File "...\hermes-agent\cli.py", line 1528, in _cprint
    _pt_print(_PT_ANSI(text))
File "...\prompt_toolkit\shortcuts\utils.py", line 111, in print_formatted_text
    output = get_app_session().output
File "...\prompt_toolkit\application\current.py", line 67, in output
    self._output = create_output()
File "...\prompt_toolkit\output\defaults.py", line 91, in create_output
    return Win32Output(stdout, default_color_depth=color_depth_from_env)
File "...\prompt_toolkit\output\win32.py", line 219, in get_win32_screen_buffer_info
    raise NoConsoleScreenBufferError
prompt_toolkit.output.win32.NoConsoleScreenBufferError: No Windows console found. Are you running cmd.exe?

The crash happens at the first _cprint() call inside HermesCLI.chat (_cprint(f"{_DIM}Initializing agent...{_RST}")), so:

  • Initializing agent... is never even printed.
  • AIAgent.__init__ never runs.
  • No agent log lines appear past plugin discovery and .env load.

Real-world impact

This is what motivated the fix. The kanban dispatcher (gateway/run.py -> plugins/kanban/...) launches each worker as hermes -p <profile> chat -q "work kanban task <id>". When the gateway runs under NSSM as a Windows service:

  1. Dispatcher claims the task, spawns the child.
  2. Child crashes within ~1s on _cprint. The PID is gone before any agent activity.
  3. Dispatcher logs crashed=1. It only sees pid X not alive; the dispatcher log has no traceback because the traceback went to the child's vanished stdout.
  4. After max_retries, the task is auto-blocked.
  5. Every kanban task on the board ends up blocked regardless of profile.

End-to-end verification of the fix:

  • A task that had been auto-blocked with Agent crashed 2x: pid not alive was unblocked.
  • On the next dispatcher tick the same hermes -p ops chat -q "..." invocation succeeded.
  • ops worker ran 12 LLM calls and 8 tool calls (including a real docker build + docker run) and called kanban_complete. Task transitioned to done.

What this PR does

cli.py only:

  1. Add a _safe_pt_print(text) helper near the top of the file. It calls _pt_print(_PT_ANSI(text)) inside try/except:
    • On NoConsoleScreenBufferError -> sys.stdout.write(text + "\n"); sys.stdout.flush().
    • On any other Exception from prompt_toolkit / IO -> same fallback. A failed colored print must never kill the agent.
  2. Wrap the NoConsoleScreenBufferError import in try/except with a stub fallback class, so non-Windows installs (where prompt_toolkit.output.win32 may not be importable) still load cli.py cleanly.
  3. Replace the 7 raw _pt_print(_PT_ANSI(text)) call sites inside _cprint() with _safe_pt_print(text). These are the sites that fire on every line printed via _cprint — banner, spinner frames, tool activity prefixes, streamed reasoning, etc.
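
For reviewers who want the shape of items 1 and 2 without opening the diff, here is a minimal sketch. It is not the code in this PR: the name safe_pt_print, the injected pt_print callable, the out parameter, and the boolean return value are assumptions made so the snippet runs without a console and without prompt_toolkit installed. In cli.py the styled call is _pt_print(_PT_ANSI(text)).

```python
import sys

try:
    # Usable only where prompt_toolkit's Win32 backend can load.
    from prompt_toolkit.output.win32 import NoConsoleScreenBufferError
except Exception:
    # Off Windows (or without prompt_toolkit) fall back to a stub so the
    # except clause below still names a real exception class.
    class NoConsoleScreenBufferError(Exception):
        """Stub standing in for prompt_toolkit's exception."""


def safe_pt_print(text, pt_print, out=None):
    """Try the styled print; on any failure write plain text instead.

    pt_print is a hypothetical stand-in for the styled print call.
    Returns True if the styled path worked, False if the fallback ran.
    """
    stream = sys.stdout if out is None else out
    try:
        pt_print(text)
        return True
    except NoConsoleScreenBufferError:
        pass  # no Win32 console buffer: expected under services
    except Exception:
        pass  # a failed colored print must never kill the agent
    stream.write(text + "\n")
    stream.flush()
    return False
```

The two except branches are kept separate only to mirror the list above; they could be collapsed into one, since both take the same fallback.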

The single remaining _pt_print(_PT_ANSI(str(line))) call site outside _cprint (the body of _emit_lines_to_pt, which is only reached after a prompt_toolkit Application is already running) is not changed in this PR — the issue does not surface there in practice and I'd rather keep the diff focused on the actually-broken path. Happy to extend if reviewers prefer.

Diff stats: cli.py | 58 ++++++++++++++++++++++++++++++++++++++++------- (51 insertions, 7 modifications, 0 deletions of pre-existing logic).

Why a wrapper instead of patching prompt_toolkit / forcing a fake console

  • Wrapping _pt_print is the smallest possible change that fixes the crash without touching prompt_toolkit internals or assuming anything about the runtime.
  • Fake-console approaches (winpty, pywinpty, AllocConsole) would have to be applied in every caller that ever spawns hermes as a child — gateway, dispatcher, NSSM wrappers, user scripts. The wrapper localizes the fix to cli.py.
  • The wrapper is a no-op on the happy path: the inner call is the original _pt_print(_PT_ANSI(text)). Attached terminals (interactive cmd.exe, Windows Terminal, mintty/MSYS, conhost) hit the inner call and exit the try immediately — no measurable overhead.

Risks

  • If the interactive path ever raised a non-NoConsoleScreenBufferError exception that the surrounding _cprint code relied on (e.g. via the except Exception branches at L1556, L1612), the new wrapper will swallow it and fall back to sys.stdout.write instead of letting the existing fallback _pt_print(_PT_ANSI(text)) run. In practice every existing _cprint fallback path already ends in _pt_print(_PT_ANSI(text)) — the wrapper just makes those last-resort prints themselves crash-safe. I think this is strictly better but it's the one behavior change worth a second pair of eyes.
  • The wrapper imports sys under an alias (_sys_for_pt_fallback) to make absolutely sure no later import sys as ... shadow can break the fallback. Aesthetically ugly; happy to drop the alias if reviewers prefer.

Testing

Manual:

  • Repro: subprocess.Popen([hermes_exe, "-p", "ops", "chat", "-q", "..."], creationflags=DETACHED_PROCESS|CREATE_NO_WINDOW, stdin=DEVNULL, stdout=PIPE, stderr=PIPE) — succeeds, Initializing agent... and the full agent response appear in stdout, exit code 0. Without the patch the same call exits non-zero with the traceback above.
  • Real workload: kanban dispatcher under NSSM service runs a docker build task end-to-end via the ops profile.
  • Interactive: hermes in Windows Terminal, mintty (Git Bash), and cmd.exe — banner/colors render identically to main.
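
The first manual check can be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the getattr defaults exist only so the snippet also runs on non-Windows hosts, where the creation-flag constants are not defined and no console is detached.

```python
import subprocess
import sys


def no_console_flags():
    """Creation flags from the repro; 0 where the constants do not exist."""
    detached = getattr(subprocess, "DETACHED_PROCESS", 0)
    no_window = getattr(subprocess, "CREATE_NO_WINDOW", 0)
    return detached | no_window


def run_without_console(argv, timeout=120):
    """Run argv with no console attached and capture its output."""
    proc = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        creationflags=no_console_flags(),
        timeout=timeout,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in child; on a patched Windows install, swap in something like
    # ["hermes", "-p", "ops", "chat", "-q", "hi"] (assumes hermes is on PATH).
    code, out, err = run_without_console([sys.executable, "-c", "print('hi')"])
    print(code, out.strip(), err.strip())
```

With the hermes command substituted on Windows, a non-zero return code plus the traceback above in stderr indicates the unpatched behavior.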

I haven't added a unit test because exercising NoConsoleScreenBufferError reliably needs a non-trivial subprocess harness (the import itself works fine inside pytest — only create_output() blows up, and only when sys.stdout lacks a real console handle). Happy to add one in tests/ if reviewers point me at the right shape.

Environment

  • Windows 11
  • Python 3.12
  • prompt_toolkit: whichever version ships with the current requirements
  • hermes-agent rebased on dd5a9502e (current main at PR open)

zw11591-sketch and others added 2 commits May 14, 2026 09:37
…rvices)

When hermes runs as a child of a Windows service (NSSM, scheduled task,
or any LocalSystem context), the child process inherits no Win32 console
screen buffer. prompt_toolkit's print_formatted_text() lazily calls
create_output() -> Win32Output.__init__ -> get_win32_screen_buffer_info()
which raises NoConsoleScreenBufferError. The exception was uncaught at
every _pt_print(_PT_ANSI(text)) call site inside _cprint(), so the very
first colored line ("Initializing agent..." at the top of HermesCLI.chat)
killed the process before AIAgent could even be constructed.

This blocked the kanban dispatcher running under a Windows service: every
spawned child agent died at startup, the dispatcher reclaimed the lock,
retried, hit max_retries, and auto-blocked the task. Repro:

    # In a service / no-console context (DETACHED_PROCESS|CREATE_NO_WINDOW):
    hermes -p <profile> chat -q "hi"
    # -> prompt_toolkit.output.win32.NoConsoleScreenBufferError
    # -> exit before agent loop runs

Fix: introduce _safe_pt_print() that wraps _pt_print(_PT_ANSI(text)) in
a try/except. On NoConsoleScreenBufferError (or any other prompt_toolkit
runtime failure) it falls back to plain sys.stdout.write + flush. Replace
the 7 raw _pt_print(_PT_ANSI(text)) call sites inside _cprint() with
_safe_pt_print(text). Behavior in normal interactive terminals is
unchanged - the wrapped path is identical to the prior call.

The NoConsoleScreenBufferError import is itself wrapped in try/except so
non-Windows installs (where prompt_toolkit.output.win32 doesn't exist as
a usable module) still import cli.py cleanly.

Verified end-to-end: blocked kanban task t_3231a03f went from 'crashed
2x: pid not alive' to 'done' in one dispatcher cycle, with the spawned
ops profile completing 12 LLM calls and 8 tool calls including a
successful docker build.
@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/cli (CLI entry point, hermes_cli/, setup wizard) labels May 14, 2026
@alt-glitch

Collaborator

Likely duplicate of #22482 (same fix for #22445 — Win32 NoConsoleScreenBufferError fallback). Also overlaps with #22928 (--no-gui flag approach). #23011 was already closed as dup of #22482.

Resolve cli.py conflicts in print_above_input_no_redraw():
- Keep _safe_pt_print() wrapper (handles NoConsoleScreenBufferError for
  headless Windows services without Win32 console buffer).
- Preserve main's ensure_future fix for run_in_terminal coroutine
  (NousResearch#23185 Bug A) by wrapping _safe_pt_print inside the run_in_terminal
  lambda instead of _pt_print(_PT_ANSI(...)).

Both fixes compose cleanly: headless services still degrade to bare
print(), and cross-thread emissions no longer drop output silently.

Verified: tests/cli/test_cprint_bg_thread.py (13/13),
test_cli_init.py + test_cli_force_redraw.py (47/47).
@teknium1

Contributor

Automated hermes-sweeper review found this is implemented on current main.

Evidence:

  • cli.py:2224 now wraps the direct _pt_print(_PT_ANSI(text)) path in _cprint() with try/except and falls back to plain print(text) when prompt_toolkit output setup fails. The inline comment explicitly covers redirected subprocess worker logging and NoConsoleScreenBufferError / OSError.
  • cli.py:9847 still sends the reported startup line, Initializing agent..., through _cprint(), so the current fallback covers the crash that happened before AIAgent construction.
  • hermes_cli/kanban_db.py:6828 still matches the kanban worker scenario from the PR: workers are spawned with stdout=log_f, stderr=subprocess.STDOUT, and CREATE_NO_WINDOW on Windows.
  • The overlapping fix merged via #28330, fix(windows): handle redirected stdout in _cprint fallback (#28083), as commit 8bcb6082acf83db8e199f1065fa065340d82de0d and is included in v2026.5.28 and later.

Thanks for the detailed repro and verification notes. The prior duplicate discussion in this PR (#22482 / #22445 / #22928 / #23011) was useful in confirming this is the same bug class now covered on main.

@teknium1 closed this Jun 12, 2026
@teknium1 added the sweeper:implemented-on-main (Sweeper: behavior already present on current main) label Jun 12, 2026
Labels

comp/cli (CLI entry point, hermes_cli/, setup wizard); P2 (Medium: degraded but workaround exists); sweeper:implemented-on-main (Sweeper: behavior already present on current main); type/bug (Something isn't working)

3 participants