Windows: hermes gateway crashes on stale gateway.pid after abrupt shutdown (os.kill(pid, 0) -> OSError) #7227

@Yogioo

Summary

On Windows, hermes gateway can crash on startup after an abrupt shutdown/power-off because a stale gateway.pid is left behind and gateway/status.py:get_running_pid() uses os.kill(pid, 0) as the liveness probe.

When the PID is stale, Windows may raise a generic OSError instead of ProcessLookupError. In my case it raised WinError 11; probing the same stale PID from Git Bash also produced WinError 87 (The parameter is incorrect). Because get_running_pid() only treats ProcessLookupError / PermissionError as stale, the startup path crashes before it can remove the old PID file.

Deleting the stale gateway.pid / gateway_state.json immediately fixes startup.

Environment

  • Hermes Agent: v0.8.0 (2026.4.8)
  • Python: 3.12.11
  • OS: Windows 11 10.0.22621

Repro

  1. Start gateway normally.
  2. Hard power-off / abrupt shutdown so Hermes cannot clean up gateway.pid.
  3. Boot again and run:
    hermes gateway
  4. Startup crashes while checking the old PID.

Observed traceback

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\venv\Scripts\hermes.exe\__main__.py", line 10, in <module>
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\hermes_cli\main.py", line 5671, in main
    args.func(args)
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\hermes_cli\main.py", line 670, in cmd_gateway
    gateway_command(args)
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\hermes_cli\gateway.py", line 2302, in gateway_command
    run_gateway(verbose, quiet=quiet, replace=replace)
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\hermes_cli\gateway.py", line 1341, in run_gateway
    success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\12737\AppData\Roaming\uv\python\cpython-3.12.11-windows-x86_64-none\Lib\asyncio\runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\12737\AppData\Roaming\uv\python\cpython-3.12.11-windows-x86_64-none\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\12737\AppData\Roaming\uv\python\cpython-3.12.11-windows-x86_64-none\Lib\asyncio\base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\gateway\run.py", line 7717, in start_gateway
    existing_pid = get_running_pid()
                   ^^^^^^^^^^^^^^^^^
  File "C:\Users\12737\AppData\Local\hermes\hermes-agent\gateway\status.py", line 400, in get_running_pid
    os.kill(pid, 0)  # signal 0 = existence check, no actual signal sent
    ^^^^^^^^^^^^^^^
OSError: [WinError 11] An attempt was made to load a program with an incorrect format

Why this seems wrong

Current code in gateway/status.py:get_running_pid():

  • reads gateway.pid
  • calls os.kill(pid, 0)
  • only treats ProcessLookupError / PermissionError as “not running”

On Windows, stale/non-probeable PIDs can also raise plain OSError (WinError 11, WinError 87, possibly others). That should not hard-crash gateway startup.
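
The gap can be illustrated without a Windows machine. The following is a minimal sketch, not the actual Hermes code: `is_running_narrow`, `fake_missing_probe`, and `fake_stale_probe` are hypothetical names, and the fake probes only stand in for `os.kill`. It shows that an `except` clause limited to the two subclasses lets a plain OSError propagate.

```python
import errno
import os


def is_running_narrow(pid, kill=os.kill):
    """Mimic the handling described above: only two OSError subclasses
    are treated as "not running"; anything else propagates."""
    try:
        kill(pid, 0)
    except (ProcessLookupError, PermissionError):
        return False
    return True


def fake_missing_probe(pid, sig):
    # Hypothetical stand-in for the POSIX-style "no such process" result.
    raise ProcessLookupError(errno.ESRCH, "No such process")


def fake_stale_probe(pid, sig):
    # Hypothetical stand-in for the Windows failure: EINVAL has no dedicated
    # subclass, so this is a plain OSError, like the one in the traceback.
    raise OSError(errno.EINVAL, "The parameter is incorrect")


if __name__ == "__main__":
    print(is_running_narrow(1234, kill=fake_missing_probe))  # prints False
    try:
        is_running_narrow(1234, kill=fake_stale_probe)
    except OSError as exc:
        print(f"not caught by the narrow except clause: {exc!r}")
```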

Workaround

Delete:

  • %LOCALAPPDATA%\hermes\gateway.pid
  • %LOCALAPPDATA%\hermes\gateway_state.json

After deleting those files, hermes gateway starts normally again.
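
For anyone who prefers to script the workaround, here is a small sketch that removes the two files listed above. The helper name `remove_stale_gateway_files` is hypothetical, and the script assumes the files live directly under `%LOCALAPPDATA%\hermes` as in this report. Only run it while no gateway is actually running.

```python
import os
from pathlib import Path

# File names taken from the workaround above.
STALE_FILES = ("gateway.pid", "gateway_state.json")


def remove_stale_gateway_files(base_dir: Path) -> list[Path]:
    """Delete the gateway PID/state files in base_dir and return what was removed."""
    removed = []
    for name in STALE_FILES:
        target = base_dir / name
        if target.exists():
            target.unlink()
            removed.append(target)
    return removed


if __name__ == "__main__":
    local_app_data = os.environ.get("LOCALAPPDATA")
    if local_app_data is None:
        print("LOCALAPPDATA is not set; nothing to do")
    else:
        for path in remove_stale_gateway_files(Path(local_app_data) / "hermes"):
            print(f"removed {path}")
```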

Suggested fix

Make PID liveness checks Windows-safe. At minimum, get_running_pid() should treat any OSError raised by os.kill(pid, 0) on Windows as a sign that the PID is stale, and remove the PID file rather than crashing.
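
A minimal sketch of that "at minimum" behavior follows. It is not the real get_running_pid(): the name `get_running_pid_sketch`, the `pid_file` and `kill` parameters, and the assumption that gateway.pid holds a bare integer are all illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def get_running_pid_sketch(pid_file: Path, kill=os.kill) -> Optional[int]:
    """Return the PID from pid_file if the probe succeeds, else None.

    Assumption: the PID file contains a single integer. Any OSError from
    the probe is treated as "stale" and the PID file is removed.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file, or contents are not a plain integer

    try:
        kill(pid, 0)
    except OSError:
        # ProcessLookupError and PermissionError are OSError subclasses, so
        # this also covers the plain OSError cases (WinError 11, WinError 87).
        pid_file.unlink(missing_ok=True)
        return None
    return pid
```

One design caveat, offered as a pointer rather than a verified diagnosis of this crash: the Python documentation for os.kill describes different semantics on Windows, where signal.CTRL_C_EVENT has the value 0, so signal 0 may not behave as a POSIX-style existence check there. A dedicated Windows probe may be a more robust long-term fix than widening the except clause alone.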

It may also be worth auditing the other os.kill(pid, 0) probes in gateway/profile code for the same assumption.
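
As a starting point for that audit, here is a rough sketch that lists lines containing an os.kill(<something>, 0) call under a source tree. The function name `find_signal_zero_probes` is hypothetical, and the regex is deliberately simple: it will miss calls whose first argument contains parentheses or commas, and calls split across lines.

```python
import re
import sys
from pathlib import Path

# Matches e.g. "os.kill(pid, 0)" or "os.kill(self.pid, 0)"; not "os.kill(pid, 15)".
PROBE = re.compile(r"os\.kill\(\s*[^,()]+,\s*0\s*\)")


def find_signal_zero_probes(root: Path):
    """Return (path, line_number, stripped_line) for each matching line under root."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PROBE.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for path, lineno, line in find_signal_zero_probes(root):
        print(f"{path}:{lineno}: {line}")
```

Run it from the repository root, or pass the checkout path as the first argument; each hit is a call site to review for the same narrow exception handling.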

Labels

  • P2 (Medium: degraded but workaround exists)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • sweeper:implemented-on-main (Sweeper: behavior already present on current main)
  • type/bug (Something isn't working)
