Windows gateway reliability: replace schtasks with a real Windows Service (pywin32 + SCM RecoveryActions) #40899

@teknium1

Background

The Hermes gateway on Windows currently has no robust auto-restart on crash. The login Startup .cmd covers reboot/login persistence. A CREATE_BREAKAWAY_FROM_JOB fix (PR #TBD) covers one specific case: a GUI update SIGTERMs the gateway, and the watcher survives Electron's job teardown to respawn it. If the gateway dies mid-session for any other reason (OOM, taskkill, native crash), the user must run hermes gateway start manually.

On Linux this is solved by systemd Restart=always, and on macOS by launchd KeepAlive. On Windows the equivalent is a real Windows Service registered with the Service Control Manager (SCM), with SCM-driven RecoveryActions for auto-restart. Tailscale does exactly this; see cmd/tailscaled/install_windows.go:

ra := []mgr.RecoveryAction{
    {mgr.ServiceRestart, 1 * time.Second},
    {mgr.ServiceRestart, 2 * time.Second},
    {mgr.ServiceRestart, 4 * time.Second},
    {mgr.ServiceRestart, 9 * time.Second},
    {mgr.ServiceRestart, 16 * time.Second},
    {mgr.ServiceRestart, 25 * time.Second},
    {mgr.ServiceRestart, 36 * time.Second},
    {mgr.ServiceRestart, 49 * time.Second},
    {mgr.ServiceRestart, 64 * time.Second},
}
service.SetRecoveryActions(ra, /*resetPeriodSecs=*/60)

Services run in session 0, so no console window can appear on the user's desktop. This sidesteps the Task Scheduler conhost-flash issue entirely.
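For comparison, here is a minimal Python sketch of the same schedule as the Tailscale table above, shaped for pywin32. The helper names (backoff_delays_s, failure_actions, apply_failure_actions) are hypothetical. The dict layout passed to win32service.ChangeServiceConfig2 follows my understanding of pywin32's SERVICE_FAILURE_ACTIONS mapping and should be checked against the pywin32 docs before relying on it.

```python
from typing import Any, Dict, List

# Value of SC_ACTION_RESTART in winsvc.h; pywin32 exposes the same
# constant as win32service.SC_ACTION_RESTART.
SC_ACTION_RESTART = 1


def backoff_delays_s(max_root: int = 8) -> List[int]:
    """Restart delays in seconds: 1, 2, then the squares 4, 9, ..., max_root**2."""
    return [1, 2] + [n * n for n in range(2, max_root + 1)]


def failure_actions(reset_period_s: int = 60) -> Dict[str, Any]:
    """Build the failure-actions mapping (delays are in milliseconds)."""
    return {
        "ResetPeriod": reset_period_s,  # seconds before the failure count resets
        "RebootMsg": "",
        "Command": "",
        "Actions": [(SC_ACTION_RESTART, d * 1000) for d in backoff_delays_s()],
    }


def apply_failure_actions(service_name: str) -> None:
    """Windows-only: apply the schedule to an installed service (needs admin)."""
    import win32service  # imported lazily so the helpers above work anywhere

    scm = win32service.OpenSCManager(None, None, win32service.SC_MANAGER_ALL_ACCESS)
    try:
        svc = win32service.OpenService(
            scm, service_name, win32service.SERVICE_ALL_ACCESS
        )
        try:
            win32service.ChangeServiceConfig2(
                svc, win32service.SERVICE_CONFIG_FAILURE_ACTIONS, failure_actions()
            )
        finally:
            win32service.CloseServiceHandle(svc)
    finally:
        win32service.CloseServiceHandle(scm)
```

The schedule is kept as plain data so it can be unit-tested off Windows; only apply_failure_actions touches the SCM.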

Why not Task Scheduler

We tried a per-minute schtasks supervisor task in #PR. It works functionally but flashes a console window that steals focus on every firing, even with all the documented mitigations (GUI-subsystem pythonw direct invoke, XML <Hidden>true</Hidden>, InteractiveToken, LeastPrivilege). Looking at prior art:

  • openclaw uses a VBS Run(..., 0, False) wrapper. This suppresses the window, but Super User Q971162 confirms focus-steal still occurs in some cases.
  • Ollama doesn't use Task Scheduler at all — GUI tray exe with a Startup shortcut and an internal monitor+worker pair.
  • Tailscale uses a real Windows Service (this issue).
  • Syncthing uses --no-console flag + Startup folder.

Task Scheduler is the wrong tool for "Restart=always" on Windows.

What this issue tracks

Refactor hermes gateway install on Windows to register the gateway as a real Windows Service using pywin32's win32serviceutil.ServiceFramework, with SCM RecoveryActions for auto-restart. Concretely:

  1. Add a WindowsGatewayService(win32serviceutil.ServiceFramework) subclass that wraps the existing gateway run entrypoint.
  2. hermes gateway install on Windows calls win32serviceutil.InstallService, then sets recovery actions (quadratic backoff like Tailscale).
  3. hermes gateway start|stop|restart|status route to sc.exe / SCM APIs instead of (or in addition to) schtasks for service-mode installs.
  4. Preserve the Startup .cmd fallback for users without admin rights to install a service.
  5. Decide whether to run in session 0 (no GUI access, simplest) or in the session 1 user context (requires SCM session brokering; more complex, but lets the gateway show notifications etc.). Tailscale runs in session 0; we may need session 1 because the gateway tools can include GUI helpers.
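As a rough illustration of item 3, the start|stop|restart|status verbs could map onto sc.exe invocations along these lines. This is a sketch only: sc_commands and run_gateway_action are hypothetical names, and the service name is passed in rather than assumed.

```python
import subprocess
from typing import List


def sc_commands(action: str, service_name: str) -> List[List[str]]:
    """Translate a gateway verb into the sc.exe argv lists to run, in order."""
    table = {
        "start": [["sc.exe", "start", service_name]],
        "stop": [["sc.exe", "stop", service_name]],
        # sc.exe has no restart verb, so restart is stop followed by start.
        "restart": [
            ["sc.exe", "stop", service_name],
            ["sc.exe", "start", service_name],
        ],
        "status": [["sc.exe", "query", service_name]],
    }
    if action not in table:
        raise ValueError(f"unsupported gateway action: {action!r}")
    return table[action]


def run_gateway_action(action: str, service_name: str) -> None:
    """Windows-only: run the commands, raising if sc.exe reports an error."""
    for argv in sc_commands(action, service_name):
        # Note: 'sc.exe stop' returns while the service may still be in
        # STOP_PENDING, so a real restart would need to poll 'sc.exe query'
        # until the service reports STOPPED before issuing the start.
        subprocess.run(argv, check=True)
```

Calling the SCM APIs directly through pywin32 would avoid spawning a process per command; sc.exe is used here only because the list above names it.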

Acceptance criteria

  • hermes gateway install on Windows (with admin) registers a service named HermesGateway (per profile: HermesGateway-<profile>).
  • The service has RecoveryActions configured for quadratic backoff with a 60 s reset period (resetPeriodSecs=60, as in the Tailscale snippet).
  • hermes update no longer needs CREATE_BREAKAWAY_FROM_JOB for the gateway respawn watcher because the SCM does the restart — the watcher logic can be removed on Windows.
  • Killing the gateway PID externally (e.g. taskkill /F /PID <pid>) results in SCM respawning it within ~2 seconds.
  • No console window ever appears on the user's desktop for any of this.
  • Non-admin users get a graceful fallback to the existing Startup .cmd (no service install).
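The naming and fallback criteria above might be sketched as follows. All three function names are hypothetical. The sketch assumes the default profile uses the bare HermesGateway name, and treats shell32's IsUserAnAdmin as a good-enough elevation check.

```python
import ctypes
from typing import Optional


def service_name(profile: Optional[str] = None) -> str:
    """HermesGateway for the default profile, HermesGateway-<profile> otherwise."""
    return f"HermesGateway-{profile}" if profile else "HermesGateway"


def is_admin() -> bool:
    """True when the current process is elevated; always False off Windows."""
    try:
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    except AttributeError:
        # ctypes.windll does not exist on non-Windows platforms.
        return False


def choose_install_mode(admin: bool) -> str:
    """Pick the service install for admins, else the existing Startup .cmd."""
    return "service" if admin else "startup-cmd"
```

Passing the admin flag in, instead of calling is_admin() inside choose_install_mode, keeps the fallback decision testable without an elevated shell.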

Labels: P3 (Low: cosmetic, nice to have); area/config (config system, migrations, profiles); comp/gateway (gateway runner, session dispatch, delivery); type/feature (new feature or request)
