[Feature]: Make Docker HEALTHCHECK mode-aware for gateway and dashboard #9751

@luisalrp

Problem or Use Case

Hermes’ Docker image can now expose a generic health signal based on whether PID 1 is alive. That signal is weaker than what Docker could report for the two main long-running service modes: hermes gateway run and hermes dashboard.

For those modes, process liveness is not always the best signal:

  • a gateway process may still exist after Hermes has recorded startup_failed, or while it is otherwise not operational
  • a dashboard process may still exist even if the local web server is not actually serving requests

As a result, Docker health status is less useful than it could be for Compose deployments, dashboards, restart policies, and operational monitoring. Users running Hermes as a service would benefit from an application-level healthcheck when Hermes is in a known service mode, while still preserving a safe fallback for interactive CLI and one-off commands.

Proposed Solution

Make the Docker healthcheck script mode-aware by inspecting PID 1’s command line and selecting the probe strategy based on the active Hermes mode.

Suggested behavior:

  • If PID 1 is running hermes gateway run:
    • read the gateway runtime status from Hermes’ persisted status file
    • report healthy only when the gateway state indicates a live running gateway
    • report unhealthy for states like startup_failed, missing/stale state, or a dead gateway process
  • If PID 1 is running hermes dashboard:
    • probe the local dashboard server with an HTTP request
    • use the existing dashboard status endpoint, e.g. GET /api/status
    • report healthy only when the local dashboard responds successfully
  • For all other commands:
    • fall back to the generic process-level check
    • healthy if PID 1 exists and is not a zombie
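
The three probes above could be sketched as small shell functions. This is only an illustration under several assumptions, not Hermes' actual implementation: the status file location and its JSON shape (a "state" field whose healthy value is "running"), the HERMES_GATEWAY_STATUS_FILE and HERMES_DASHBOARD_PORT variables, the port 8080 placeholder, and the availability of curl in the image are all hypothetical. Only the startup_failed state and the /api/status endpoint come from this issue. Staleness and dead-process detection are left out because they depend on which timestamps or PIDs Hermes actually persists.

```shell
#!/bin/sh
# Sketch of the three probe strategies. Each function returns 0 when
# healthy and non-zero when unhealthy.

# Generic fallback: PID 1 exists and is not a zombie.
# The path argument exists only so the function can be exercised in tests.
probe_generic() {
    status_file="${1:-/proc/1/status}"
    [ -r "$status_file" ] || return 1
    # /proc/<pid>/status has a line such as "State:<TAB>S (sleeping)".
    state="$(sed -n 's/^State:[[:space:]]*\([A-Za-z]\).*/\1/p' "$status_file")"
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# Gateway probe: healthy only if the persisted status says "running".
# ASSUMPTION: the file is JSON with a "state" field; the real Hermes
# format and path may differ. HERMES_GATEWAY_STATUS_FILE is hypothetical.
probe_gateway() {
    status_file="${1:-${HERMES_GATEWAY_STATUS_FILE:-}}"
    [ -n "$status_file" ] || return 1
    [ -r "$status_file" ] || return 1   # missing state counts as unhealthy
    grep -Eq '"state"[[:space:]]*:[[:space:]]*"running"' "$status_file"
}

# Dashboard probe: healthy only if the local server answers successfully.
# ASSUMPTION: curl is installed; port 8080 and HERMES_DASHBOARD_PORT are
# placeholders, not documented Hermes defaults.
probe_dashboard() {
    url="${1:-http://127.0.0.1:${HERMES_DASHBOARD_PORT:-8080}/api/status}"
    # -f turns HTTP error responses into a non-zero exit; --max-time bounds the probe.
    curl -fsS --max-time 3 -o /dev/null "$url"
}
```

A healthcheck script would call exactly one of these and end with `exit 1` on failure, since Docker treats exit code 0 as healthy and 1 as unhealthy.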

Implementation outline:

  • keep the Dockerfile HEALTHCHECK pointing to a single docker/healthcheck.sh
  • in the script, inspect /proc/1/cmdline to detect the active mode
  • branch to:
    • gateway-aware probe
    • dashboard-aware probe
    • generic fallback
  • document the behavior clearly in the Docker docs so users understand that health semantics depend on the Hermes mode being run
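
As a rough sketch of the detection step, the function below reads a cmdline file and classifies it as gateway, dashboard, or generic. The function name detect_mode and its optional path argument are illustrative choices, not part of Hermes. The match is deliberately strict: anything that does not contain the literal argument sequence falls through to generic, which keeps the behavior conservative for one-off subcommands.

```shell
#!/bin/sh
# Sketch: classify PID 1's command line into a probe mode.
# detect_mode is a hypothetical helper name; the path argument is
# overridable so the function can be tested outside a container.
detect_mode() {
    cmdline_file="${1:-/proc/1/cmdline}"
    cmdline=""
    if [ -r "$cmdline_file" ]; then
        # /proc/<pid>/cmdline separates arguments with NUL bytes;
        # convert them to spaces so the arguments can be pattern-matched.
        cmdline="$(tr '\000' ' ' < "$cmdline_file")"
    fi
    # Pad with spaces so a pattern only matches whole trailing arguments
    # (for example, "run" but not "runner").
    case " $cmdline " in
        *"hermes gateway run "*) echo gateway ;;
        *"hermes dashboard "*)   echo dashboard ;;
        *)                       echo generic ;;
    esac
}
```

The script could then branch with a case statement on the output of detect_mode and run the matching probe. Note that this substring match also accepts wrappers whose argv ends in "hermes" (such as an absolute path to the binary), and that arguments placed between "hermes" and the subcommand would land in the generic fallback.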

This should remain conservative:

  • no assumptions about arbitrary one-off subcommands
  • no requirement that every container mode expose HTTP
  • no breaking behavior for interactive CLI usage

Alternatives Considered

Keep the current generic PID 1 healthcheck for every mode. This is simple and safe, but it does not tell Docker whether the gateway or dashboard is actually operational.

Use an HTTP-only healthcheck for all modes. This does not fit Hermes because many supported container invocations do not run an HTTP service.

Use a gateway-only healthcheck. That would improve the most common service mode, but it would still miss the dashboard case and would make the image behavior less consistent across supported long-running modes.

Feature Type

Performance / reliability

Scope

Medium (few files, < 300 lines)

Contribution

  • I'd like to implement this myself and submit a PR
