Skip to content

test(gateway): derive expected TimeoutStopSec from configured drain timeout#18972

Closed
Sanjays2402 wants to merge 1 commit into
NousResearch:mainfrom
Sanjays2402:fix/main-ci-systemd-timeoutstopsec-assertion
Closed

test(gateway): derive expected TimeoutStopSec from configured drain timeout#18972
Sanjays2402 wants to merge 1 commit into
NousResearch:mainfrom
Sanjays2402:fix/main-ci-systemd-timeoutstopsec-assertion

Conversation

@Sanjays2402

Copy link
Copy Markdown
Contributor

Summary

Fixes two Tests failures observed on main (and therefore propagating to every open PR):

FAILED tests/hermes_cli/test_gateway_service.py::TestGeneratedSystemdUnits::test_user_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout
FAILED tests/hermes_cli/test_gateway_service.py::TestGeneratedSystemdUnits::test_system_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout

Reference run: 25250051126 on 5d3be898a.

Root cause

hermes_cli/gateway.py:1635 computes:

_drain_timeout = int(_get_restart_drain_timeout() or 0)
restart_timeout = max(60, _drain_timeout) + 30

The default agent.restart_drain_timeout was bumped to 180s, which makes the rendered systemd unit emit TimeoutStopSec=210. The two assertions in TestGeneratedSystemdUnits were still pinned to TimeoutStopSec=90 (formerly correct when the default was 60s), so they fail on every CI run.

Fix

Compute the expected value from the same helper the unit generator uses:

@staticmethod
def _expected_timeout_stop_sec() -> int:
    drain = int(gateway_cli._get_restart_drain_timeout() or 0)
    return max(60, drain) + 30

…then assert f"TimeoutStopSec={self._expected_timeout_stop_sec()}" in unit. The assertion now tracks the configured default rather than hardcoding a literal.

Validation

$ pytest tests/hermes_cli/test_gateway_service.py::TestGeneratedSystemdUnits -q
3 passed in 1.82s

Scope

Out of scope

The other ~13 main-CI failures (gateway-restart kill semantics, dotenv vs os.environ, Dockerfile pid1, etc.) — happy to send those as separate focused PRs.

…imeout

The generated systemd unit's TimeoutStopSec is computed by
hermes_cli.gateway as max(60, _get_restart_drain_timeout()) + 30 to
keep enough headroom over the agent's drain phase that systemd doesn't
SIGKILL the cgroup before post-interrupt cleanup (tool-subprocess kill,
adapter disconnect) runs — see issue NousResearch#8202.

The default agent.restart_drain_timeout was bumped to 180s, which
makes the unit emit TimeoutStopSec=210, but the assertions in
``test_{user,system}_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout``
were still pinned to the previous ``TimeoutStopSec=90`` value.

Compute the expected value from the same helper the unit generator
uses, so the assertion tracks the configured default instead of
hardcoding a stale literal.

No production code change. Fixes the two assertion failures observed on
``main`` (run 25250051126).
@teknium1

Copy link
Copy Markdown
Contributor

Closing as superseded by #19936.

Triage notes (high confidence):
tests/hermes_cli/test_gateway_service.py:314-316 on main already defines _expected_timeout_stop_sec() deriving TimeoutStopSec from DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT, landed via merged PR #19936 (fix(gateway): handle planned service stops).

Thanks for the contribution — the underlying problem this PR addresses has been resolved by the linked PR on current main. If you believe this was closed in error, please comment and we'll reopen.

(Bulk-closed during a CLI PR triage sweep.)

@teknium1 teknium1 closed this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/test Test coverage or test infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants