Skip to content

test(docker): repair dashboard tests broken by the insecure-opt-in fix#34204

Merged
benbarclay merged 1 commit into
mainfrom
fix-docker-tests
May 29, 2026
Merged

test(docker): repair dashboard tests broken by the insecure-opt-in fix#34204
benbarclay merged 1 commit into
mainfrom
fix-docker-tests

Conversation

@benbarclay

Copy link
Copy Markdown
Collaborator

The Docker integration test job has been failing on main since fb51253 ("docker: opt in to dashboard --insecure via env var"). Failed run.

Two distinct failures, both fallout from that change being more behaviour-changing than the existing test harness anticipated.

Failure 1 — test_dashboard_port_override (silent regression in an already-existing test)

The test starts the container with just HERMES_DASHBOARD=1, defaults to host=0.0.0.0, no HERMES_DASHBOARD_OAUTH_CLIENT_ID, no HERMES_DASHBOARD_INSECURE. Pre-fix that combination got --insecure auto-injected by the s6 run script (anything non-loopback was implicitly insecure), so the OAuth gate stayed off and start_server bound the port. Post-fix the gate engages, no provider is registered, and start_server raises SystemExit before binding — under s6 the dashboard goes into a restart loop and the test's /proc/net/tcp poll finds nothing.

Same silent regression was masking three sibling tests (test_dashboard_slot_reports_up_when_enabled, test_dashboard_opt_in_starts, test_dashboard_restarts_after_crash) — they all only sample pgrep or s6-svstat and so caught the supervised process mid-restart-loop, appearing to pass while the dashboard was actually never reaching a healthy state.

Fix: pin HERMES_DASHBOARD_INSECURE=1 on every test that enables the dashboard but doesn't itself exercise the auth gate. Each pinned site carries an inline comment pointing back to test_dashboard_slot_reports_up_when_enabled for the full rationale.

Failure 2 — test_dashboard_oauth_gate_engages_on_non_loopback_bind (bug in the test I added in fb51253)

The probe used urllib.request.urlopen() against /api/status. Under the now-engaged OAuth gate /api/status no longer answers unauthenticated callers (the gate middleware runs upstream of the legacy _SESSION_TOKEN allowlist and 401s anything without a valid session cookie). urlopen() raises HTTPError on the 401, the wrapper treated that as "not ready yet", and the poll loop hit timeout.

Fix: split the probe into a generic _http_probe() helper that returns (status_code, body) for any HTTP response — including 401, which IS the gate-engaged success signal. The helper feeds a multi-line Python program over stdin via a POSIX heredoc so the try/except branch reads naturally; far less fragile than the earlier semicolon-laden -c one-liner.

The OAuth-gate test now verifies two independent observable consequences of the gate being on:

  1. GET /api/auth/providers (publicly reachable through the gate so the login page can bootstrap) returns 200 with nous in the provider list — proves the bundled provider registered.
  2. GET /api/status returns 401 — proves the OAuth gate runs upstream of the legacy public-paths allowlist and is actively intercepting unauthenticated callers.

The insecure-opt-out test still hits /api/status, but now asserts status_code == 200 first (proves the gate is bypassed) before parsing the JSON for auth_required: false (proves the gate-state flag is also correctly off).

Verification

Verified locally end-to-end against a fresh image build on a real Docker daemon: all 41 tests under tests/docker/ pass in 2m38s, including the two formerly-failing dashboard tests and the three sibling tests that were passing by accident.

======================== 41 passed in 157.93s (0:02:37) ========================

The Docker integration test job started failing on main after
fb51253 ("docker: opt in to dashboard --insecure via env var").
Two distinct failures, both fallout from that change being more
behaviour-changing than the existing test harness anticipated.

Failure 1 — test_dashboard_port_override (silent regression in an
already-existing test)
The test starts the container with just HERMES_DASHBOARD=1, defaults
to host=0.0.0.0, no HERMES_DASHBOARD_OAUTH_CLIENT_ID, no
HERMES_DASHBOARD_INSECURE. Pre-fix that combination got --insecure
auto-injected by the s6 run script (anything non-loopback was
implicitly insecure), so the OAuth gate stayed off and start_server
bound the port. Post-fix the gate engages, no provider is
registered, and start_server raises SystemExit before binding —
under s6 the dashboard goes into a restart loop and the test's
/proc/net/tcp poll finds nothing.

Same silent regression was masking three sibling tests
(test_dashboard_slot_reports_up_when_enabled, test_dashboard_opt_in_starts,
test_dashboard_restarts_after_crash) — they all only sample pgrep
or s6-svstat and so caught the supervised process mid-restart
loop, appearing to pass while the dashboard was actually never
reaching a healthy state.

Fix: pin HERMES_DASHBOARD_INSECURE=1 on every test that enables
the dashboard but doesn't itself exercise the auth gate. Each
pinned site carries an inline comment pointing back to
test_dashboard_slot_reports_up_when_enabled for the full
rationale.

Failure 2 — test_dashboard_oauth_gate_engages_on_non_loopback_bind
(bug in the test I added in fb51253)
The probe used urllib.request.urlopen() against /api/status. Under
the now-engaged OAuth gate /api/status no longer answers
unauthenticated callers (the gate middleware runs upstream of the
legacy _SESSION_TOKEN allowlist and 401s anything without a valid
session cookie). urlopen() raises HTTPError on the 401, the wrapper
treated that as "not ready yet", and the poll loop hit
timeout.

Fix: split the probe into a generic _http_probe() helper that
returns (status_code, body) for any HTTP response — including 401,
which IS the gate-engaged success signal. The helper feeds a
multi-line Python program over stdin via a POSIX heredoc so the
try/except branch reads naturally; far less fragile than the
earlier semicolon-laden -c one-liner.

The OAuth-gate test now verifies two independent observable
consequences of the gate being on:

  1. GET /api/auth/providers (publicly reachable through the gate
     so the login page can bootstrap) returns 200 with `nous` in
     the provider list — proves the bundled provider registered.
  2. GET /api/status returns 401 — proves the OAuth gate runs
     upstream of the legacy public-paths allowlist and is
     actively intercepting unauthenticated callers.

The insecure-opt-out test still hits /api/status, but now
asserts status_code == 200 first (proves the gate is bypassed)
before parsing the JSON for auth_required: false (proves the
gate-state flag is also correctly off).

Verified locally end-to-end against a fresh image build on a
real Docker daemon: all 41 tests under tests/docker/ pass in
2m38s, including the two formerly-failing dashboard tests and
the three sibling tests that were passing by accident.
@benbarclay benbarclay merged commit 1b1e305 into main May 29, 2026
19 of 22 checks passed
@benbarclay benbarclay deleted the fix-docker-tests branch May 29, 2026 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant