test(docker): repair dashboard tests broken by the insecure-opt-in fix#34204
Merged
Conversation
The Docker integration test job started failing on main after fb51253 ("docker: opt in to dashboard --insecure via env var"). Two distinct failures, both fallout from that change being more behaviour-changing than the existing test harness anticipated. Failure 1 — test_dashboard_port_override (silent regression in an already-existing test) The test starts the container with just HERMES_DASHBOARD=1, defaults to host=0.0.0.0, no HERMES_DASHBOARD_OAUTH_CLIENT_ID, no HERMES_DASHBOARD_INSECURE. Pre-fix that combination got --insecure auto-injected by the s6 run script (anything non-loopback was implicitly insecure), so the OAuth gate stayed off and start_server bound the port. Post-fix the gate engages, no provider is registered, and start_server raises SystemExit before binding — under s6 the dashboard goes into a restart loop and the test's /proc/net/tcp poll finds nothing. Same silent regression was masking three sibling tests (test_dashboard_slot_reports_up_when_enabled, test_dashboard_opt_in_starts, test_dashboard_restarts_after_crash) — they all only sample pgrep or s6-svstat and so caught the supervised process mid-restart loop, appearing to pass while the dashboard was actually never reaching a healthy state. Fix: pin HERMES_DASHBOARD_INSECURE=1 on every test that enables the dashboard but doesn't itself exercise the auth gate. Each pinned site carries an inline comment pointing back to test_dashboard_slot_reports_up_when_enabled for the full rationale. Failure 2 — test_dashboard_oauth_gate_engages_on_non_loopback_bind (bug in the test I added in fb51253) The probe used urllib.request.urlopen() against /api/status. Under the now-engaged OAuth gate /api/status no longer answers unauthenticated callers (the gate middleware runs upstream of the legacy _SESSION_TOKEN allowlist and 401s anything without a valid session cookie). urlopen() raises HTTPError on the 401, the wrapper treated that as "not ready yet", and the poll loop hit timeout. Fix: split the probe into a generic _http_probe() helper that returns (status_code, body) for any HTTP response — including 401, which IS the gate-engaged success signal. The helper feeds a multi-line Python program over stdin via a POSIX heredoc so the try/except branch reads naturally; far less fragile than the earlier semicolon-laden -c one-liner. The OAuth-gate test now verifies two independent observable consequences of the gate being on: 1. GET /api/auth/providers (publicly reachable through the gate so the login page can bootstrap) returns 200 with `nous` in the provider list — proves the bundled provider registered. 2. GET /api/status returns 401 — proves the OAuth gate runs upstream of the legacy public-paths allowlist and is actively intercepting unauthenticated callers. The insecure-opt-out test still hits /api/status, but now asserts status_code == 200 first (proves the gate is bypassed) before parsing the JSON for auth_required: false (proves the gate-state flag is also correctly off). Verified locally end-to-end against a fresh image build on a real Docker daemon: all 41 tests under tests/docker/ pass in 2m38s, including the two formerly-failing dashboard tests and the three sibling tests that were passing by accident.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The Docker integration test job has been failing on main since fb51253 ("docker: opt in to dashboard
--insecurevia env var"). Failed run.Two distinct failures, both fallout from that change being more behaviour-changing than the existing test harness anticipated.
Failure 1 —
test_dashboard_port_override(silent regression in an already-existing test)The test starts the container with just
HERMES_DASHBOARD=1, defaults tohost=0.0.0.0, noHERMES_DASHBOARD_OAUTH_CLIENT_ID, noHERMES_DASHBOARD_INSECURE. Pre-fix that combination got--insecureauto-injected by the s6 run script (anything non-loopback was implicitly insecure), so the OAuth gate stayed off andstart_serverbound the port. Post-fix the gate engages, no provider is registered, andstart_serverraisesSystemExitbefore binding — under s6 the dashboard goes into a restart loop and the test's/proc/net/tcppoll finds nothing.Same silent regression was masking three sibling tests (
test_dashboard_slot_reports_up_when_enabled,test_dashboard_opt_in_starts,test_dashboard_restarts_after_crash) — they all only samplepgrepors6-svstatand so caught the supervised process mid-restart-loop, appearing to pass while the dashboard was actually never reaching a healthy state.Fix: pin
HERMES_DASHBOARD_INSECURE=1on every test that enables the dashboard but doesn't itself exercise the auth gate. Each pinned site carries an inline comment pointing back totest_dashboard_slot_reports_up_when_enabledfor the full rationale.Failure 2 —
test_dashboard_oauth_gate_engages_on_non_loopback_bind(bug in the test I added in fb51253)The probe used
urllib.request.urlopen()against/api/status. Under the now-engaged OAuth gate/api/statusno longer answers unauthenticated callers (the gate middleware runs upstream of the legacy_SESSION_TOKENallowlist and 401s anything without a valid session cookie).urlopen()raisesHTTPErroron the 401, the wrapper treated that as "not ready yet", and the poll loop hit timeout.Fix: split the probe into a generic
_http_probe()helper that returns(status_code, body)for any HTTP response — including 401, which IS the gate-engaged success signal. The helper feeds a multi-line Python program over stdin via a POSIX heredoc so the try/except branch reads naturally; far less fragile than the earlier semicolon-laden-cone-liner.The OAuth-gate test now verifies two independent observable consequences of the gate being on:
GET /api/auth/providers(publicly reachable through the gate so the login page can bootstrap) returns 200 withnousin the provider list — proves the bundled provider registered.GET /api/statusreturns 401 — proves the OAuth gate runs upstream of the legacy public-paths allowlist and is actively intercepting unauthenticated callers.The insecure-opt-out test still hits
/api/status, but now assertsstatus_code == 200first (proves the gate is bypassed) before parsing the JSON forauth_required: false(proves the gate-state flag is also correctly off).Verification
Verified locally end-to-end against a fresh image build on a real Docker daemon: all 41 tests under
tests/docker/pass in 2m38s, including the two formerly-failing dashboard tests and the three sibling tests that were passing by accident.