feat: add --port-check-only flag to install.sh#185
Conversation
Run the same outbound + inbound port-25 checks as `aimx portcheck`, then exit without installing. Lets evaluators verify a VPS can reach SMTP before committing to install/setup: curl -fsSL https://aimx.email/install.sh | sh -s -- --port-check-only - New --port-check-only / --verify-host flags + AIMX_VERIFY_HOST env - Outbound EHLO ladder: python3 -> nc -> bash /dev/tcp - Inbound /probe via curl with loose "reachable":true substring match - Non-root inbound -> [skip] (does not affect exit code) - Root + free port -> spawn temp Python SMTP listener (mirrors src/portcheck.rs:69-129) - Banner switches to "aimx port 25 connectivity check" - Exits 0/1/2 (pass-or-skip / fail / missing tool) - Branches in main() before ensure_sudo / network / fs writes - POSIX sh; sh -n passes; tests/install_sh.sh extended with parse_args + verify-host validation + banner/help assertions - AIMX_DRY_RUN=1 unchanged
uzyn
left a comment
There was a problem hiding this comment.
Standalone PR Review — PR #185 feat: add --port-check-only flag to install.sh
Changes Overview
Adds a --port-check-only mode to install.sh that runs the same outbound + inbound port-25 connectivity checks as the Rust aimx portcheck subcommand, then exits without installing anything. Adds a --verify-host flag and AIMX_VERIFY_HOST env var (default https://check.aimx.email). Outbound EHLO uses a python3 → nc → bash /dev/tcp ladder; inbound uses a curl GET against ${VERIFY_HOST}/probe with a loose "reachable":true substring match. Non-root operators get [skip] for inbound (does not affect exit code). tests/install_sh.sh is extended with parse_args, banner, validate_verify_host, and derive_smtp_host assertions.
Scope Alignment
The implementation matches the PR description well. All 13 requirement boxes are checked off, the short-circuit lives in main() before ensure_sudo/resolve_sudo_prefix/network/fs writes, the exit-code contract (0/1/2) is implemented, the banner switch is in place, and AIMX_DRY_RUN=1 (without --port-check-only) is verified to leave existing behavior unchanged. Out-of-scope items (no separate chkport25.sh, no IPv6, no auth) are honored. POSIX-sh compliance verified via sh -n install.sh.
Potential Bugs
1. Listener bind race not surfaced — Non-blocker
port_check_listener_start (lines 721-761) backgrounds python3 - <<'PYEOF' & and immediately captures $!. If s.bind(('0.0.0.0', 25)) fails (race with another binder, EACCES on some configs, etc.), the python child exits 1 immediately. The script then sleep 1 and calls /probe against an unreachable port — the verifier reports unreachable and the operator sees a generic "verifier reported port 25 unreachable" message instead of "tried to spawn a temp listener but it never bound". port_check_listener_stop is then a no-op (kill -0 returns false), so cleanup is OK but the diagnostic is poor. The Rust equivalent in src/portcheck.rs:44-53 does a synchronous bind and surfaces an explicit error before continuing. Worth a short post-spawn liveness probe (kill -0 ${_pid} after sleep 1) with a clearer message when the listener died.
2. Listener loop exits on first idle window — Non-blocker
port_check_listener_start's python loop (lines 731-739) breaks the entire while time.time() < deadline loop on the first socket.timeout from accept(). After serving one connection, if no second connection arrives within the remaining ~25 sec timeout, the listener exits. In practice /probe only does one EHLO probe so this works, but the Rust equivalent (src/portcheck.rs:73-82) keeps looping until the shutdown signal — a stricter parity claim than what the script delivers. If the verifier ever retries (e.g., one TCP probe + one EHLO probe), the second call hits a closed socket. Confirm /probe is single-attempt before merge or change the Python break on socket.timeout to continue (with the outer deadline check still bounding the loop).
3. Listener bind delay = fixed sleep 1 — Non-blocker
On slow VPS / cold cache, python3 -c '...' may not bind within one second. There is no readiness probe (e.g., a quick ss -tln | grep ':25' poll loop). Best-effort, but on a slow first-run the inbound check can flap. Acceptable for the v1 of this feature; document or harden later.
4. Occupancy detection ignores who's holding port 25 — Non-blocker, but a UX gap
port_check_detect_occupancy returns "occupied" when ss/netstat sees something on :25. The script then assumes "Existing daemon will reply to /probe" (line 808). If the operator already has Postfix/Sendmail/Exim listening, /probe will succeed (Postfix replies 220 to the verifier's EHLO) and the script reports [ok], suggesting aimx will install cleanly. The Rust aimx portcheck distinguishes this case (Port25Status::OtherProcess(name) → error with the process name). The shell script doesn't, so an operator can pass --port-check-only on a Postfix-managed VPS and walk away thinking they're ready, then have aimx setup fail later on the same port. The PR description doesn't promise this parity, so it's a non-blocker, but a one-line warning when occupancy is "occupied" would close the gap (e.g., "port 25 is held by another process; verify it's aimx before running setup").
5. Missing curl exits with code 1 instead of 2 — Non-blocker
port_check_main calls need curl (line 845), which exits 1 via err. The PR description's exit-code contract says 2 is for "missing required tool". curl is a required tool here. Minor inconsistency with the documented contract.
6. validate_verify_host runs unconditionally on every install — Non-blocker, mild UX surprise
main() runs resolve_verify_host + validate_verify_host (lines 1031-1032) before the --port-check-only short-circuit. A regular install with AIMX_VERIFY_HOST=garbage exported in the operator's shell will now fail with "verify-host must start with http:// or https://" before any install side effect — even though the env var is documented as "Verifier base URL for --port-check-only" only. Either gate the validate call behind ${PORT_CHECK_ONLY}=1, or extend the help text to note the env var also gates the install path. Mild, but worth a thought.
7. port_check_outbound_nc may hang on some nc variants — Non-blocker, comment correctly admits the constraint
The fallback explicitly avoids -q/-N/-c for portability. On some nc implementations (BusyBox, certain Solaris-derived BSD-nc) the connection won't close after stdin EOF, causing nc to wait for the server to close the TCP half. With nc -w 5 this caps at 5 seconds, so worst-case an extra 5-second wait per fallback invocation. Acceptable; the comment in source already flags this.
Security Issues
None. The script is pure read/probe behavior; no fs writes in port_check_main, no privileged ops, no shell injection vectors (verify-host is parameter-substituted into curl args, never eval'd; the python heredoc passes host/port as sys.argv not via string interpolation; bash heredoc passes them as positional args). The HTTPS-only trust anchor is honored when scheme is https; http is permitted only as documented.
Test Coverage
Covered (good):
parse_argsfor--port-check-only,--verify-host(space + equals form), and missing-value error pathvalidate_verify_hostrejectsftp://resolve_verify_hostprecedence: flag > env > default- Default
DEFAULT_VERIFY_HOSTvalue --helpmentions new flag, env var--port-check-only --helpexits 0 and prints helpderive_smtp_hoststrips scheme, path, portprint_port_check_bannerprints correct title and "no install" notice- Negative cases:
--tag/--todoes not setPORT_CHECK_ONLY;AIMX_DRY_RUN=1(without --port-check-only) does not switch to port-check banner
Not covered (explicitly out of scope, but worth noting):
port_check_outbound_python/nc/bash— no fake-server harness; correctness relies on manual smokeport_check_inboundorchestration matrix (non-root → skip, root + occupied → /probe, root + free → spawn listener)port_check_mainexit codes 0/1/2 for the actual end-to-end paths- The "no outbound tool on PATH" branch returning 2 with the hint message
- IP extraction from
/probebody (sed parser)
These would require either a fake SMTP server fixture or a mock /probe endpoint — fair to defer, but the manual smoke note in the test plan is the only safety net for the network-touching branches.
Code Quality
Non-blocker — the _inbound_ok variable mixes numeric ("1", "0") and string ("skip") sentinel values. Using [ "${_inbound_ok}" = "1" ] (string compare) for the success path is fine, but the comparison style is asymmetric with _outbound_ok which uses -eq (numeric). Consider a single string-typed _inbound_state of pass | fail | skip for symmetry; or normalize both to strings.
Non-blocker — duplication between port_check_outbound_python and port_check_listener_start Python heredocs. Both implement near-identical SMTP state machines (220 banner, 250 EHLO/HELO, 502 on unknown). Each is small and the duplication is between roles (client vs server), so this is fine — flag for awareness only.
Documentation
Non-blocker — user-facing docs not updated. Per CLAUDE.md: "When making changes that affect CLI behavior, setup output, or MCP tools, update the corresponding guide files too." The new --port-check-only flag is a new public-facing entrypoint (the curl one-liner is in the PR summary), but:
website/public/index.html— still only showscurl -fsSL https://aimx.email/install.sh | sh. The new pre-install verification one-liner could be surfaced near the existing "You need sudo — port 25 is privileged" sentence.book/— no mention of--port-check-onlyin any chapter (installation, troubleshooting, getting-started). The installation chapter is the natural home.
This is the most actionable follow-up. Either fold a one-paragraph mention into this PR or open an explicit follow-up note in the description.
Summary and Recommended Actions
Overall verdict: Needs minor fixes
The implementation is clean, the scope is met, the test coverage of pure-shell helpers is solid, and the security posture is sound. Bugs identified are non-blockers but accumulate in three areas: listener-lifecycle robustness (#1, #2, #3), parity with the Rust portcheck's port-conflict warning (#4), and documentation (the docs gap). None of these can crash, corrupt data, or break the install path — they're operator-experience refinements.
Blockers (must fix before merge): none.
Non-blockers (should fix but not blocking):
- (#1) Surface listener-bind failures explicitly instead of silently relying on
/probeto fail. - (#2) Confirm
/probeis single-attempt — if it ever retries, the listener loop'sbreak-on-first-timeoutwill drop the second connection. - (#4) Warn when port 25 occupancy is "occupied" by a non-aimx process, matching
aimx portcheck'sPort25Status::OtherProcesshandling. - (#5) Use exit code 2 (not 1) when curl is missing, to match the documented contract.
- (#6) Either gate
validate_verify_hostbehindPORT_CHECK_ONLY=1or document thatAIMX_VERIFY_HOSTaffects the install path too. - Update
website/public/index.htmlandbook/to mention the new--port-check-onlyflag.
Nice-to-haves (low priority):
- Replace fixed
sleep 1with a short readiness poll on:25. - Normalize
_outbound_ok/_inbound_okto a single string-state convention. - Add an integration-style test that fakes a
/probeendpoint (e.g., python -m http.server with a stub) to exercise the inbound state matrix.
- listener: continue (not break) on socket.timeout so a second probe
attempt within the deadline still gets served. Mirrors the
shutdown-bound loop in src/portcheck.rs:73-82.
- listener: liveness probe (kill -0 ${_listener_pid}) after the bind
delay surfaces a distinct "failed to spawn temp listener on :25"
error instead of letting /probe report a generic "unreachable".
- occupancy: warn the operator when port 25 is held by another
process — /probe can't distinguish aimx from Postfix/Exim, so a
green [ok] against an existing daemon was misleading. Mirrors
Port25Status::OtherProcess in src/portcheck.rs.
- missing curl now exits 2 (was 1) per the documented contract:
0 pass-or-skip, 1 fail, 2 missing required tool. Replaced
`need curl` in port_check_main with an explicit check that exits
the right code.
- AIMX_VERIFY_HOST is now resolved + validated only when
--port-check-only is set. A garbage value in the operator's shell
no longer fails regular installs (the env var was never relevant
to the install path).
- docs: surface --port-check-only in website/public/index.html (Quick
start), add a Pre-install subsection in book/getting-started.md,
and cross-reference from book/setup.md and book/troubleshooting.md.
- tests: 14 new assertions covering env-var gating, listener loop
shape, listener liveness probe, occupancy warning, and
missing-curl exit code (both source-grep and live PATH-scrub).
Cycle 2 — review feedback addressedPushed Fixed in code
Tests
84/86 assertions pass. The 2 fails are the pre-existing shellcheck Intentionally left as-is (deferred per task brief)These three were called out in the review as nice-to-haves and the task brief listed them as deferrable:
Validation run locally
|
uzyn
left a comment
There was a problem hiding this comment.
Cycle 2 re-review — APPROVED
The implementer pushed 4cc788f addressing all 6 non-blockers from cycle 1. I verified each fix in the diff and ran the test suite locally (84/86 pass; the 2 failures are the pre-existing shellcheck SC2024 finding on sudo -v </dev/tty at install.sh:251, present on main and untouched by this PR).
Resolved issues
(#1) Listener bind failures silent — RESOLVED. port_check_inbound now does if ! kill -0 "${_listener_pid}" 2>/dev/null after the bind delay and surfaces a distinct failed to spawn temp listener on :25 (bind failed; another binder won the race?) message. _listener_pid="" is set so the cleanup path correctly skips a no-op kill. Asserted by source-grep tests.
(#2) Listener loop break-on-first-timeout — RESOLVED. break on socket.timeout is now continue. The outer while time.time() < deadline still bounds the loop, and the inner except Exception: break for non-timeout errors is preserved. Two assertions confirm both the presence of continue and the absence of break in the timeout handler.
(#4) Occupancy detection silently treats foreign processes as aimx — RESOLVED. ui_warn "port 25 is held by another process; verify it's aimx before running setup" fires in the occupied branch before /probe runs. The warning ordering is correct (emitted before the inbound result line), though the warning interrupts the partial ' Inbound port 25 .... ' line — a minor cosmetic glitch, not a blocker.
(#5) need curl exits 1 instead of 2 — RESOLVED. Replaced with explicit command -v curl >/dev/null 2>&1 || return 2 and a clear error message. Verified by both source-grep (no need curl in port_check_main) and a live PATH-scrub test that asserts rc=2.
(#6) validate_verify_host ran unconditionally — RESOLVED. resolve_verify_host + validate_verify_host moved inside the if [ "${PORT_CHECK_ONLY}" = "1" ] short-circuit. Verified locally:
AIMX_VERIFY_HOST=garbage AIMX_DRY_RUN=1 sh install.sh --target x86_64-unknown-linux-gnu --tag 0.1.0→ exit 0 (regression check).AIMX_VERIFY_HOST=garbage sh install.sh --port-check-only→ exit 1 with scheme error.
(#7) Docs not updated — RESOLVED.
website/public/index.htmlQuick start now shows the--port-check-onlycurl one-liner alongside the standard install command.book/getting-started.mdhas a new "Pre-install: check port 25" subsection covering the one-liner, sudo behavior, exit codes, and--verify-host/AIMX_VERIFY_HOST.book/setup.mdandbook/troubleshooting.mdcross-reference the pre-install check.
Still unresolved
None.
New issues from cycle-2 fixes
None. I checked specifically for:
- Cleanup leak when bind-liveness fails — guarded by
[ -n "${_listener_pid}" ], no spurious kill. - Warning ordering —
ui_warninterrupts the partial' Inbound port 25 .... 'line; cosmetic only, not a blocker. - Continue-on-timeout regression — the inner
except Exception: breakpreserves the original failure exit for non-timeout errors. - Test PATH isolation — the live PATH-scrub test correctly seeds
iso-bin-no-curlwith the minimum tool set; no false-positive risk.
Explicitly deferred (acknowledged as non-blockers, not required for merge)
- Fixed
sleep 1→ readiness poll on:25. The newkill -0liveness probe catches the most common bind-failure mode that a readiness poll would address; harden later if slow-VPS reports come in. - Normalize
_outbound_ok(numeric) /_inbound_ok(mixed) sentinel conventions. Pure cosmetic, no behavior change. - Integration tests with a fake
/probeendpoint. The live PATH-scrub test exercises one end-to-end branch (curl-missing → exit 2) without that scaffolding; broader fixture is a follow-up.
Local validation run
sh -n install.sh— POSIX OKsh tests/install_sh.sh— 84/86 (pre-existing SC2024 baseline failures only)sh install.sh --port-check-only --help— exit 0AIMX_VERIFY_HOST=garbage AIMX_DRY_RUN=1 sh install.sh ...— exit 0 (#6 regression check)AIMX_VERIFY_HOST=garbage sh install.sh --port-check-only— exit 1 with scheme error
Verdict: Ready to merge
All cycle-1 non-blockers are resolved with focused, minimal changes; no new bugs introduced; tests cover the new behavior; documentation surfaces the new entrypoint in the right places.
Recommended merge commit message
feat(install.sh): add --port-check-only flag
Run the same outbound + inbound port-25 connectivity checks as
`aimx portcheck`, then exit without installing. Lets evaluators
verify a VPS can reach SMTP before committing to install/setup:
curl -fsSL https://aimx.email/install.sh | sh -s -- --port-check-only
- New --port-check-only / --verify-host flags + AIMX_VERIFY_HOST env
(verify-host validation gated behind --port-check-only so a stray
env var in the operator's shell doesn't block regular installs).
- Outbound EHLO ladder: python3 -> nc -> bash /dev/tcp.
- Inbound /probe via curl with loose "reachable":true substring match.
- Non-root inbound -> [skip] (does not affect exit code).
- Root + free port -> spawn temp Python SMTP listener with kill -0
liveness probe surfacing distinct bind-failed error (mirrors
src/portcheck.rs:44-53).
- Listener loop continues on socket.timeout so the deadline bounds
the loop while a second probe within the window is served.
- Root + occupied -> ui_warn + /probe (warns the green [ok] is only
meaningful if the holder is aimx, mirroring Port25Status::OtherProcess).
- Banner switches to "aimx port 25 connectivity check".
- Exits 0/1/2 (pass-or-skip / fail / missing tool); explicit curl
check returns 2 per the documented contract.
- Branches in main() before ensure_sudo / network / fs writes.
- POSIX sh; sh -n passes; tests/install_sh.sh extended with parse_args
+ verify-host validation + banner/help + listener loop shape +
liveness probe + occupancy warning + missing-curl-exits-2 assertions.
- Docs: --port-check-only surfaced on website Quick start, new
Pre-install subsection in book/getting-started.md, cross-refs in
book/setup.md and book/troubleshooting.md.
- AIMX_DRY_RUN=1 unchanged.
Summary
Add
--port-check-onlyflag to install.sh that runs the same outbound + inbound port-25 checks asaimx portcheck, then exits without installing. Lets evaluators verify a VPS can reach SMTP before committing to install/setup.One-liner:
curl -fsSL https://aimx.email/install.sh | sh -s -- --port-check-onlyRequirements
--port-check-onlyflag short-circuits before install path--verify-host <URL>flag +AIMX_VERIFY_HOSTenv var override defaulthttps://check.aimx.email--helptext lists new flag, env var, and a curl-pipe exampleRealNetworkOps::check_outbound_port25(src/setup.rs:829-876)/probecall via curl with loose"reachable":truesubstring match (src/setup.rs:774-787)[skip](does NOT fail exit code); root + occupied → call /probe directly; root + free → spawn temp Python listener; root + free + no python3 →[skip]sh -n install.shpassesAIMX_DRY_RUN=1behavior unchangedOut of scope
chkport25.sh); folded into install.sh by explicit decisionaimx portcheckRust subcommandTechnical approach
PORT_CHECK_ONLY=0,VERIFY_HOST="",DEFAULT_VERIFY_HOST="https://check.aimx.email".--port-check-only,--verify-host, and--verify-host=<URL>cases.port_check_mainbefore any privileged or networked work.check_outbound_port25byte-for-byte (220 banner, EHLO aimx, consume 250- continuations, accept on 250 SP, QUIT).with_temp_smtp_listenerin src/portcheck.rs:69-129); nc -l semantics vary too widely between BSD/GNU/ncat to support reliably."reachable":truesubstring match (same as src/setup.rs:783) plus a sed extraction for theipfield.Test plan
sh -n install.sh(POSIX syntax)sh tests/install_sh.sh(existing + new assertions all pass)sh install.sh --port-check-only --helpexits 0sh install.sh --help | grep -q port-check-onlyexits 0AIMX_DRY_RUN=1 sh install.shstill produces the existing dry-run output (no port check)sh install.sh --port-check-onlynon-root → outbound runs, inbound [skip];sudo sh install.sh --port-check-onlyon a port-25-open VPS → both [ok].github/workflows/site.yml:51already runssh -n website/dist/install.shaftermake build; no change needed thereNotes
validate_verify_hostin src/setup.rs:790-801).&, PID captured, killed after /probe returns; SO_REUSEADDR set so re-runs don't TIME_WAIT-block.