Skip to content

[Sprint 5] Adjacent cleanups + end-to-end CI smoke#198

Merged
uzyn merged 3 commits into
mainfrom
user-mailbox-sprint-5
May 3, 2026
Merged

[Sprint 5] Adjacent cleanups + end-to-end CI smoke#198
uzyn merged 3 commits into
mainfrom
user-mailbox-sprint-5

Conversation

@uzyn

@uzyn uzyn commented May 3, 2026

Copy link
Copy Markdown
Owner

Summary

  • Rename AIMX_TEST_SKIP_ROOT_CHECK to AIMX_TEST_SKIP_AUTHZ_CHECK so the name describes what the env var actually skips: the authorize() call gating Action::HookCrud, not a generic root check. Clean break — no deprecation alias. CLAUDE.md test-env section updated atomically.
  • Finish the format_auth_error consolidation. Delete the format_hook_auth_error shim in src/hooks.rs; route the two inline pre-flight call sites and the socket-missing + non-root hook CRUD path through auth::format_auth_error directly with the appropriate AuthErrorContext. The canonical renderer in auth.rs is now the only authz-error rendering surface (grep -nE 'fn format_.*auth_error' src/ returns only the canonical renderer plus its own test fns).
  • Add three sudo-lane integration tests that collectively exercise all 12 MCP tools against a chmod 0600 root:root config + running aimx serve, with the MCP process spawned under runuser -u aimx-test-alice so the bug class (non-root MCP failing to read root-owned config) is in scope. The existing mailbox-dir-perms-isolation CI lane is extended with a new step that runs the three tests under sudo; no new CI lanes or infrastructure introduced.

Test plan

  • cargo fmt -- --check clean
  • cargo clippy --all-targets -- -D warnings clean
  • cargo test green on the default lane (1034 passed, 8 ignored)
  • Verifier crate cargo fmt -- --check + cargo clippy --all-targets -- -D warnings clean
  • New tests recognized by the test runner and gated #[ignore] behind AIMX_INTEGRATION_SUDO=1
  • Sudo-lane CI green when run (the three new tests exercise all 12 MCP tools end-to-end against a production-shape config)

Renames the test-env escape hatch AIMX_TEST_SKIP_ROOT_CHECK to
AIMX_TEST_SKIP_AUTHZ_CHECK so the name matches what it actually does
(bypass the authorize() call gating Action::HookCrud, not just a
root check). Clean break — no deprecation alias.

Finishes the format_auth_error consolidation. Deletes the
format_hook_auth_error shim in src/hooks.rs and routes both inline
authz errors and the socket-missing + non-root hook CRUD path through
auth::format_auth_error directly. The single canonical renderer in
auth.rs is now the only authz-error rendering surface.

Adds three sudo-lane integration tests that collectively exercise all
12 MCP tools against a chmod 0600 root:root config + running aimx
serve, with the MCP process spawned under runuser -u
aimx-test-alice so the bug class ("non-root MCP cannot read
root-owned config") is in scope:

  - mailbox_list_full_cycle_against_root_owned_config covers
    mailbox_list / mailbox_create / mailbox_delete
  - email_list_full_cycle_against_root_owned_config covers
    email_list / email_read / email_mark_read / email_mark_unread /
    email_send / email_reply
  - hook_list_full_cycle_against_root_owned_config covers
    hook_list / hook_create / hook_delete

Tests are gated behind both #[ignore] and AIMX_INTEGRATION_SUDO=1.
The existing mailbox-dir-perms-isolation CI lane is extended with a
new step that builds the integration test binary and runs the three
new tests under sudo. No new CI lanes or infrastructure introduced.

@uzyn uzyn left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sprint Goal Assessment

The sprint goal is met. The env-var rename (S5-1) is a clean break with no leftover references; format_auth_error is now the single rendering surface for every authz-error path in src/hooks.rs (S5-2); and three sudo-lane integration tests collectively exercise all 12 MCP tools against a chmod 0600 root:root config + running daemon, wired into the existing mailbox-dir-perms-isolation CI lane (S5-3). No new CI infrastructure introduced.

Acceptance Criteria Checklist

S5-1 — Rename AIMX_TEST_SKIP_ROOT_CHECKAIMX_TEST_SKIP_AUTHZ_CHECK

  • All read sites in src/hooks.rs use the new name (lines 163, 293).
  • All test fixtures use the new name (tests/integration.rs:4785, 5117, 5179).
  • CLAUDE.md "Test environment escape hatches" updated to describe what it actually skips (the authorize() call gating Action::HookCrud).
  • grep -rn 'AIMX_TEST_SKIP_ROOT_CHECK' returns zero matches across the repo.
  • cargo test green (1034 passed, 8 ignored on default lane).

S5-2 — Finish format_auth_error consolidation

  • src/hooks.rs::format_hook_auth_error deleted; both pre-flight call sites (create line 167, delete line 301) now call auth::format_auth_error directly.
  • The "socket missing + non-root" error path in both create (line 220) and delete (line 325) routes through the canonical renderer with AuthError::NotRoot + AuthErrorContext { surface: "aimx hooks", verb: "create"|"delete", resource: "resource" }.
  • grep -nE 'fn format_.*auth_error' src/ returns only the canonical renderer plus test functions.
  • User-visible wording for the pre-flight authz paths is unchanged (the existing unit tests hook_auth_error_* still assert on the rendered substrings and pass via the new inline rendering helper render_for_hooks).
  • [PARTIAL] Wording for the socket-missing + non-root path changed from "daemon not running, non-root hook CRUD requires daemon" to "not authorized: aimx hooks create requires root (run with sudo)". The AC says "unchanged where reasonable" — there is no test snapshot pinning the old wording, so this is permissible, but the new message loses the "daemon is down, restart it" hint that was in the old one. Implementer flagged this for a Sprint 6 docs callout; agreed — see non-blockers.
  • cargo test green.

S5-3 — End-to-end CI smoke against production-perm config

  • Three new tests in tests/integration.rs exercise every MCP tool against chmod 0600 root:root config + running aimx serve with the MCP process spawned under runuser -u aimx-test-alice:
    • mailbox_list_full_cycle_against_root_owned_config covers mailbox_list, mailbox_create, mailbox_delete (3)
    • email_list_full_cycle_against_root_owned_config covers email_list, email_read, email_mark_read, email_mark_unread, email_send, email_reply (6)
    • hook_list_full_cycle_against_root_owned_config covers hook_list, hook_create, hook_delete (3)
    • Total: 12 tools — matches PRD R44 exactly.
  • Tests gated behind AIMX_INTEGRATION_SUDO=1 + geteuid() == 0 + aimx-test-alice user existence + #[ignore] (defense-in-depth quadruple gate).
  • CI step "Run end-to-end production-perm MCP smoke (under sudo)" added to .github/workflows/ci.yml lines 112–125, runs only the three named tests with --test-threads=1, reusing the existing mailbox-dir-perms-isolation lane and aimx-test-alice user.
  • prodperm_assert_no_eacces asserts neither "Permission denied" nor "os error 13" appears in any tool response — that's the regression guard for the bug class.
  • No new CI lanes or infrastructure introduced.

Sprint 5 exit gate

  • cargo fmt -- --check clean (verified locally).
  • cargo clippy --all-targets -- -D warnings clean on both crates (verified locally).
  • cargo test green on default lane (1034 passed, 8 ignored, verified locally).
  • Sudo lane CI green — implementer flagged this as not yet exercised locally (no root + provisioned aimx-test-alice on dev host); first real exercise will be CI. Acceptable risk: the binary compiles, tests are recognized, and the skip-when-not-sudo path runs cleanly under non-root.

Test Coverage

Coverage is appropriate. Three observations on the sudo-lane tests:

  • hook_list_full_cycle_against_root_owned_config asserts the empty list as assert_eq!(text, "[]", ...). This is a tighter assertion than the others and assumes the formatter returns [] exactly (no whitespace, no JSON pretty-print). That matches mcp.rs::hook_list's current serde_json::to_string(&rows) output, but is more brittle than necessary. If a future change pretty-prints hook_list output the test will fail loudly — non-blocker, and arguably a good thing.
  • email_list_full_cycle_against_root_owned_config seeds an .md file with hand-rolled frontmatter (prodperm_seed_email). Several optional fields (in_reply_to, references) are written as empty strings rather than omitted; serde will deserialize these as Some("") rather than None. Not a bug, but a minor inconsistency with how the ingest path writes inbound mail (which omits empty optional fields). The test should still pass.
  • No assertion that the daemon's sent-copy persistence actually wrote a file after email_send. The test only asserts no EACCES surfaces; it does not look for the resulting file in sent/alice/. This is fine for the bug class being guarded (config-load EACCES on the MCP side), but a future regression where the daemon's sent-copy write silently no-ops would not be caught. Non-blocker.

The unit test rename in src/hooks.rs::tests (hook_auth_error_* from format_hook_auth_error_*) keeps the rendering invariants asserted via the local render_for_hooks helper. Good — the wording-regression guard for the pre-flight paths survives the consolidation.

Potential Bugs

None of crash/correctness concern. The implementation is mechanical and the new code paths are straightforward delegation to existing surfaces.

Two minor observations:

  • prodperm_setup_env useradd aimx-catchall discards the result. let _ = StdCommand::new("useradd")...status(); runs and ignores any failure. If the system lacks useradd or the user already exists with conflicting attributes, the test will silently proceed and may fail later in a less-obvious place. The pattern mirrors the rest of the file (CI-only sudo lane), so this is acceptable, but is worth noting if the test ever needs to run on a non-Debian-family host.
  • prodperm_spawn_mcp does not set AIMX_TEST_MAIL_DROP on the spawned MCP child. It does not need to — the env var is read by the daemon, not the MCP process — but the parameter shape (prodperm_setup_env separate from start_serve_with_env) means a future test author might add a transport-affecting env var to the MCP spawn instead of the daemon spawn. Minor maintainability concern only.

Security Issues

None. The PR makes no privilege-related changes:

  • Env var rename is a cosmetic name change with no semantic shift in what's gated.
  • format_auth_error consolidation moves rendering, not authz logic. The authorize() predicate and the is_root() gate are untouched.
  • New tests run only under AIMX_INTEGRATION_SUDO=1 + geteuid() == 0; the quadruple gate (cfg(unix) + #[ignore] + env var + euid check + id <user> check) is appropriately defensive.

Code Quality

  • The four inline AuthErrorContext constructions in src/hooks.rs::create / delete are identical across the create/delete pre-flight paths (modulo the verb string). The previous format_hook_auth_error shim was specifically rejected by the PRD ("delete the helper, route through canonical directly"), so the duplication is intentional. If a fifth call site appears, a tiny private helper at the module level would be reasonable — but as-is, four copies is fine.
  • The render_for_hooks test helper duplicates what the production code does inline. Pulling the production paths through it would defeat the AC, so this is the right tradeoff.
  • prodperm_* helpers are well-scoped to the new tests; naming is consistent.

Alignment with PRD

The PR addresses MP2-3 (adjacent cleanups: env-var rename + format_auth_error consolidation finish) and MP2-5's R44 leg (end-to-end CI smoke). The remaining MP2 work (R38–R42 doc drift, R45 release notes) is correctly deferred to Sprint 6 per the sprint plan dependency chain.

PRD NFR6 (clean break — no deprecation alias for the env-var rename) is honored. PRD §P2.10 PQ3 (extend the existing sudo lane vs. sibling job) — implementer chose to extend, matching the bias.

The flagged wording change on the socket-missing + non-root hook CRUD path is a minor PRD-format_auth_error-consolidation tradeoff: AC R37 says "User-visible wording stays unchanged where reasonable" and explicitly contemplates the consolidation may produce different text. The new wording is consistent with every other NotRoot surface (mailbox_create / mailbox_delete / mailbox_list_force etc.), which is the consolidation's whole point. Routing operators to "start the daemon" is a hint that could move to docs in Sprint 6 (R42 release-notes entry seems a natural home), as the implementer suggested.

Summary and Recommended Actions

Overall verdict: Ready to merge.

Blockers: None.

Non-blockers:

  • Sprint 6 docs (R42 release-notes entry, or book/hooks.md operator-troubleshooting section) should restore the "daemon is down — start it with sudo systemctl start aimx" hint that the previous error message carried, since the new canonical wording ("not authorized: aimx hooks create requires root (run with sudo)") tells the operator to use sudo but no longer mentions the underlying cause. Implementer already flagged this for Sprint 6.

Nice-to-haves:

  • email_list_full_cycle_against_root_owned_config could assert that a sent-copy file exists in sent/alice/ after email_send, to catch a future silent no-op regression.
  • prodperm_seed_email could omit in_reply_to / references (rather than emitting them as empty strings) to match how the ingest path writes inbound mail.
  • The four duplicated AuthErrorContext constructions in src/hooks.rs::create / delete are tolerable but could collapse to a tiny hook_auth_ctx(verb) helper if a fifth call site ever appears.

Recommended merge commit message:

[Sprint 5] Adjacent cleanups + end-to-end CI smoke — env var rename, format_auth_error consolidation, 12-tool prod-perm smoke

Renames `AIMX_TEST_SKIP_ROOT_CHECK` → `AIMX_TEST_SKIP_AUTHZ_CHECK` (clean break, no alias)
to match what the env var actually skips (the `authorize()` call gating
`Action::HookCrud`). Finishes the `format_auth_error` consolidation Sprint 3
started: deletes the `format_hook_auth_error` shim and routes the four
remaining authz-error rendering call sites in `src/hooks.rs` (two pre-flight,
two socket-missing) through the canonical renderer in `auth.rs`. Adds three
sudo-lane integration tests that collectively exercise all 12 MCP tools
against a `chmod 0600 root:root` config + running `aimx serve` (with the MCP
process spawned under `runuser -u aimx-test-alice`), closing the bug class
permanently — wired into the existing `mailbox-dir-perms-isolation` CI lane.

Address PR review feedback for the adjacent-cleanups branch:

- src/hooks.rs: on the socket-missing + non-root hook CRUD path, append
  a one-line "hint: if the daemon is running, hook CRUD over UDS would
  handle this without sudo." after the canonical `format_auth_error`
  rendering. Restores the daemon-down operator context that the
  pre-consolidation wording carried, without widening the canonical
  renderer's surface. New unit test pins the suffix wording.
- tests/integration.rs: `email_list_full_cycle_against_root_owned_config`
  now asserts that `email_send` produced at least one sent-copy `.md`
  under `sent/alice/`, so a future silent no-op in sent-copy persistence
  is caught even when EACCES does not fire.
- tests/integration.rs: `prodperm_seed_email` now omits the optional
  `in_reply_to` / `references` frontmatter fields rather than emitting
  empty strings, matching how the production ingest writer renders
  inbound mail.
@uzyn

uzyn commented May 3, 2026

Copy link
Copy Markdown
Owner Author

Review feedback addressed

Pushed b9554b4 in response to the latest review.

Fixed

  • Non-blocker — daemon-down hint restored on the socket-missing + non-root hook CRUD path. Took the recommended option (b): kept the canonical format_auth_error rendering as-is, then appended a one-line hint: if the daemon is running, hook CRUD over UDS would handle this without sudo. after the canonical first line on the socket-missing branch in both src/hooks.rs::create and src/hooks.rs::delete. Implemented as a tiny private append_daemon_down_hint(String) -> String helper at the call site so auth::format_auth_error and AuthErrorContext stay untouched. New unit test append_daemon_down_hint_preserves_canonical_message_and_appends_hint pins both the canonical-prefix invariant and the hint suffix wording so future drift is caught.
  • Nice-to-have — assert sent-copy file in email_list_full_cycle_against_root_owned_config. After email_send the test now reads sent/alice/ and asserts at least one .md lands there, closing the silent-no-op regression hole the reviewer flagged.
  • Nice-to-have — prodperm_seed_email no longer writes empty optional fields. Dropped the in_reply_to = "" / references = "" lines so the seeded frontmatter matches how the production ingest writer renders inbound mail (omit when empty).

Intentionally left as-is

  • Nice-to-have — AuthErrorContext helper consolidation. The reviewer explicitly said skip unless naturally touching all four call sites, and this fix only touches the two socket-missing sites (the helper wraps the rendered String rather than the AuthErrorContext). Four duplicate AuthErrorContext literals remain — fine until a fifth call site appears.

Verification

  • cargo fmt -- --check clean.
  • cargo clippy --all-targets -- -D warnings clean (both crates).
  • cargo test green: 1142 passed, 0 failed, 41 ignored on the default lane.
  • Banned-token sweep in additions: zero hits.

@uzyn uzyn left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — cycle 2 (commit b9554b4)

All three previously flagged items are resolved cleanly. No new issues introduced.

Resolved

  • Non-blocker — Daemon-down wording lost on socket-missing path — RESOLVED. Implementer applied option (b) verbatim: kept the canonical format_auth_error rendering as the first line, then appended hint: if the daemon is running, hook CRUD over UDS would handle this without sudo. via a tiny private helper append_daemon_down_hint(String) -> String at the call site in src/hooks.rs::create (line 223) and delete (line 328). auth::format_auth_error and AuthErrorContext stay untouched — the canonical renderer's surface is not widened. Verified by grep -rn 'append_daemon_down_hint' src/ tests/: only the two intended call sites, the definition, and the new unit test. New unit test append_daemon_down_hint_preserves_canonical_message_and_appends_hint pins both the canonical-prefix invariant (with_hint.starts_with(&canonical)) and the suffix wording (\nhint:, daemon, UDS substrings) so future drift is caught.
  • Nice-to-have — email_list_full_cycle_against_root_owned_config sent-copy assertion — RESOLVED. After email_send the test now reads sent/alice/ (line 6179–6195) and asserts at least one .md file lands there. A future silent no-op in sent-copy persistence is now caught even when the EACCES guard would not fire.
  • Nice-to-have — prodperm_seed_email empty optional fields — RESOLVED. The in_reply_to = "" / references = "" lines are dropped from the seeded frontmatter (line 6019–6020), matching how the production ingest writer renders inbound mail (omit when empty).

Intentionally left as-is

  • Nice-to-have — AuthErrorContext helper consolidation. The cycle-1 review explicitly said "skip unless naturally touching all four sites" and the daemon-down-hint fix only touches two of the four. Implementer's reasoning is correct: the helper wraps the rendered String, not the AuthErrorContext, so the four duplicate AuthErrorContext literals remain at the call sites — fine until a fifth one appears.

Still unresolved

None.

New issues found

None. The diff is mechanical (two call-site wrappings + one helper + one unit test in src/hooks.rs; one assertion block + two-line frontmatter trim in tests/integration.rs). No regressions, no new code paths.

Verification

  • cargo fmt -- --check: clean.
  • cargo clippy --all-targets -- -D warnings: clean.
  • cargo test default lane: 1143 passed, 0 failed, ignored counts match expectation (sudo-lane tests gated behind AIMX_INTEGRATION_SUDO=1 + geteuid()==0 + aimx-test-alice user existence + #[ignore]).
  • hooks::tests module: 7 passed (was 6 in cycle 1; the new test brings the count to 7).
  • Banned-token sweep on cycle-2 additions outside docs/: zero hits.
  • append_daemon_down_hint call-site audit: exactly the two intended branches, plus the definition and the unit test. No leakage to other surfaces.

Summary

The cycle-2 fixes are surgical and on-target. The daemon-down hint is restored without widening auth::format_auth_error, the sent-copy assertion closes the silent-no-op gap, and the seeded frontmatter now matches production. Clippy/fmt/test all green. Ready to merge.

Recommended merge commit message

[Sprint 5] Adjacent cleanups + end-to-end CI smoke — env var rename, format_auth_error consolidation, 12-tool prod-perm smoke

Renames `AIMX_TEST_SKIP_ROOT_CHECK` → `AIMX_TEST_SKIP_AUTHZ_CHECK` (clean break, no alias)
to match what the env var actually skips (the `authorize()` call gating
`Action::HookCrud`). Finishes the `format_auth_error` consolidation: deletes
the `format_hook_auth_error` shim and routes the four remaining authz-error
rendering call sites in `src/hooks.rs` (two pre-flight, two socket-missing)
through the canonical renderer in `auth.rs`. The socket-missing + non-root
branch appends a one-line daemon-down hint via a tiny private helper so
operators see why sudo is required without widening the canonical renderer's
surface. Adds three sudo-lane integration tests that collectively exercise
all 12 MCP tools against a `chmod 0600 root:root` config + running `aimx
serve` (with the MCP process spawned under `runuser -u aimx-test-alice`),
closing the bug class permanently — wired into the existing
`mailbox-dir-perms-isolation` CI lane. The email smoke now asserts a
sent-copy `.md` lands under `sent/alice/` after `email_send` so a future
silent no-op is caught even when EACCES does not fire.

… $HOME

The new sudo-lane CI step ran `runuser -u aimx-test-alice -- <target/debug/aimx>`
to spawn the MCP child as a non-root uid, but `target/debug/` on most CI
and dev hosts lives beneath a `0o750` `$HOME`. The alice uid (not in the
home owner's group) cannot traverse it, so `runuser` exits with
"Permission denied" before the MCP child ever runs — closing stdout
immediately and surfacing as the cryptic
`Error("EOF while parsing a value", line: 1, column: 0)` panic in
`McpClient::send_request`.

Fix the test setup by:

- Copying the `aimx` binary into the prodperm tempdir (already chmod'd
  `0o755` and rooted under `/tmp`, so always reachable from any uid)
  and pointing `prodperm_spawn_mcp` at that copy.
- Switching `AIMX_TEST_MAIL_DROP` from a directory back to a file path
  — `FileDropTransport::send` opens it with `OpenOptions::append(true)`
  and surfaces `Is a directory (os error 21)` otherwise. This was
  silent in the original failure because the EOF panic blocked the
  test before `email_send` ever ran.

Also wire a stderr drain into `McpClient` so future MCP-subprocess
deaths surface with the captured stderr + child exit status instead of
the opaque `EOF while parsing a value` panic. Every existing McpClient
construction site (six in total) now spawns a background stderr-drain
thread; `send_request` panics with the captured stderr block whenever
the read returns 0 bytes or the JSON parse fails.

Verified locally:

- `cargo test` — 1035 passed, 8 ignored on the default lane.
- `cargo clippy --all-targets -- -D warnings` clean.
- `cargo fmt -- --check` clean.
- `sudo AIMX_INTEGRATION_SUDO=1 <integration-bin> --ignored
  --test-threads=1 mailbox_list_full_cycle_against_root_owned_config
  email_list_full_cycle_against_root_owned_config
  hook_list_full_cycle_against_root_owned_config` — 3 passed.

@uzyn uzyn left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — cycle 3 (commit 4a34bba)

The sudo-lane CI failure flagged after cycle 2's approval is fixed. The fix is test-only (zero src/ changes), surgical, and grounds the original failure mode with a real diagnostic so future regressions surface clearly.

Resolved

  • Sudo-lane CI failure — EOF while parsing a value panic in three new prodperm MCP tests — RESOLVED. Root cause was runuser -u aimx-test-alice -- <target/debug/aimx>: target/debug/ lives under a 0o750 $HOME on CI/dev hosts, so the alice uid (not in the home owner's group) hits Permission denied on path traversal before aimx mcp ever execs. The MCP child stdin/stdout/stderr were already piped, so the child never being spawned looked exactly like an MCP child that crashed mid-handshake — the JSON-RPC reader saw an immediate EOF and panicked with serde_json::Error("EOF while parsing a value", line: 1, column: 0).

    Fix is a copy-the-binary-into-the-tempdir setup change (prodperm_copy_aimx_binary): the tempdir is 0o755 and lives under /tmp (always world-traversable from any uid). prodperm_spawn_mcp now points runuser at <tmp>/aimx instead of target/debug/aimx. Production code untouched.

  • Latent AIMX_TEST_MAIL_DROP directory bug in email_list_full_cycle_against_root_owned_config — RESOLVED. The cycle-2 test pointed AIMX_TEST_MAIL_DROP at a directory (tmp.path().join("mail_drop") + create_dir_all), but FileDropTransport::send opens the path with OpenOptions::create(true).append(true) (src/transport.rs:338-342), which surfaces Is a directory (os error 21). The cycle-2 EOF panic blocked email_send from ever running, so this was hidden. Reverted to a file path (tmp.path().join("outbound.log")), matching the convention in start_serve_with_mail_drop (line 2617) and tests/uds_authz.rs:417.

CI status (commit 4a34bba)

core-tests                          pass  3m14s
mailbox-dir-perms-isolation         pass  2m39s   ← previously failing; now green
verifier-tests                      pass  1m5s
docs-build                          pass  40s
Analyze (rust)                      pass  6m26s
Analyze (actions)                   pass  44s
Analyze (javascript-typescript)     pass  56s
CodeQL                              pass  2s

All eight checks green.

Diagnostic-quality improvements (bonus, also part of 4a34bba)

The implementer wired a background stderr drain into every McpClient construction site (six in total: McpClient::spawn, prodperm_spawn_mcp, plus four inline struct literals at lines 894, 973, 1282, 1360, 3072, 4565). When a future MCP subprocess dies before responding to a JSON-RPC request, send_request now panics with the captured stderr block + child exit status instead of the opaque Error("EOF while parsing a value", line: 1, column: 0). Specifically:

  • Read-line returns Ok(0)panic!("MCP subprocess closed stdout (EOF) before responding to {method:?}{}", self.format_dead_child_diag()).
  • serde_json::from_str parse fails → panic!("MCP returned non-JSON for {method:?}: {e}; raw line {raw:?}{diag}").
  • format_dead_child_diag waits up to 1s for try_wait() to return Some, joins the drain thread, then returns the captured stderr + exit status.
  • shutdown() joins the drain thread after child.wait() so no thread leaks.

This is a real maintainability improvement: the original failure took a careful read of runuser semantics and $HOME mode bits to diagnose. With this in place, the next regression of this class will land with a self-explanatory failure message.

A few small quality notes (all Non-blocker, leaving them as-is):

  • spawn_stderr_drain returns the drain thread handle but read errors are silently swallowed (Ok(0) | Err(_) => break). Acceptable for a test diagnostic — losing one error mid-stream is preferable to drain-thread panics polluting unrelated tests.
  • The drain reads into an unbounded Vec<u8>. No backpressure concern since real MCP children write little stderr before either succeeding or dying. Worst case is a few MB if a subprocess goes haywire; well within test memory budgets.
  • format_dead_child_diag mutates &mut self (takes the drain handle). If a test's send_request panics on EOF, then the panic is caught and the test continues, the diag will not be available a second time. The current call path immediately panics so this is fine; flagging only because the API would be confusing if anyone tried to reuse the McpClient post-failure.

Soundness assessment of the fix

  • Scope: test setup only. No src/ changes, no behavior changes to production code. Zero risk of regressing the cycle-2 approved surface.
  • Correctness of the binary-copy approach: the tempdir is created via TempDir::new() (mkdtemp under $TMPDIR or /tmp) and explicitly chmod'd 0o755 by prodperm_setup_env. Copying a 0o755 binary into it makes it reachable by any uid — confirmed correct.
  • runuser env passthrough: the implementer's comment claims AIMX_CONFIG_DIR / AIMX_RUNTIME_DIR survive runuser's env reset, and the MCP child finds the daemon socket via AIMX_RUNTIME_DIR. This is consistent with util-linux runuser behavior (only HOME/USER/MAIL/LOGNAME are rewritten with runuser -u; other env passes through). The CI green run on the same lane confirms it empirically.
  • Mail-drop file vs. directory: matches the existing convention in start_serve_with_mail_drop (line 2617). FileDropTransport opens with append(true) which fails on a directory — the fix is correct.
  • No regressions: the new stderr-drain plumbing is additive — existing tests get strictly better diagnostics, no new failure modes. The 1142+1035-passing default-lane runs across the cycle confirm no regression.

Still unresolved

None. The cycle-2 issues remain resolved (verified earlier), and the sudo-lane failure is now fixed.

New issues found

None.

Verification

  • Default-lane CI: core-tests pass (3m14s).
  • Sudo-lane CI: mailbox-dir-perms-isolation pass (2m39s) — previously the failing job.
  • Banned-token sweep on cycle-3 additions only: git diff b9554b4..4a34bba | grep '^+' | rg '\bSprint\b|\bS[0-9]+[-.][0-9]+\b|\bFR-[0-9]+[a-z]?\b|User Story|Acceptance criteria' → zero hits.
  • Diff scope: tests/integration.rs only, +185/-9 net. No src/ changes.
  • All 6 McpClient { ... } inline struct literals + spawn() + prodperm_spawn_mcp updated consistently with the new stderr_buf/stderr_drain fields.

Summary

Sound, surgical, test-only fix. The implementer correctly diagnosed a tricky CI-host-specific failure (alice uid + $HOME mode bits + opaque MCP EOF panic), fixed the immediate cause by copying the binary into a world-traversable tempdir, fixed a latent mail-drop directory bug that cycle-2's panic was masking, and improved the test harness so future regressions of this class surface with a real diagnostic instead of an opaque serde panic. CI is green on both lanes. Ready to merge.

Recommended merge commit message

[Sprint 5] Adjacent cleanups + end-to-end CI smoke — env var rename, format_auth_error consolidation, 12-tool prod-perm smoke

Renames `AIMX_TEST_SKIP_ROOT_CHECK` → `AIMX_TEST_SKIP_AUTHZ_CHECK` (clean break, no alias)
to match what the env var actually skips (the `authorize()` call gating
`Action::HookCrud`). Finishes the `format_auth_error` consolidation: deletes
the `format_hook_auth_error` shim and routes the four remaining authz-error
rendering call sites in `src/hooks.rs` (two pre-flight, two socket-missing)
through the canonical renderer in `auth.rs`. The socket-missing + non-root
branch appends a one-line daemon-down hint via a tiny private helper so
operators see why sudo is required without widening the canonical renderer's
surface. Adds three sudo-lane integration tests that collectively exercise
all 12 MCP tools against a `chmod 0600 root:root` config + running `aimx
serve` (with the MCP process spawned under `runuser -u aimx-test-alice`),
closing the bug class permanently — wired into the existing
`mailbox-dir-perms-isolation` CI lane. The email smoke now asserts a
sent-copy `.md` lands under `sent/alice/` after `email_send` so a future
silent no-op is caught even when EACCES does not fire. The MCP test client
gained a stderr-drain so MCP-subprocess deaths surface with the captured
stderr + exit status instead of an opaque `EOF while parsing a value`
panic.

@uzyn uzyn merged commit 379b077 into main May 3, 2026
8 checks passed
@uzyn uzyn deleted the user-mailbox-sprint-5 branch May 3, 2026 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant