fix: prevent duplicate hook fires on slow inbound deliveries#225
Conversation
A single forwarded email was producing multiple replies because slow
on_receive hooks held the SMTP DATA "250 OK" response for their entire
runtime. Remote MTAs (e.g. Gmail) treated the silent SMTP connection as
a delivery failure, retried, and aimx re-ingested + re-fired the hook on
every retry.
Fix in two layers:
1. Async hook execution
- New HookMode::{Sync, Async} on ingest_email.
- SMTP DATA handler now passes HookMode::Async so on_receive hooks
spawn into a detached std::thread and the "250 OK" goes back to
the remote MTA within seconds.
- aimx ingest CLI and all tests keep HookMode::Sync to preserve
end-to-end semantics (process / test must outlive the hook).
2. Message-Id dedup with 24h TTL (defense in depth)
- Per-mailbox SparKV-backed cache lives on a new MailboxState that
replaces the bare lock returned by MailboxLocks::lock_for.
- Lazily rebuilt from disk on the first ingest into a mailbox after
daemon start; entries loaded from frontmatter expire 24h after
their original received_at, so a daemon restart cannot stretch
dedup protection beyond its real window.
- Filename date-prefix pre-filter avoids opening files outside the
24h retention window.
- Duplicate Message-Ids are silently absorbed: no second .md, no
second hook fire, but ingest still returns Ok so the SMTP path
emits 250 OK and the remote MTA stops retrying.
Tests:
- Four new ingest dedup tests cover the cache hit, miss, lazy
disk rebuild, and TTL-expiry paths.
- Existing tests that intentionally re-used a fixture's hard-coded
Message-Id (file_naming_increments, rapid_same_subject_*,
duplicate_attachment_filenames_*, concurrent_same_subject_*,
thread_id_deterministic_for_same_message, and the SMTP
multi-transaction test) now generate per-call unique Message-Ids
via small helpers, matching real-world client behavior.
uzyn
left a comment
There was a problem hiding this comment.
Changes Overview
Two-layer fix for the "single forwarded email → multiple analyst replies" bug. Layer 1: a new HookMode::{Sync, Async} parameter on ingest_email lets the SMTP DATA path detach on_receive hooks onto std::thread::spawn so 250 OK returns in seconds rather than after the hook completes. Layer 2: a per-mailbox SparKV-backed Message-Id dedup cache with 24h TTL, lazily rebuilt from disk on first ingest per mailbox after daemon startup, silently absorbs duplicate inbound deliveries within the retention window. Lock map was refactored from Arc<AsyncMutex<()>> to Arc<MailboxState> so the dedup cache hangs off the same per-mailbox state that already serializes inbound writes.
Scope Alignment
The implementation matches the stated intent. The aimx ingest CLI path explicitly keeps HookMode::Sync (so a short-lived process doesn't exit before its hook), the SMTP DATA path uses HookMode::Async, and all 6 existing call sites of MailboxLocks::lock_for were correctly updated for the new return shape. TLS close_notify and an admin "purge dedup cache" command are explicitly deferred. Out-of-scope test churn (per-call unique Message-Id in fixtures that re-ingest the same email twice) is justified — without it, the new dedup would silently suppress the second ingest and the tests' assertions would no longer measure what they previously measured.
Potential Bugs
1. Auth verification still blocks 250 OK (Non-blocker)
verify_auth (DKIM/SPF/DMARC DNS lookups, including a fresh tokio::runtime::Runtime::new() per ingest) runs at src/ingest.rs ~line 245 before the dedup check and before the hook fork point. A slow recursive resolver or a pathological DKIM record can stall the spawn_blocking worker for many seconds — re-creating the same "remote MTA times out, retries" failure mode the PR is trying to fix, just with a smaller blast radius. The PR's hook-detach addresses the largest synchronous cost (multi-minute hook subprocesses), so this is a meaningful improvement on its own; calling it out so a follow-up isn't surprised when the bug surfaces again at a smaller scale.
2. 24h dedup TTL is shorter than the SMTP retry window (Non-blocker)
DEDUP_TTL = 24h in src/ingest.rs:110. RFC 5321 §4.5.4.1 sets the minimum give-up time at 4-5 days, and Gmail/Postfix defaults are in that range too. A legitimate retry past the 24h mark (rare but possible during prolonged inbound outages) will be treated as a fresh delivery and re-fire on_receive. This is a deliberate tradeoff per the PR description, and worth a one-liner in the constant's doc-comment ("we trade off correctness past 24h for a bounded cache size" or similar) so a future reader understands it's a knob, not an absolute.
3. Empty Message-Id bypasses dedup entirely (Non-blocker)
src/ingest.rs:170 gates dedup on !message_id.is_empty(). Real MTAs always set a Message-Id, but a hand-crafted SMTP client (or a buggy retry path on the sending side) that omits it would slip past dedup every retry and re-fire the hook each time. Worth either (a) falling back to a digest of (date, from, subject, raw[..N]) when Message-Id is missing, or (b) at minimum, an inline comment acknowledging the gap so the next person debugging "why did this hook fire 5×?" finds the answer in the file.
4. read_frontmatter_id_pair only inspects the first 4 KiB (Non-blocker)
src/ingest.rs:357: the disk-rebuild path reads at most 4 KiB. Typical inbound frontmatter is well under 1 KiB, but a long References chain plus many cc recipients (mailing lists) can push past 4 KiB. In that case find("\n+++") fails and read_frontmatter_id_pair returns None, so the entry is silently skipped from the rebuilt cache. The window in which this hurts is narrow (daemon restart + retry of an oversize message within 24h), but worth either bumping to 16 KiB or streaming line-by-line until +++ is found.
5. set_with_ttl failures are silent at the call site, with a real failure mode at scale (Non-blocker)
src/ingest.rs:215: if SparKV returns Err(MaxItemsExceeded) because we hit DEDUP_MAX_ITEMS = 100_000 within the 24h window, the warn-log fires but the Message-Id is not recorded. The very next retry of that message hits a cache miss and re-fires the hook. 100k/24h is ~1.2 inbound messages/second sustained per mailbox — absurd for any realistic aimx deployment, but worth noting because the failure mode silently regresses to "hook re-fires on every retry," which is the exact bug this PR fixes.
6. Catchall mailbox cache rebuild scope (Non-blocker)
When the recipient routes via the *@domain catchall, the dedup cache is keyed on the "catchall" mailbox name (MailboxLocks::lock_for("catchall")) and the disk scan walks <data_dir>/inbox/catchall/. That's correct for catchall-only deployments. For deployments that have both explicit mailboxes and a catchall, a Message-Id legitimately addressed to (say) alice@ and later re-delivered as unknown-recipient@ (different RCPT, same Message-Id) would NOT dedup — the two ingests land in different per-mailbox caches. Likely intentional (different recipients, different mailboxes), but worth a doc comment so it's not a future surprise.
Test Coverage
Strong coverage on the dedup side: ingest_skips_duplicate_message_id, ingest_allows_distinct_message_ids, ingest_dedup_cache_loads_from_disk, and ingest_dedup_cache_excludes_expired exercise the cache hit, miss, disk-rebuild, and TTL-expiry paths respectively. The disk-rebuild test pre-stages a stub .md with hand-written frontmatter rather than going through format_frontmatter — that's a small fragility (if the on-disk frontmatter format ever changes shape, this test won't notice), but acceptable for the cache-load behavior it's actually testing.
Gap on the async-hook side: the most consequential code change (detach the hook into std::thread::spawn so 250 OK returns fast) has zero targeted tests. There's no test that:
- registers a slow
on_receivehook (sleep ≥1s), - ingests a message through the SMTP DATA path,
- asserts
250 OKreturns in well under the hook duration, - and asserts the hook eventually runs.
The existing SMTP multi-transaction test (src/smtp/tests.rs) verifies the response content but not the response timing, so it would pass equally well with HookMode::Sync. This is the bug the PR exists to fix — it deserves a regression test. Non-blocker because the change is small and visually obvious, but adding it would close the loop.
Code Quality
src/ingest.rs:243-279: the async-hook spawn block is ~40 lines inline. Extractingfn spawn_on_receive_async(config, mailbox, md_path, meta)would shorten the match arm and make the failure-log site easier to find. Pure style, non-blocker.- The
Arc<MailboxState>refactor atsrc/mailbox_locks.rsis clean and the doc-comment update is on point. The newseen_message_ids_starts_uninitializedtest is appropriate. parse_toml_string_valueandread_frontmatter_id_pairare sufficient for the fixed-shape frontmatter aimx writes, but they're a hand-rolled subset of TOML. Ifformat_frontmatterever emits a multi-line string for, say, a very longsubject, the line-based parse will silently miss it. Not changed in this PR but worth flagging since this PR is the first reader of frontmatter that doesn't go throughserde/toml.
Summary and Recommended Actions
Overall verdict: Ready to merge with non-blocker follow-ups.
The two-layer fix is well-targeted at the root cause and the implementation is correct. Lock-hierarchy invariants are preserved, the dedup cache state lives behind the same per-mailbox serialization as inbound writes, and the cache insert is correctly ordered inside the lock so a concurrent retry can never see "on disk but not in cache." Existing tests were updated correctly (per-call unique Message-Ids in places that re-ingest the same fixture), and the new dedup tests are meaningful.
- Blockers: none.
- Non-blockers:
- No targeted regression test for the async-hook detach behavior (the primary fix).
verify_authstill synchronously holds the SMTP DATA path while doing DNS — same failure shape as the original bug, smaller window.DEDUP_TTL = 24his shorter than the RFC-allowed retry window; document the tradeoff in the constant's doc-comment.- Empty
Message-Idbypasses dedup; document or fall back to a content digest. - 4 KiB read cap in
read_frontmatter_id_paircan silently miss long-References frontmatter. - Silent
MaxItemsExceededfailure path at >100k entries/24h regresses to the original bug under extreme load.
- Nice-to-haves:
- Extract the async-hook spawn into a helper for readability.
- Add a doc comment on the catchall/explicit-mailbox dedup-isolation behavior.
Cycle 1B follow-ups on the slow-inbound dedup PR. No blocker fixes — the review verdict was "Ready to merge with non-blocker follow-ups." Code changes: - Bump read_frontmatter_id_pair scan window from 4 KiB to 16 KiB so a long References chain plus many cc recipients (common on mailing lists) cannot silently drop a rebuilt cache entry. - Extract the inline async-hook spawn block into spawn_on_receive_async() for readability and a shorter match arm. New tests: - src/smtp/tests.rs::test_smtp_data_returns_fast_with_slow_on_receive_hook: regression test for the primary fix. Registers a 3s on_receive hook, drives an SMTP DATA transaction, asserts 250 OK returns under 1.5s (well below the hook duration) and that the hook eventually fires. Verified to FAIL when HookMode is reverted to Sync. Doc-only changes (no behavior change): - DEDUP_TTL: spell out the RFC-retry-window tradeoff so future readers see the 24h value is a knob, not an absolute. - Dedup gate: inline note that an empty Message-Id bypasses dedup, with a pointer for the why-did-this-hook-fire-N-times debug path. - MaxItemsExceeded warn: explain the regression-to-original-bug shape and that DEDUP_MAX_ITEMS is the lever. - mailbox_locks: document that the dedup cache is per-mailbox by design (different RCPTs land in different caches even on the same Message-Id). - parse_toml_string_value / read_frontmatter_id_pair: flag the hand-rolled TOML-subset limits so a future schema change knows where to look. Deferred per review (each is its own non-blocker): - #1 (verify_auth still blocks 250 OK): out of scope for this PR; its own follow-up. Smaller-blast-radius failure mode than the multi-minute hook this PR fixed.
Cycle 1B: Review feedback addressedPushed Fixed in code
Regression test added
Closes the test-coverage gap the review flagged. Registers a 3-second
Sanity-checked: I temporarily reverted Deferred to a follow-up PR#1 —
Will file as a follow-up issue/PR. Verification
(Three pre-existing |
uzyn
left a comment
There was a problem hiding this comment.
Re-Review (cycle 1B, head 24a2291)
Diff scope: 155aef4..24a2291 — src/ingest.rs (+98/-37), src/mailbox_locks.rs (+13/-0), src/smtp/tests.rs (+156/-0). Build clean, clippy clean, fmt clean, new test passes locally (3.28s).
Resolved (verified against diff)
- #2
DEDUP_TTLdoc —src/ingest.rs:46-52. Doc-comment now spells out the RFC 5321 §4.5.4.1 4–5 day retry ceiling vs. the chosen 24h value and explicitly frames it as a tradeoff. Resolved. - #3 Empty
Message-Idbypass —src/ingest.rs:357-365. Inline comment at the dedup gate documents the gap, points a future debugger at the.eml, and justifies the deliberate no-content-digest-fallback choice ("would mask broken-sender bugs that an operator probably wants to see"). Reasonable call; matches the original review's "(a) or (b)" wording. Resolved as a doc-only acknowledgement. - #4 4 KiB → 16 KiB —
src/ingest.rs:683confirmslet mut buf = [0u8; 16 * 1024];. Doc-commentsrc/ingest.rs:671-688explains the rationale (longReferenceschains + mailing-listcc). Resolved. - #5
MaxItemsExceededsilence —src/ingest.rs:515-524. Inline comment now explains the regression-to-original-bug failure shape at >100k entries/24h and namesDEDUP_MAX_ITEMSas the lever.tracing::warn!retained as the operator signal. Resolved. - #6 Catchall keying —
src/mailbox_locks.rs:49-58. New "Per-mailbox keying, by design" module-doc section covers the alice-vs-catchall scenario and points at the migration path (process-wide cache) if cross-mailbox dedup ever becomes a requirement. Resolved. - Quality: async-spawn helper extraction —
src/ingest.rs:556-598.spawn_on_receive_async(config: &Config, mailbox: &str, md_path: &Path, meta: &InboundFrontmatter)replaces the ~40-line inline block. Clones (config.clone(),mailbox.to_string(),md_path.to_path_buf(),meta.clone()) and themailbox_name_for_logseparate clone for the error path are all preserved; the spawn-failure policy (log only, no fallback to sync) is intact. Equivalent behavior to the prior inline code. Resolved. - Quality: TOML parser limits doc-comment —
parse_toml_string_value(src/ingest.rs:741-749) andread_frontmatter_id_pair(src/ingest.rs:684-695) both call out the hand-rolled-TOML-subset constraints and tell the next reader to "swap this for a realtomlcrate call rather than expanding the regex-in-disguise" iffrontmatter.rsever grows multi-line strings. Resolved. - Test-coverage gap: async-hook detach regression —
src/smtp/tests.rs:1626-1690test_smtp_data_returns_fast_with_slow_on_receive_hook. Verified independently:- Registers a 3s sleep hook on
alice(/bin/sh -c "sleep 3 && touch <marker>"). - Drives the full SMTP DATA path: EHLO → MAIL FROM → RCPT TO → DATA → body →
.terminator → QUIT. Not a directingest_emailcall. - Asserts (1)
250 OKafter dot-terminator returns in <1500ms (HOOK_SLEEP_SECS * 1000 / 2) and (2) the marker file eventually appears withinHOOK_SLEEP_SECS + 10seconds. (2) is what catches the "spawn silently dropped the closure" regression that (1) alone would miss. - Sets
AIMX_SANDBOX_FORCE_FALLBACK=1viaunsafe { set_var }so the hook runs without root /systemd-run. The var is process-wide butsrc/hook.rs:665already sets it the same way, so this doesn't introduce a new pollution pattern. - Sync would block ≥3000ms (implementer reports 3119ms with
HookMode::Sync). The 1500ms vs 3000ms separation is an order-of-magnitude gap, not a tight margin — robust against CI noise. - Belt-and-braces
assert_eq!(md_files.len(), 1)at the end pins that ingestion itself still completed. - Local run: passes in 3.28s. Resolved.
- Registers a 3s sleep hook on
Deferred (accepted)
- #1
verify_authstill blocks250 OK— kept out of scope per the implementer's top-level comment. I agree with the deferral. The synchronoustokio::runtime::Runtime::new()plus DKIM/SPF/DMARC DNS lookups atsrc/ingest.rs:420-465still gate250 OKfor several seconds on a slow recursive resolver, but the failure radius is DNS-bounded, not user-script-bounded — meaningfully different from the multi-minute hook this PR fixed. A proper fix wants its own focused PR (decide betweenspawn_blocking+ hard timeout vs. full async detach, both with retention of trust-evaluation semantics for thefrontmatter.trustedfield). Track as a follow-up issue.
Still unresolved
None.
New issues introduced by the cycle 1B changes
None.
- The
spawn_on_receive_asyncextraction is a faithful refactor — the only structural change ismailbox: &strtaking ownership viato_string()instead ofmailbox.clone()on aStringreference, which is semantically identical. All clones in the closure (config_owned,mailbox_name,mailbox_name_for_log,md_path_owned,meta_owned) are correctly owned andmoved. - The
unsafe { set_var }block in the new test is unavoidable on Rust 2024 edition; the safety comment is accurate. - The 16 KiB read cap is a strict superset of the prior 4 KiB behavior — no regression risk.
Scope creep check
None. Diff is exactly the 7 prior non-blockers + the requested regression test. No unrelated refactors.
Banned-token check (per CLAUDE.local.md)
PR-touched files (src/ingest.rs, src/mailbox_locks.rs, src/smtp/tests.rs, src/smtp/session.rs, src/hook_handler.rs, src/mailbox_handler.rs, src/state_handler.rs) introduce zero hits for \bSprint\b, \bS\d+[-.]\d+\b, \bFR-\d+[a-z]?\b, "User Story", "Acceptance criteria". (Pre-existing tokens in src/mailbox_handler.rs lines 691/1585/1767 are untouched by this PR and out of scope.)
Summary and verdict
Overall verdict: Ready to merge.
All six prior non-blockers are addressed in code or in doc-comments exactly as the review asked. The deferred #1 is justified, scoped, and the implementer has committed to a follow-up. The async-hook regression test is well-designed: it drives the full SMTP DATA path (not just ingest_email), asserts both timing AND eventual hook execution, has a 2× margin so CI noise can't flake it, and the implementer's sync-mode sanity check (3119ms vs 1500ms) confirms it actually fails under regression. No new bugs, no scope creep, no banned tokens, no test-quality concerns.
- Blockers: none.
- Non-blockers: none.
- Nice-to-haves: none (the follow-up for
verify_authis already tracked in the implementer's comment).
Recommended merge commit message
fix: prevent duplicate hook fires on slow inbound deliveries
Two-layer fix for the "single forwarded email triggers multiple hook
fires" bug.
Layer 1 (root cause): new HookMode::{Sync, Async} parameter on
ingest_email. The SMTP DATA handler passes Async, which detaches
on_receive hooks onto std::thread::spawn so 250 OK returns in seconds
rather than after the hook completes. The aimx ingest CLI keeps Sync so
the short-lived process doesn't exit before its hook.
Layer 2 (defense in depth): per-mailbox in-memory Message-Id cache
(sparkv-backed, 24h TTL, auto-eviction). Lazily rebuilt from on-disk
frontmatter on first ingest per mailbox after daemon startup. Filename
date-prefix pre-filter keeps the rebuild cheap. Duplicate Message-Ids
within 24h are silently absorbed: no second .md, no second hook fire,
but ingest still returns Ok so the SMTP path emits 250 OK and the
remote MTA stops retrying.
Tests: 4 new dedup unit tests (cache hit, distinct ids allowed, lazy
disk rebuild, expired entries excluded) plus a new SMTP-path regression
test (test_smtp_data_returns_fast_with_slow_on_receive_hook) that pins
the timing contract by registering a 3s hook and asserting 250 OK
returns in well under that duration. Existing tests that re-ingested
the same fixture twice were updated to emit per-call unique Message-Ids.
Summary
on_receivehooks ran synchronously insideingest_email, holding the SMTP250 OKreply hostage for the entire hook duration (~7 min in the reported case). Gmail's MTA timed out around 5–7 min, considered delivery failed, retried — and aimx had noMessage-Iddedup, so each retry ingested the message again and re-fired the hook.HookMode::{Sync, Async}parameter oningest_email. The SMTP DATA handler passesAsync, which detaches hooks ontostd::thread::spawnso250 OKreturns in seconds. Theaimx ingestCLI and all existing tests keepSync.Message-Idcache (backed bysparkvwith 24h TTL + auto-eviction). Lazily rebuilt from on-disk frontmatter on first ingest after daemon startup. Entries loaded from disk inherit24h − (now − received_at)as their TTL, so a daemon restart cannot stretch dedup protection past its real lifetime. Filename date-prefix pre-filter avoids opening files outside the 24h window. Duplicate Message-Ids are silently absorbed (no second.md, no second hook fire) butingest_emailstill returnsOkso the SMTP path emits250 OKand the remote MTA stops retrying.Files changed
Cargo.tomlsparkv0.1 depsrc/mailbox_locks.rslock_fornow returnsArc<MailboxState>(lock + dedup cache)src/ingest.rsHookModeenum, dedup check + cache insert, async hook spawn branch, disk-scan helpers, 4 new testssrc/smtp/session.rsHookMode::Asyncsrc/state_handler.rs,src/hook_handler.rs,src/mailbox_handler.rssrc/smtp/tests.rsMessage-IdsTest plan
cargo buildcleancargo clippy -- -D warningscleancargo fmt -- --checkcleancargo test --bin aimx— 1145 pass; 3 macOS-only pre-existinguds_authzfailures (Linux-only project per CLAUDE.md)cargo test --test integration— 116/116 passreceived email+ singlefiring hook_name=...log line, and a single reply lands in inboxOut of scope
close_notifygraceful shutdown insrc/smtp/mod.rs— Gmail's "peer closed without close_notify" is a downstream symptom of the timeout. Once250 OKlands within seconds (this PR), Gmail closes cleanly. If it persists after deploy, a follow-up to calltokio_rustls::shutdown().awaitbefore the TCP shutdown is straightforward.