Skip to content

fix: prevent duplicate hook fires on slow inbound deliveries#225

Merged
uzyn merged 2 commits into
mainfrom
worktree-multi-bug-fix
May 11, 2026
Merged

fix: prevent duplicate hook fires on slow inbound deliveries#225
uzyn merged 2 commits into
mainfrom
worktree-multi-bug-fix

Conversation

@uzyn

@uzyn uzyn commented May 11, 2026

Copy link
Copy Markdown
Owner

Summary

  • Fixes the bug where a single forwarded email triggered multiple hooks. Root cause: on_receive hooks ran synchronously inside ingest_email, holding the SMTP 250 OK reply hostage for the entire hook duration (~7 min in the reported case). Gmail's MTA timed out around 5–7 min, considered delivery failed, retried — and aimx had no Message-Id dedup, so each retry ingested the message again and re-fired the hook.
  • Layer 1 (root cause): new HookMode::{Sync, Async} parameter on ingest_email. The SMTP DATA handler passes Async, which detaches hooks onto std::thread::spawn so 250 OK returns in seconds. The aimx ingest CLI and all existing tests keep Sync.
  • Layer 2 (defense in depth): per-mailbox in-memory Message-Id cache (backed by sparkv with 24h TTL + auto-eviction). Lazily rebuilt from on-disk frontmatter on first ingest after daemon startup. Entries loaded from disk inherit 24h − (now − received_at) as their TTL, so a daemon restart cannot stretch dedup protection past its real lifetime. Filename date-prefix pre-filter avoids opening files outside the 24h window. Duplicate Message-Ids are silently absorbed (no second .md, no second hook fire) but ingest_email still returns Ok so the SMTP path emits 250 OK and the remote MTA stops retrying.

Files changed

File Change
Cargo.toml + sparkv 0.1 dep
src/mailbox_locks.rs lock_for now returns Arc<MailboxState> (lock + dedup cache)
src/ingest.rs HookMode enum, dedup check + cache insert, async hook spawn branch, disk-scan helpers, 4 new tests
src/smtp/session.rs DATA handler passes HookMode::Async
src/state_handler.rs, src/hook_handler.rs, src/mailbox_handler.rs Updated lock-acquisition call sites for the new state shape
src/smtp/tests.rs Multi-transaction test now generates per-call unique Message-Ids

Test plan

  • cargo build clean
  • cargo clippy -- -D warnings clean
  • cargo fmt -- --check clean
  • cargo test --bin aimx — 1145 pass; 3 macOS-only pre-existing uds_authz failures (Linux-only project per CLAUDE.md)
  • cargo test --test integration — 116/116 pass
  • 4 new dedup tests cover: in-memory cache hit, distinct Message-Ids allowed, lazy disk rebuild, expired entries excluded
  • Existing tests that re-ingested the same fixture twice updated to generate per-call unique Message-Ids — matching real-world client behavior
  • Manual end-to-end on staging: forward a slow-hook email from Gmail, confirm single received email + single firing hook_name=... log line, and a single reply lands in inbox

Out of scope

  • TLS close_notify graceful shutdown in src/smtp/mod.rs — Gmail's "peer closed without close_notify" is a downstream symptom of the timeout. Once 250 OK lands within seconds (this PR), Gmail closes cleanly. If it persists after deploy, a follow-up to call tokio_rustls::shutdown().await before the TCP shutdown is straightforward.
  • Admin command to manually purge a Message-Id from the dedup cache (defer until requested).

A single forwarded email was producing multiple replies because slow
on_receive hooks held the SMTP DATA "250 OK" response for their entire
runtime. Remote MTAs (e.g. Gmail) treated the silent SMTP connection as
a delivery failure, retried, and aimx re-ingested + re-fired the hook on
every retry.

Fix in two layers:

1. Async hook execution
   - New HookMode::{Sync, Async} on ingest_email.
   - SMTP DATA handler now passes HookMode::Async so on_receive hooks
     spawn into a detached std::thread and the "250 OK" goes back to
     the remote MTA within seconds.
   - aimx ingest CLI and all tests keep HookMode::Sync to preserve
     end-to-end semantics (process / test must outlive the hook).

2. Message-Id dedup with 24h TTL (defense in depth)
   - Per-mailbox SparKV-backed cache lives on a new MailboxState that
     replaces the bare lock returned by MailboxLocks::lock_for.
   - Lazily rebuilt from disk on the first ingest into a mailbox after
     daemon start; entries loaded from frontmatter expire 24h after
     their original received_at, so a daemon restart cannot stretch
     dedup protection beyond its real window.
   - Filename date-prefix pre-filter avoids opening files outside the
     24h retention window.
   - Duplicate Message-Ids are silently absorbed: no second .md, no
     second hook fire, but ingest still returns Ok so the SMTP path
     emits 250 OK and the remote MTA stops retrying.

Tests:
  - Four new ingest dedup tests cover the cache hit, miss, lazy
    disk rebuild, and TTL-expiry paths.
  - Existing tests that intentionally re-used a fixture's hard-coded
    Message-Id (file_naming_increments, rapid_same_subject_*,
    duplicate_attachment_filenames_*, concurrent_same_subject_*,
    thread_id_deterministic_for_same_message, and the SMTP
    multi-transaction test) now generate per-call unique Message-Ids
    via small helpers, matching real-world client behavior.

@uzyn uzyn left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes Overview

Two-layer fix for the "single forwarded email → multiple analyst replies" bug. Layer 1: a new HookMode::{Sync, Async} parameter on ingest_email lets the SMTP DATA path detach on_receive hooks onto std::thread::spawn so 250 OK returns in seconds rather than after the hook completes. Layer 2: a per-mailbox SparKV-backed Message-Id dedup cache with 24h TTL, lazily rebuilt from disk on first ingest per mailbox after daemon startup, silently absorbs duplicate inbound deliveries within the retention window. Lock map was refactored from Arc<AsyncMutex<()>> to Arc<MailboxState> so the dedup cache hangs off the same per-mailbox state that already serializes inbound writes.

Scope Alignment

The implementation matches the stated intent. The aimx ingest CLI path explicitly keeps HookMode::Sync (so a short-lived process doesn't exit before its hook), the SMTP DATA path uses HookMode::Async, and all 6 existing call sites of MailboxLocks::lock_for were correctly updated for the new return shape. TLS close_notify and an admin "purge dedup cache" command are explicitly deferred. Out-of-scope test churn (per-call unique Message-Id in fixtures that re-ingest the same email twice) is justified — without it, the new dedup would silently suppress the second ingest and the tests' assertions would no longer measure what they previously measured.

Potential Bugs

1. Auth verification still blocks 250 OK (Non-blocker)

verify_auth (DKIM/SPF/DMARC DNS lookups, including a fresh tokio::runtime::Runtime::new() per ingest) runs at src/ingest.rs ~line 245 before the dedup check and before the hook fork point. A slow recursive resolver or a pathological DKIM record can stall the spawn_blocking worker for many seconds — re-creating the same "remote MTA times out, retries" failure mode the PR is trying to fix, just with a smaller blast radius. The PR's hook-detach addresses the largest synchronous cost (multi-minute hook subprocesses), so this is a meaningful improvement on its own; calling it out so a follow-up isn't surprised when the bug surfaces again at a smaller scale.

2. 24h dedup TTL is shorter than the SMTP retry window (Non-blocker)

DEDUP_TTL = 24h in src/ingest.rs:110. RFC 5321 §4.5.4.1 sets the minimum give-up time at 4-5 days, and Gmail/Postfix defaults are in that range too. A legitimate retry past the 24h mark (rare but possible during prolonged inbound outages) will be treated as a fresh delivery and re-fire on_receive. This is a deliberate tradeoff per the PR description, and worth a one-liner in the constant's doc-comment ("we trade off correctness past 24h for a bounded cache size" or similar) so a future reader understands it's a knob, not an absolute.

3. Empty Message-Id bypasses dedup entirely (Non-blocker)

src/ingest.rs:170 gates dedup on !message_id.is_empty(). Real MTAs always set a Message-Id, but a hand-crafted SMTP client (or a buggy retry path on the sending side) that omits it would slip past dedup every retry and re-fire the hook each time. Worth either (a) falling back to a digest of (date, from, subject, raw[..N]) when Message-Id is missing, or (b) at minimum, an inline comment acknowledging the gap so the next person debugging "why did this hook fire 5×?" finds the answer in the file.

4. read_frontmatter_id_pair only inspects the first 4 KiB (Non-blocker)

src/ingest.rs:357: the disk-rebuild path reads at most 4 KiB. Typical inbound frontmatter is well under 1 KiB, but a long References chain plus many cc recipients (mailing lists) can push past 4 KiB. In that case find("\n+++") fails and read_frontmatter_id_pair returns None, so the entry is silently skipped from the rebuilt cache. The window in which this hurts is narrow (daemon restart + retry of an oversize message within 24h), but worth either bumping to 16 KiB or streaming line-by-line until +++ is found.

5. set_with_ttl failures are silent at the call site, with a real failure mode at scale (Non-blocker)

src/ingest.rs:215: if SparKV returns Err(MaxItemsExceeded) because we hit DEDUP_MAX_ITEMS = 100_000 within the 24h window, the warn-log fires but the Message-Id is not recorded. The very next retry of that message hits a cache miss and re-fires the hook. 100k/24h is ~1.2 inbound messages/second sustained per mailbox — absurd for any realistic aimx deployment, but worth noting because the failure mode silently regresses to "hook re-fires on every retry," which is the exact bug this PR fixes.

6. Catchall mailbox cache rebuild scope (Non-blocker)

When the recipient routes via the *@domain catchall, the dedup cache is keyed on the "catchall" mailbox name (MailboxLocks::lock_for("catchall")) and the disk scan walks <data_dir>/inbox/catchall/. That's correct for catchall-only deployments. For deployments that have both explicit mailboxes and a catchall, a Message-Id legitimately addressed to (say) alice@ and later re-delivered as unknown-recipient@ (different RCPT, same Message-Id) would NOT dedup — the two ingests land in different per-mailbox caches. Likely intentional (different recipients, different mailboxes), but worth a doc comment so it's not a future surprise.

Test Coverage

Strong coverage on the dedup side: ingest_skips_duplicate_message_id, ingest_allows_distinct_message_ids, ingest_dedup_cache_loads_from_disk, and ingest_dedup_cache_excludes_expired exercise the cache hit, miss, disk-rebuild, and TTL-expiry paths respectively. The disk-rebuild test pre-stages a stub .md with hand-written frontmatter rather than going through format_frontmatter — that's a small fragility (if the on-disk frontmatter format ever changes shape, this test won't notice), but acceptable for the cache-load behavior it's actually testing.

Gap on the async-hook side: the most consequential code change (detach the hook into std::thread::spawn so 250 OK returns fast) has zero targeted tests. There's no test that:

  • registers a slow on_receive hook (sleep ≥1s),
  • ingests a message through the SMTP DATA path,
  • asserts 250 OK returns in well under the hook duration,
  • and asserts the hook eventually runs.

The existing SMTP multi-transaction test (src/smtp/tests.rs) verifies the response content but not the response timing, so it would pass equally well with HookMode::Sync. This is the bug the PR exists to fix — it deserves a regression test. Non-blocker because the change is small and visually obvious, but adding it would close the loop.

Code Quality

  • src/ingest.rs:243-279: the async-hook spawn block is ~40 lines inline. Extracting fn spawn_on_receive_async(config, mailbox, md_path, meta) would shorten the match arm and make the failure-log site easier to find. Pure style, non-blocker.
  • The Arc<MailboxState> refactor at src/mailbox_locks.rs is clean and the doc-comment update is on point. The new seen_message_ids_starts_uninitialized test is appropriate.
  • parse_toml_string_value and read_frontmatter_id_pair are sufficient for the fixed-shape frontmatter aimx writes, but they're a hand-rolled subset of TOML. If format_frontmatter ever emits a multi-line string for, say, a very long subject, the line-based parse will silently miss it. Not changed in this PR but worth flagging since this PR is the first reader of frontmatter that doesn't go through serde/toml.

Summary and Recommended Actions

Overall verdict: Ready to merge with non-blocker follow-ups.

The two-layer fix is well-targeted at the root cause and the implementation is correct. Lock-hierarchy invariants are preserved, the dedup cache state lives behind the same per-mailbox serialization as inbound writes, and the cache insert is correctly ordered inside the lock so a concurrent retry can never see "on disk but not in cache." Existing tests were updated correctly (per-call unique Message-Ids in places that re-ingest the same fixture), and the new dedup tests are meaningful.

  • Blockers: none.
  • Non-blockers:
    • No targeted regression test for the async-hook detach behavior (the primary fix).
    • verify_auth still synchronously holds the SMTP DATA path while doing DNS — same failure shape as the original bug, smaller window.
    • DEDUP_TTL = 24h is shorter than the RFC-allowed retry window; document the tradeoff in the constant's doc-comment.
    • Empty Message-Id bypasses dedup; document or fall back to a content digest.
    • 4 KiB read cap in read_frontmatter_id_pair can silently miss long-References frontmatter.
    • Silent MaxItemsExceeded failure path at >100k entries/24h regresses to the original bug under extreme load.
  • Nice-to-haves:
    • Extract the async-hook spawn into a helper for readability.
    • Add a doc comment on the catchall/explicit-mailbox dedup-isolation behavior.

Cycle 1B follow-ups on the slow-inbound dedup PR. No blocker fixes — the
review verdict was "Ready to merge with non-blocker follow-ups."

Code changes:
- Bump read_frontmatter_id_pair scan window from 4 KiB to 16 KiB so a
  long References chain plus many cc recipients (common on mailing
  lists) cannot silently drop a rebuilt cache entry.
- Extract the inline async-hook spawn block into
  spawn_on_receive_async() for readability and a shorter match arm.

New tests:
- src/smtp/tests.rs::test_smtp_data_returns_fast_with_slow_on_receive_hook:
  regression test for the primary fix. Registers a 3s on_receive hook,
  drives an SMTP DATA transaction, asserts 250 OK returns under 1.5s
  (well below the hook duration) and that the hook eventually fires.
  Verified to FAIL when HookMode is reverted to Sync.

Doc-only changes (no behavior change):
- DEDUP_TTL: spell out the RFC-retry-window tradeoff so future readers
  see the 24h value is a knob, not an absolute.
- Dedup gate: inline note that an empty Message-Id bypasses dedup, with
  a pointer for the why-did-this-hook-fire-N-times debug path.
- MaxItemsExceeded warn: explain the regression-to-original-bug shape
  and that DEDUP_MAX_ITEMS is the lever.
- mailbox_locks: document that the dedup cache is per-mailbox by design
  (different RCPTs land in different caches even on the same Message-Id).
- parse_toml_string_value / read_frontmatter_id_pair: flag the
  hand-rolled TOML-subset limits so a future schema change knows where
  to look.

Deferred per review (each is its own non-blocker):
- #1 (verify_auth still blocks 250 OK): out of scope for this PR; its
  own follow-up. Smaller-blast-radius failure mode than the multi-minute
  hook this PR fixed.
@uzyn

uzyn commented May 11, 2026

Copy link
Copy Markdown
Owner Author

Cycle 1B: Review feedback addressed

Pushed 24a2291 addressing the non-blocker review items. Verdict was already "Ready to merge with non-blocker follow-ups"; this cycle closes the loop on most items in-place and explicitly defers the one larger follow-up.

Fixed in code

# Item What changed
2 DEDUP_TTL doc Expanded doc-comment in src/ingest.rs calling out the RFC 5321 §4.5.4.1 retry window vs the chosen 24h value as an explicit tradeoff.
3 Empty Message-Id bypass Inline comment at the dedup gate (src/ingest.rs) acknowledging the gap and pointing the next debugger at the source .eml. Did not add a content-digest fallback — the review itself notes "real MTAs always set a Message-Id," and synthesizing a digest would mask broken-sender bugs that an operator probably wants to see.
4 4 KiB read cap Bumped to 16 KiB in read_frontmatter_id_pair (src/ingest.rs). Doc-comment now explains why (long References + many cc on mailing lists).
5 MaxItemsExceeded silence Added a clarifying comment at the call site (src/ingest.rs) explaining that at >100k entries/24h the failure regresses to the original bug, and pointing at DEDUP_MAX_ITEMS as the knob. The existing tracing::warn! is the operator signal — no code-path change.
6 Catchall scope doc Added a "Per-mailbox keying, by design" section to the MailboxLocks module doc (src/mailbox_locks.rs) covering the alice-vs-catchall scenario.
Quality Spawn-helper extraction Pulled the ~40-line async-hook spawn block out of the match arm into spawn_on_receive_async(config, mailbox, md_path, meta) in src/ingest.rs.
Quality TOML parser limits Added a doc-comment on parse_toml_string_value (and expanded one on read_frontmatter_id_pair) flagging the hand-rolled subset so a future frontmatter.rs schema change knows where to look.

Regression test added

src/smtp/tests.rs::test_smtp_data_returns_fast_with_slow_on_receive_hook

Closes the test-coverage gap the review flagged. Registers a 3-second on_receive hook on alice, drives a full SMTP DATA transaction, and asserts:

  1. 250 OK returns in under 1.5s (half the hook's sleep). Proves the hook ran detached, not inline.
  2. The hook's marker file eventually appears. Proves the detached thread actually ran the hook (not silently dropped on spawn failure).
  3. The email is on disk.

Sanity-checked: I temporarily reverted HookMode::Async to HookMode::Sync in src/smtp/session.rs and the new test fails with the expected message ("250 OK must return well under the hook's 3s sleep (took 3119ms, limit 1500ms)…"). Restored to Async before commit.

Deferred to a follow-up PR

#1verify_auth still blocks 250 OK. This is real (DKIM/SPF/DMARC DNS lookups, including a fresh tokio::runtime::Runtime::new() per ingest, all happen inline before the dedup gate and hook fork point). I'm deferring it because:

  • It's a different bug surface from this PR's scope. The PR fixed the largest synchronous cost (multi-minute hooks); auth verification has a smaller blast radius (DNS-bounded, not user-script-bounded).
  • The fix is non-trivial — either move auth verification onto a spawn_blocking worker with a hard timeout, or detach it the same way hooks were. Both deserve their own focused review rather than being tacked onto this PR.

Will file as a follow-up issue/PR.

Verification

  • cargo test --bin aimx -- ingest:: smtp::tests:: — 117 passed (including the new regression test).
  • cargo test --bin aimx -- hook:: mailbox_locks:: — 42 passed.
  • cargo clippy --all-targets -- -D warnings — clean.
  • cargo fmt -- --check — clean.

(Three pre-existing uds_authz::tests::* failures on this macOS host are unrelated to this PR — those tests assume uid 4_294_967_294 is unmapped, but it resolves to nobody on macOS. The project is Linux-only and CI runs on ubuntu-latest where they pass.)

@uzyn uzyn left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review (cycle 1B, head 24a2291)

Diff scope: 155aef4..24a2291src/ingest.rs (+98/-37), src/mailbox_locks.rs (+13/-0), src/smtp/tests.rs (+156/-0). Build clean, clippy clean, fmt clean, new test passes locally (3.28s).

Resolved (verified against diff)

  • #2 DEDUP_TTL docsrc/ingest.rs:46-52. Doc-comment now spells out the RFC 5321 §4.5.4.1 4–5 day retry ceiling vs. the chosen 24h value and explicitly frames it as a tradeoff. Resolved.
  • #3 Empty Message-Id bypasssrc/ingest.rs:357-365. Inline comment at the dedup gate documents the gap, points a future debugger at the .eml, and justifies the deliberate no-content-digest-fallback choice ("would mask broken-sender bugs that an operator probably wants to see"). Reasonable call; matches the original review's "(a) or (b)" wording. Resolved as a doc-only acknowledgement.
  • #4 4 KiB → 16 KiBsrc/ingest.rs:683 confirms let mut buf = [0u8; 16 * 1024];. Doc-comment src/ingest.rs:671-688 explains the rationale (long References chains + mailing-list cc). Resolved.
  • #5 MaxItemsExceeded silencesrc/ingest.rs:515-524. Inline comment now explains the regression-to-original-bug failure shape at >100k entries/24h and names DEDUP_MAX_ITEMS as the lever. tracing::warn! retained as the operator signal. Resolved.
  • #6 Catchall keyingsrc/mailbox_locks.rs:49-58. New "Per-mailbox keying, by design" module-doc section covers the alice-vs-catchall scenario and points at the migration path (process-wide cache) if cross-mailbox dedup ever becomes a requirement. Resolved.
  • Quality: async-spawn helper extractionsrc/ingest.rs:556-598. spawn_on_receive_async(config: &Config, mailbox: &str, md_path: &Path, meta: &InboundFrontmatter) replaces the ~40-line inline block. Clones (config.clone(), mailbox.to_string(), md_path.to_path_buf(), meta.clone()) and the mailbox_name_for_log separate clone for the error path are all preserved; the spawn-failure policy (log only, no fallback to sync) is intact. Equivalent behavior to the prior inline code. Resolved.
  • Quality: TOML parser limits doc-commentparse_toml_string_value (src/ingest.rs:741-749) and read_frontmatter_id_pair (src/ingest.rs:684-695) both call out the hand-rolled-TOML-subset constraints and tell the next reader to "swap this for a real toml crate call rather than expanding the regex-in-disguise" if frontmatter.rs ever grows multi-line strings. Resolved.
  • Test-coverage gap: async-hook detach regressionsrc/smtp/tests.rs:1626-1690 test_smtp_data_returns_fast_with_slow_on_receive_hook. Verified independently:
    • Registers a 3s sleep hook on alice (/bin/sh -c "sleep 3 && touch <marker>").
    • Drives the full SMTP DATA path: EHLO → MAIL FROM → RCPT TO → DATA → body → . terminator → QUIT. Not a direct ingest_email call.
    • Asserts (1) 250 OK after dot-terminator returns in <1500ms (HOOK_SLEEP_SECS * 1000 / 2) and (2) the marker file eventually appears within HOOK_SLEEP_SECS + 10 seconds. (2) is what catches the "spawn silently dropped the closure" regression that (1) alone would miss.
    • Sets AIMX_SANDBOX_FORCE_FALLBACK=1 via unsafe { set_var } so the hook runs without root / systemd-run. The var is process-wide but src/hook.rs:665 already sets it the same way, so this doesn't introduce a new pollution pattern.
    • Sync would block ≥3000ms (implementer reports 3119ms with HookMode::Sync). The 1500ms vs 3000ms separation is an order-of-magnitude gap, not a tight margin — robust against CI noise.
    • Belt-and-braces assert_eq!(md_files.len(), 1) at the end pins that ingestion itself still completed.
    • Local run: passes in 3.28s. Resolved.

Deferred (accepted)

  • #1 verify_auth still blocks 250 OK — kept out of scope per the implementer's top-level comment. I agree with the deferral. The synchronous tokio::runtime::Runtime::new() plus DKIM/SPF/DMARC DNS lookups at src/ingest.rs:420-465 still gate 250 OK for several seconds on a slow recursive resolver, but the failure radius is DNS-bounded, not user-script-bounded — meaningfully different from the multi-minute hook this PR fixed. A proper fix wants its own focused PR (decide between spawn_blocking + hard timeout vs. full async detach, both with retention of trust-evaluation semantics for the frontmatter.trusted field). Track as a follow-up issue.

Still unresolved

None.

New issues introduced by the cycle 1B changes

None.

  • The spawn_on_receive_async extraction is a faithful refactor — the only structural change is mailbox: &str taking ownership via to_string() instead of mailbox.clone() on a String reference, which is semantically identical. All clones in the closure (config_owned, mailbox_name, mailbox_name_for_log, md_path_owned, meta_owned) are correctly owned and moved.
  • The unsafe { set_var } block in the new test is unavoidable on Rust 2024 edition; the safety comment is accurate.
  • The 16 KiB read cap is a strict superset of the prior 4 KiB behavior — no regression risk.

Scope creep check

None. Diff is exactly the 7 prior non-blockers + the requested regression test. No unrelated refactors.

Banned-token check (per CLAUDE.local.md)

PR-touched files (src/ingest.rs, src/mailbox_locks.rs, src/smtp/tests.rs, src/smtp/session.rs, src/hook_handler.rs, src/mailbox_handler.rs, src/state_handler.rs) introduce zero hits for \bSprint\b, \bS\d+[-.]\d+\b, \bFR-\d+[a-z]?\b, "User Story", "Acceptance criteria". (Pre-existing tokens in src/mailbox_handler.rs lines 691/1585/1767 are untouched by this PR and out of scope.)

Summary and verdict

Overall verdict: Ready to merge.

All six prior non-blockers are addressed in code or in doc-comments exactly as the review asked. The deferred #1 is justified, scoped, and the implementer has committed to a follow-up. The async-hook regression test is well-designed: it drives the full SMTP DATA path (not just ingest_email), asserts both timing AND eventual hook execution, has a 2× margin so CI noise can't flake it, and the implementer's sync-mode sanity check (3119ms vs 1500ms) confirms it actually fails under regression. No new bugs, no scope creep, no banned tokens, no test-quality concerns.

  • Blockers: none.
  • Non-blockers: none.
  • Nice-to-haves: none (the follow-up for verify_auth is already tracked in the implementer's comment).

Recommended merge commit message

fix: prevent duplicate hook fires on slow inbound deliveries

Two-layer fix for the "single forwarded email triggers multiple hook
fires" bug.

Layer 1 (root cause): new HookMode::{Sync, Async} parameter on
ingest_email. The SMTP DATA handler passes Async, which detaches
on_receive hooks onto std::thread::spawn so 250 OK returns in seconds
rather than after the hook completes. The aimx ingest CLI keeps Sync so
the short-lived process doesn't exit before its hook.

Layer 2 (defense in depth): per-mailbox in-memory Message-Id cache
(sparkv-backed, 24h TTL, auto-eviction). Lazily rebuilt from on-disk
frontmatter on first ingest per mailbox after daemon startup. Filename
date-prefix pre-filter keeps the rebuild cheap. Duplicate Message-Ids
within 24h are silently absorbed: no second .md, no second hook fire,
but ingest still returns Ok so the SMTP path emits 250 OK and the
remote MTA stops retrying.

Tests: 4 new dedup unit tests (cache hit, distinct ids allowed, lazy
disk rebuild, expired entries excluded) plus a new SMTP-path regression
test (test_smtp_data_returns_fast_with_slow_on_receive_hook) that pins
the timing contract by registering a 3s hook and asserting 250 OK
returns in well under that duration. Existing tests that re-ingested
the same fixture twice were updated to emit per-call unique Message-Ids.

@uzyn uzyn merged commit a100600 into main May 11, 2026
8 checks passed
@uzyn uzyn deleted the worktree-multi-bug-fix branch May 11, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant