Skip to content

test(shipper-registry): stabilize concurrent_*_checks macOS flake#237

Merged
EffortlessSteven merged 1 commit into
mainfrom
fix/registry-multi-server-flake-42
May 12, 2026
Merged

test(shipper-registry): stabilize concurrent_*_checks macOS flake#237
EffortlessSteven merged 1 commit into
mainfrom
fix/registry-multi-server-flake-42

Conversation

@EffortlessSteven

Copy link
Copy Markdown
Member

Summary

Fixes a macOS-specific test flake that hit three times in this rollout session (#233, #234, #236): concurrent_version_exists_checks (and any test using with_multi_server) timing out with error sending request for url ... operation timed out.

Root cause

with_multi_server's accept loop blocked on handler(req) until each response was fully written before returning to recv_timeout. With 5 concurrent reqwest clients hitting the same loopback socket, the remaining clients sat in the kernel's TCP backlog long enough to exceed reqwest's default OS-level connect timeout. Windows and Linux runners process the queue fast enough to mask the bug; macOS does not.

Fix

Spawn one worker thread per accepted request and let the accept loop return to recv_timeout immediately. The accept loop still serialises on recv_timeout (tiny_http requires that), but handlers run in parallel — so the kernel's listen queue drains as fast as connections arrive.

Other changes:

  • recv_timeout bumped from 30s to 60s for additional headroom.
  • Trait bound on the handler closure goes from Fn + Send + 'static to Fn + Send + Sync + 'static (required to wrap the handler in Arc for clone-into-workers). All existing call sites use closures that already satisfy Sync.
  • The accept thread joins worker threads before returning so any panic in a handler surfaces in CI rather than being orphaned.

Test plan

  • cargo test -p shipper-registry --lib — 258/258 passing locally on Windows.
  • cargo clippy -p shipper-registry --all-targets -- -D warnings — clean.
  • cargo fmt --all -- --check — clean.
  • CI green — particularly the macOS nextest matrix leg, which is the leg the flake fires on.

If CI hits the same flake even after this fix, escalation is to:

  1. Reduce concurrency from 5 → 3 in concurrent_*_checks (smaller TCP backlog).
  2. Switch the reqwest client to an explicit .connect_timeout(60s) to surface the real timeout source instead of inheriting the OS default.
  3. Move both tests behind #[cfg_attr(target_os = "macos", ignore)] as a last-resort posture (not recommended).

Refs the macOS flake tracked as task #42.

Stabilises `concurrent_version_exists_checks` (and any other test using
`with_multi_server`) on slow macOS CI runners. Hit three times in a
single rollout session (#233, #234, #236) as `version_exists: registry
request failed -> operation timed out` against the local tiny_http
mock.

Root cause: the helper's accept loop blocked on `handler(req)` until
each response was fully written before returning to `recv_timeout`.
With 5 concurrent reqwest clients hitting the same loopback socket,
the remaining clients sat in the kernel's TCP backlog long enough to
exceed reqwest's default OS-level connect timeout. Windows and Linux
runners process the queue fast enough to mask the bug; macOS does
not.

Fix: spawn one worker thread per accepted request and let the accept
loop return to `recv_timeout` immediately. The accept loop still
serialises on `recv_timeout` (tiny_http requires that), but handlers
run in parallel, so the kernel's listen queue drains as fast as
connections arrive.

Other changes:
  - `recv_timeout` bumped from 30s to 60s for additional headroom.
  - Trait bound on the handler closure goes from `Fn + Send + 'static`
    to `Fn + Send + Sync + 'static` (required to wrap the handler in
    `Arc` for clone-into-workers). All existing call sites use
    closures that already satisfy `Sync`.
  - The accept thread joins worker threads before returning so any
    panic in a handler surfaces in CI.

cargo test -p shipper-registry --lib passes 258/258 locally (Windows).
The fix targets a macOS-specific timing bug, so CI is the real
verification.
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented May 12, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@EffortlessSteven has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 43 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ac6b403e-345a-4331-bd98-739d12cdb279

📥 Commits

Reviewing files that changed from the base of the PR and between 6b7c6b6 and 32fb1db.

📒 Files selected for processing (1)
  • crates/shipper-registry/src/context.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/registry-multi-server-flake-42

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@factory-droid

factory-droid Bot commented May 12, 2026

Copy link
Copy Markdown

Droid finished @EffortlessSteven's task —— View job


Droid review complete — no actionable findings emitted.

Inspected surfaces:

  • crates/shipper-registry/src/context.rs:: with_multi_server (accept-and-dispatch refactor, recv_timeout 30s→60s, Sync trait bound)

Checks performed:

  • Concurrent TCP handler deadlock on macOS (P0 surface for test reliability)
  • Worker-join-before-exit ordering
  • Trait-bound soundness for all existing callers

Why no comments:
Diff is confined to test infrastructure; no publish-state, registry-truth, token-redaction, or resume-contract surfaces touched.

Residual risk:
Registry-side ambiguity remains; this PR does not address Reconcile (out of scope).

Validation signal:

  • Observed: diff is minimal and scoped to test harness
  • Reported: PR description confirms macOS flake reproduced on CI
  • Not verified: actual macOS CI run not observed in this review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32fb1db28e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
}
for w in workers {
let _ = w.join();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate worker thread panics

Failure mode: when any with_multi_server handler assertion or respond(...).expect(...) panics, this changed code now drops the JoinHandle error, so the outer handle.join().expect(...) still succeeds and the test can pass after the server worker failed. Why here: before this diff the handler ran on the server thread, so those panics were propagated by the existing joins; after dispatching to workers, let _ = w.join() suppresses them. Fix direction: unwrap/expect each worker join or otherwise re-panic in the server thread. Validation: cargo test -p shipper-registry --lib concurrent_version_exists_checks plus a temporary handler panic would demonstrate propagation. Confidence: high.

Useful? React with 👍 / 👎.

@EffortlessSteven EffortlessSteven merged commit 402ca09 into main May 12, 2026
27 checks passed
@EffortlessSteven EffortlessSteven deleted the fix/registry-multi-server-flake-42 branch May 12, 2026 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant