test(shipper-registry): stabilize concurrent_*_checks macOS flake#237
Conversation
Stabilises `concurrent_version_exists_checks` (and any other test using `with_multi_server`) on slow macOS CI runners. Hit three times in a single rollout session (#233, #234, #236) as `version_exists: registry request failed -> operation timed out` against the local tiny_http mock. Root cause: the helper's accept loop blocked on `handler(req)` until each response was fully written before returning to `recv_timeout`. With 5 concurrent reqwest clients hitting the same loopback socket, the remaining clients sat in the kernel's TCP backlog long enough to exceed reqwest's default OS-level connect timeout. Windows and Linux runners process the queue fast enough to mask the bug; macOS does not. Fix: spawn one worker thread per accepted request and let the accept loop return to `recv_timeout` immediately. The accept loop still serialises on `recv_timeout` (tiny_http requires that), but handlers run in parallel, so the kernel's listen queue drains as fast as connections arrive. Other changes: - `recv_timeout` bumped from 30s to 60s for additional headroom. - Trait bound on the handler closure goes from `Fn + Send + 'static` to `Fn + Send + Sync + 'static` (required to wrap the handler in `Arc` for clone-into-workers). All existing call sites use closures that already satisfy `Sync`. - The accept thread joins worker threads before returning so any panic in a handler surfaces in CI. cargo test -p shipper-registry --lib passes 258/258 locally (Windows). The fix targets a macOS-specific timing bug, so CI is the real verification.
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Droid finished @EffortlessSteven's task —— View job Droid review complete — no actionable findings emitted. Inspected surfaces:
Checks performed:
Why no comments: Residual risk: Validation signal:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32fb1db28e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| } | ||
| for w in workers { | ||
| let _ = w.join(); |
There was a problem hiding this comment.
Propagate worker thread panics
Failure mode: when any with_multi_server handler assertion or respond(...).expect(...) panics, this changed code now drops the JoinHandle error, so the outer handle.join().expect(...) still succeeds and the test can pass after the server worker failed. Why here: before this diff the handler ran on the server thread, so those panics were propagated by the existing joins; after dispatching to workers, let _ = w.join() suppresses them. Fix direction: unwrap/expect each worker join or otherwise re-panic in the server thread. Validation: cargo test -p shipper-registry --lib concurrent_version_exists_checks plus a temporary handler panic would demonstrate propagation. Confidence: high.
Useful? React with 👍 / 👎.
Summary
Fixes a macOS-specific test flake that hit three times in this rollout session (#233, #234, #236):
concurrent_version_exists_checks(and any test usingwith_multi_server) timing out witherror sending request for url ... operation timed out.Root cause
with_multi_server's accept loop blocked onhandler(req)until each response was fully written before returning torecv_timeout. With 5 concurrent reqwest clients hitting the same loopback socket, the remaining clients sat in the kernel's TCP backlog long enough to exceed reqwest's default OS-level connect timeout. Windows and Linux runners process the queue fast enough to mask the bug; macOS does not.Fix
Spawn one worker thread per accepted request and let the accept loop return to
recv_timeoutimmediately. The accept loop still serialises onrecv_timeout(tiny_http requires that), but handlers run in parallel — so the kernel's listen queue drains as fast as connections arrive.Other changes:
recv_timeoutbumped from 30s to 60s for additional headroom.Fn + Send + 'statictoFn + Send + Sync + 'static(required to wrap the handler inArcfor clone-into-workers). All existing call sites use closures that already satisfySync.Test plan
cargo test -p shipper-registry --lib— 258/258 passing locally on Windows.cargo clippy -p shipper-registry --all-targets -- -D warnings— clean.cargo fmt --all -- --check— clean.If CI hits the same flake even after this fix, escalation is to:
concurrent_*_checks(smaller TCP backlog)..connect_timeout(60s)to surface the real timeout source instead of inheriting the OS default.#[cfg_attr(target_os = "macos", ignore)]as a last-resort posture (not recommended).Refs the macOS flake tracked as task #42.