Prevent permanent backend unhealthy marking after startup race#4290
Prevent permanent backend unhealthy marking after startup race#4290
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a vMCP health-monitor startup race where backends could be marked unhealthy indefinitely by improving HTTP connection isolation, error classification, and status reporting responsiveness (Fixes #4278).
Changes:
- Clone
http.DefaultTransportfor each backend client creation to isolate connection pools and avoid stale keep-alive connections after pod replacement. - Map mcp-go transport sentinel errors (
ErrUnauthorized,ErrLegacySSEServer) to vMCP sentinel errors inwrapBackendError, and extendIsAuthenticationErrorto match"unauthorized (401)". - Add a 2s DynamicRegistry version poller to trigger immediate status reporting (and backend refresh) when backends are added/removed.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/server/status_reporting.go | Adds DynamicRegistry version polling to trigger immediate status reports on backend registry changes. |
| pkg/vmcp/server/status_reporting_test.go | Adds a unit test ensuring version changes trigger an immediate status report (no need to wait for the main interval). |
| pkg/vmcp/health/checker_test.go | Extends auth-error classification tests to include mcp-go’s "unauthorized (401)" format and wrapped auth sentinel behavior. |
| pkg/vmcp/errors.go | Updates IsAuthenticationError string matching to include "unauthorized (401)". |
| pkg/vmcp/client/client.go | Clones the base HTTP transport per client creation and adds explicit mapping for mcp-go transport sentinel errors in wrapBackendError. |
| pkg/vmcp/client/client_test.go | Adds tests verifying wrapBackendError maps mcp-go transport sentinel errors to the correct vMCP sentinel errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad39e47b36
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4290 +/- ##
==========================================
+ Coverage 68.95% 69.07% +0.11%
==========================================
Files 473 477 +4
Lines 47854 48088 +234
==========================================
+ Hits 33000 33219 +219
- Misses 12266 12286 +20
+ Partials 2588 2583 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
jerm-dro
left a comment
There was a problem hiding this comment.
Thanks for jumping on #4278 quickly — the error mapping fixes and version ticker are solid, and the root cause analysis on the shared DefaultTransport is exactly right.
I noticed the PR description says "cache one *http.Transport per backend (rather than cloning per call)." Do I understand correctly that the alternative considered was cloning per defaultClientFactory call? I'd actually prefer that simpler approach. The cache reintroduces the stale connection problem it's solving — then FlushIdleConnections reactively fixes it on health failure. That's a lot of machinery (ConnectionFlusher interface, flush calls in monitor and health checker, transport mutex, cache map) to manage a problem that wouldn't exist without the cache.
I understand the concern about repeated TLS handshakes from cloning per call, but we don't have evidence that connection setup overhead is a problem at our expected request rates. The newBackendTransport() helper (clone instead of sharing DefaultTransport) is the actual fix — we can add caching later if profiling shows it's needed.
The error mapping and version ticker changes are valid but independent of the transport fix. Could you split this into separate PRs so they can be reviewed on their own merits?
|
Thanks for the detailed review! Agreed on both points:
Working on the simplification now. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…race Health checks could permanently mark backends as unhealthy due to two issues: shared http.DefaultTransport kept stale keep-alive connections to replaced K8s pods returning 4xx indefinitely, and mcp-go sentinel errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by the error classification chain, causing auth failures to surface as generic backend unavailability. - Clone http.DefaultTransport per client factory call to isolate connection pools and avoid stale connections after pod replacement - Map transport.ErrUnauthorized to ErrAuthenticationFailed and transport.ErrLegacySSEServer to ErrBackendUnavailable in wrapBackendError before falling back to string-based detection - Add "unauthorized (401)" pattern to IsAuthenticationError to match mcp-go's ErrUnauthorized string format - Poll DynamicRegistry version every 2s and trigger an immediate status report when backends are added or removed, rather than waiting for the full 30s reporting interval Closes: #4278
Summary
Health checks could permanently mark backends as unhealthy due to a race condition:
http.DefaultTransportwas shared across all backend clients, causing stale keep-alive connections to replaced K8s pods to persist and return 4xx indefinitely.http.DefaultTransportreference withnewBackendTransport(), which clonesDefaultTransport(or constructs an equivalent) perdefaultClientFactorycall. Each client gets its own isolated connection pool, so a replaced pod's stale connections cannot affect requests to other backends or future calls.Fixes #4278
Type of change
Test plan
task test)task test-e2e)task lint-fix)