Skip to content

Prevent permanent backend unhealthy marking after startup race#4290

Merged
yrobla merged 1 commit intomainfrom
issue-4278
Mar 23, 2026
Merged

Prevent permanent backend unhealthy marking after startup race#4290
yrobla merged 1 commit intomainfrom
issue-4278

Conversation

@yrobla
Copy link
Copy Markdown
Contributor

@yrobla yrobla commented Mar 20, 2026

Summary

Health checks could permanently mark backends as unhealthy due to a race condition: http.DefaultTransport was shared across all backend clients, causing stale keep-alive connections to replaced K8s pods to persist and return 4xx indefinitely.

  • Replace the shared http.DefaultTransport reference with newBackendTransport(), which clones DefaultTransport (or constructs an equivalent) per defaultClientFactory call. Each client gets its own isolated connection pool, so a replaced pod's stale connections cannot affect requests to other backends or future calls.

Fixes #4278

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

@yrobla yrobla requested a review from Copilot March 20, 2026 11:20
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 20, 2026
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a vMCP health-monitor startup race where backends could be marked unhealthy indefinitely by improving HTTP connection isolation, error classification, and status reporting responsiveness (Fixes #4278).

Changes:

  • Clone http.DefaultTransport for each backend client creation to isolate connection pools and avoid stale keep-alive connections after pod replacement.
  • Map mcp-go transport sentinel errors (ErrUnauthorized, ErrLegacySSEServer) to vMCP sentinel errors in wrapBackendError, and extend IsAuthenticationError to match "unauthorized (401)".
  • Add a 2s DynamicRegistry version poller to trigger immediate status reporting (and backend refresh) when backends are added/removed.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/vmcp/server/status_reporting.go Adds DynamicRegistry version polling to trigger immediate status reports on backend registry changes.
pkg/vmcp/server/status_reporting_test.go Adds a unit test ensuring version changes trigger an immediate status report (no need to wait for the main interval).
pkg/vmcp/health/checker_test.go Extends auth-error classification tests to include mcp-go’s "unauthorized (401)" format and wrapped auth sentinel behavior.
pkg/vmcp/errors.go Updates IsAuthenticationError string matching to include "unauthorized (401)".
pkg/vmcp/client/client.go Clones the base HTTP transport per client creation and adds explicit mapping for mcp-go transport sentinel errors in wrapBackendError.
pkg/vmcp/client/client_test.go Adds tests verifying wrapBackendError maps mcp-go transport sentinel errors to the correct vMCP sentinel errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad39e47b36

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 26.66667% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.07%. Comparing base (4d4fbe2) to head (224debe).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/client/client.go 26.66% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4290      +/-   ##
==========================================
+ Coverage   68.95%   69.07%   +0.11%     
==========================================
  Files         473      477       +4     
  Lines       47854    48088     +234     
==========================================
+ Hits        33000    33219     +219     
- Misses      12266    12286      +20     
+ Partials     2588     2583       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@yrobla yrobla requested a review from Copilot March 20, 2026 11:50
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@yrobla yrobla changed the title fix(vmcp): prevent permanent backend unhealthy marking after startup race Prevent permanent backend unhealthy marking after startup race Mar 20, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for jumping on #4278 quickly — the error mapping fixes and version ticker are solid, and the root cause analysis on the shared DefaultTransport is exactly right.

I noticed the PR description says "cache one *http.Transport per backend (rather than cloning per call)." Do I understand correctly that the alternative considered was cloning per defaultClientFactory call? I'd actually prefer that simpler approach. The cache reintroduces the stale connection problem it's solving — then FlushIdleConnections reactively fixes it on health failure. That's a lot of machinery (ConnectionFlusher interface, flush calls in monitor and health checker, transport mutex, cache map) to manage a problem that wouldn't exist without the cache.

I understand the concern about repeated TLS handshakes from cloning per call, but we don't have evidence that connection setup overhead is a problem at our expected request rates. The newBackendTransport() helper (clone instead of sharing DefaultTransport) is the actual fix — we can add caching later if profiling shows it's needed.

The error mapping and version ticker changes are valid but independent of the transport fix. Could you split this into separate PRs so they can be reviewed on their own merits?

@yrobla
Copy link
Copy Markdown
Contributor Author

yrobla commented Mar 23, 2026

Thanks for the detailed review!

Agreed on both points:

  1. Transport fix — I'll strip out the cache machinery and use newBackendTransport() directly That's the actual fix and it's sufficient.

  2. Error mapping — I'll move the mcp-go sentinel handling and the \ pattern additions into a follow-up PR so they can be reviewed independently.

Working on the simplification now.

@yrobla yrobla requested a review from Copilot March 23, 2026 10:27
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 23, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla requested a review from Copilot March 23, 2026 10:37
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Mar 23, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…race

Health checks could permanently mark backends as unhealthy due to two
issues: shared http.DefaultTransport kept stale keep-alive connections
to replaced K8s pods returning 4xx indefinitely, and mcp-go sentinel
errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by
the error classification chain, causing auth failures to surface as
generic backend unavailability.

- Clone http.DefaultTransport per client factory call to isolate
  connection pools and avoid stale connections after pod replacement
- Map transport.ErrUnauthorized to ErrAuthenticationFailed and
  transport.ErrLegacySSEServer to ErrBackendUnavailable in
  wrapBackendError before falling back to string-based detection
- Add "unauthorized (401)" pattern to IsAuthenticationError to match
  mcp-go's ErrUnauthorized string format
- Poll DynamicRegistry version every 2s and trigger an immediate
  status report when backends are added or removed, rather than
  waiting for the full 30s reporting interval

Closes: #4278
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Mar 23, 2026
@yrobla yrobla merged commit fadfd8a into main Mar 23, 2026
40 checks passed
@yrobla yrobla deleted the issue-4278 branch March 23, 2026 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vMCP health monitor permanently marks backends unhealthy after startup race

4 participants