Prevent permanent backend unhealthy marking after startup race by yrobla · Pull Request #4290 · stacklok/toolhive

yrobla · 2026-03-20T11:19:55Z

Summary

Health checks could permanently mark backends as unhealthy because of a startup race: http.DefaultTransport was shared across all backend clients, so stale keep-alive connections to replaced K8s pods persisted and kept returning 4xx indefinitely.

Replace the shared http.DefaultTransport reference with newBackendTransport(), which clones DefaultTransport (or constructs an equivalent) per defaultClientFactory call. Each client gets its own isolated connection pool, so a replaced pod's stale connections cannot affect requests to other backends or future calls.

Fixes #4278

Type of change

Test plan

Unit tests (task test)
E2E tests (task test-e2e)
Linting (task lint-fix)
Manual testing (describe below)

Copilot

Pull request overview

Fixes a vMCP health-monitor startup race where backends could be marked unhealthy indefinitely by improving HTTP connection isolation, error classification, and status reporting responsiveness (Fixes #4278).

Changes:

Clone http.DefaultTransport for each backend client creation to isolate connection pools and avoid stale keep-alive connections after pod replacement.
Map mcp-go transport sentinel errors (ErrUnauthorized, ErrLegacySSEServer) to vMCP sentinel errors in wrapBackendError, and extend IsAuthenticationError to match "unauthorized (401)".
Add a 2s DynamicRegistry version poller to trigger immediate status reporting (and backend refresh) when backends are added/removed.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
pkg/vmcp/server/status_reporting.go	Adds DynamicRegistry version polling to trigger immediate status reports on backend registry changes.
pkg/vmcp/server/status_reporting_test.go	Adds a unit test ensuring version changes trigger an immediate status report (no need to wait for the main interval).
pkg/vmcp/health/checker_test.go	Extends auth-error classification tests to include mcp-go’s `"unauthorized (401)"` format and wrapped auth sentinel behavior.
pkg/vmcp/errors.go	Updates `IsAuthenticationError` string matching to include `"unauthorized (401)"`.
pkg/vmcp/client/client.go	Clones the base HTTP transport per client creation and adds explicit mapping for mcp-go transport sentinel errors in `wrapBackendError`.
pkg/vmcp/client/client_test.go	Adds tests verifying `wrapBackendError` maps mcp-go transport sentinel errors to the correct vMCP sentinel errors.

pkg/vmcp/client/client.go

pkg/vmcp/server/status_reporting.go

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad39e47b36

pkg/vmcp/server/status_reporting.go

codecov · 2026-03-20T11:36:08Z

Codecov Report

❌ Patch coverage is 26.66667% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.07%. Comparing base (4d4fbe2) to head (224debe).
⚠️ Report is 19 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/vmcp/client/client.go	26.66%	11 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4290      +/-   ##
==========================================
+ Coverage   68.95%   69.07%   +0.11%     
==========================================
  Files         473      477       +4     
  Lines       47854    48088     +234     
==========================================
+ Hits        33000    33219     +219     
- Misses      12266    12286      +20     
+ Partials     2588     2583       -5

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

pkg/vmcp/client/client.go

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

pkg/vmcp/client/client.go

pkg/vmcp/mocks/mock_backend_client.go

pkg/vmcp/health/monitor.go

jerm-dro

Thanks for jumping on #4278 quickly — the error mapping fixes and version ticker are solid, and the root cause analysis on the shared DefaultTransport is exactly right.

I noticed the PR description says "cache one *http.Transport per backend (rather than cloning per call)." Do I understand correctly that the alternative considered was cloning per defaultClientFactory call? I'd actually prefer that simpler approach. The cache reintroduces the stale connection problem it's solving — then FlushIdleConnections reactively fixes it on health failure. That's a lot of machinery (ConnectionFlusher interface, flush calls in monitor and health checker, transport mutex, cache map) to manage a problem that wouldn't exist without the cache.

I understand the concern about repeated TLS handshakes from cloning per call, but we don't have evidence that connection setup overhead is a problem at our expected request rates. The newBackendTransport() helper (clone instead of sharing DefaultTransport) is the actual fix — we can add caching later if profiling shows it's needed.

The error mapping and version ticker changes are valid but independent of the transport fix. Could you split this into separate PRs so they can be reviewed on their own merits?

yrobla · 2026-03-23T10:17:04Z

Thanks for the detailed review!

Agreed on both points:

Transport fix: I'll strip out the cache machinery and use newBackendTransport() directly. That's the actual fix and it's sufficient.
Error mapping: I'll move the mcp-go sentinel handling and the "unauthorized (401)" pattern additions into a follow-up PR so they can be reviewed independently.

Working on the simplification now.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

fixed

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

pkg/vmcp/client/client.go

…race

Health checks could permanently mark backends as unhealthy due to two issues: shared http.DefaultTransport kept stale keep-alive connections to replaced K8s pods returning 4xx indefinitely, and mcp-go sentinel errors (ErrUnauthorized, ErrLegacySSEServer) were not recognized by the error classification chain, causing auth failures to surface as generic backend unavailability.

- Clone http.DefaultTransport per client factory call to isolate connection pools and avoid stale connections after pod replacement
- Map transport.ErrUnauthorized to ErrAuthenticationFailed and transport.ErrLegacySSEServer to ErrBackendUnavailable in wrapBackendError before falling back to string-based detection
- Add "unauthorized (401)" pattern to IsAuthenticationError to match mcp-go's ErrUnauthorized string format
- Poll DynamicRegistry version every 2s and trigger an immediate status report when backends are added or removed, rather than waiting for the full 30s reporting interval

Closes: #4278

yrobla requested review from JAORMX, amirejaz, jerm-dro and jhrozek as code owners March 20, 2026 11:19

yrobla requested a review from Copilot March 20, 2026 11:20

github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 20, 2026

Copilot started reviewing on behalf of yrobla March 20, 2026 11:20

yrobla force-pushed the issue-4278 branch from ad39e47 to 29aca63 March 20, 2026 11:21

github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026

Copilot AI reviewed Mar 20, 2026

pkg/vmcp/client/client.go (outdated, resolved)

pkg/vmcp/server/status_reporting.go (outdated, resolved)

chatgpt-codex-connector bot reviewed Mar 20, 2026

pkg/vmcp/server/status_reporting.go (outdated, resolved)

yrobla force-pushed the issue-4278 branch from 29aca63 to acd1453 March 20, 2026 11:32

github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026

yrobla requested a review from Copilot March 20, 2026 11:36

Copilot started reviewing on behalf of yrobla March 20, 2026 11:36

Copilot AI reviewed Mar 20, 2026

pkg/vmcp/client/client.go (outdated, resolved)

pkg/vmcp/client/client.go (outdated, resolved)

yrobla force-pushed the issue-4278 branch from acd1453 to e5789b3 March 20, 2026 11:46

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 20, 2026

yrobla force-pushed the issue-4278 branch from e5789b3 to ff08727 March 20, 2026 11:47

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026

yrobla force-pushed the issue-4278 branch from ff08727 to f64d636 March 20, 2026 11:50

yrobla requested a review from Copilot March 20, 2026 11:50

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026

Copilot started reviewing on behalf of yrobla March 20, 2026 11:51

Copilot AI reviewed Mar 20, 2026

yrobla force-pushed the issue-4278 branch from f64d636 to 7049642 March 20, 2026 13:18

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026

yrobla changed the title ~~fix(vmcp): prevent permanent backend unhealthy marking after startup race~~ Prevent permanent backend unhealthy marking after startup race Mar 20, 2026

jerm-dro previously requested changes Mar 20, 2026

jerm-dro mentioned this pull request Mar 20, 2026

Add code review assist skill and expand vMCP anti-patterns #4302

Merged

2 tasks

yrobla force-pushed the issue-4278 branch from 7049642 to 79264f7 March 23, 2026 10:26

yrobla requested a review from Copilot March 23, 2026 10:27

github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 23, 2026

Copilot started reviewing on behalf of yrobla March 23, 2026 10:27

Copilot AI reviewed Mar 23, 2026

yrobla force-pushed the issue-4278 branch from 79264f7 to 81e7cc2 March 23, 2026 10:37

yrobla requested a review from Copilot March 23, 2026 10:37

github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Mar 23, 2026

Copilot started reviewing on behalf of yrobla March 23, 2026 10:38

Copilot AI reviewed Mar 23, 2026

pkg/vmcp/client/client.go (resolved)

yrobla force-pushed the issue-4278 branch from 81e7cc2 to 224debe March 23, 2026 10:44

github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Mar 23, 2026

jerm-dro approved these changes Mar 23, 2026

yrobla merged commit fadfd8a into main Mar 23, 2026
40 checks passed

yrobla deleted the issue-4278 branch March 23, 2026 14:32

4 participants
