fix(core): coalesce MCP server rediscovery#3818
Conversation
|
|
||
| expect(disconnectCallsBeforeResolve).toBe(1); | ||
| expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2); | ||
| expect(replacementClients).toHaveLength(1); |
There was a problem hiding this comment.
[Suggestion] The concurrent coalesce test doesn't verify that the serverDiscoveryPromises map is cleaned up after completion. If the finally block cleanup were ever broken by a future refactor, the existing test wouldn't catch it — a stale resolved promise would silently block all subsequent discovery for that server.
Consider adding a third call after the concurrent pair resolves, asserting it creates a fresh client rather than being coalesced into a stale promise:
| expect(replacementClients).toHaveLength(1); | |
| resolveDisconnect(); | |
| await Promise.all([firstRediscovery, secondRediscovery]); | |
| expect(disconnectCallsBeforeResolve).toBe(1); | |
| expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2); | |
| expect(replacementClients).toHaveLength(1); | |
| expect(replacementClients[0].connect).toHaveBeenCalledOnce(); | |
| expect(replacementClients[0].discover).toHaveBeenCalledOnce(); | |
| // Verify map was cleaned up: a third call should do real work, | |
| // not get coalesced into a stale promise. | |
| const thirdRediscovery = manager.discoverMcpToolsForServer( | |
| 'test-server', | |
| {} as unknown as Config, | |
| ); | |
| await thirdRediscovery; | |
| expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(3); | |
| expect(replacementClients).toHaveLength(2); | |
| expect(replacementClients[1].connect).toHaveBeenCalledOnce(); | |
| expect(replacementClients[1].discover).toHaveBeenCalledOnce(); |
— deepseek-v4-pro via Qwen Code /review
|
Implemented the follow-up test coverage: after the concurrent rediscovery pair completes, the test now performs a third rediscovery and asserts it creates/connects/discovers a fresh MCP client, which would fail if the discovery promise map kept a stale resolved entry. Verification (from repo root): npm ci --ignore-scripts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
npx prettier --check packages/core/src/tools/mcp-client-manager.test.ts
npx eslint packages/core/src/tools/mcp-client-manager.test.ts
(cd packages/core && npm run typecheck)
git diff --check |
wenshao
left a comment
There was a problem hiding this comment.
Review
Summary
Verified the fix locally: reverted mcp-client-manager.ts to main, the new regression test fails with expected 2 to be 1 at the disconnectCallsBeforeResolve assertion. After applying the patch, all 8 manager tests pass. The race fix lands as advertised.
Approach
Single-flight via Map<string, Promise<void>> keyed by serverName is the right pattern here — granularity allows cross-server work to remain parallel, and the pointer-equality guard in the finally block (get(...) === discoveryPromise) avoids accidentally deleting a successor's promise. The public/private split is clean.
Issues
1. (medium) Undocumented addition of stopHealthCheck(serverName) at the start of discoverMcpToolsForServerInternal
mcp-client-manager.ts:185 (post-patch) adds this.stopHealthCheck(serverName); before the existing-client cleanup. This is a separate, useful bug fix — previously, if rediscovery failed during connect(), the old health timer would keep ticking against a deleted client until the next failure pushed it into reconnectServer. After this change, the timer is stopped immediately.
Worth either calling it out in the PR description or splitting into its own commit; otherwise future git blame on this line won't tell the full story.
2. (minor) Promise.resolve().then(...) wrapper is unnecessary
const discoveryPromise = Promise.resolve().then(() =>
this.discoverMcpToolsForServerInternal(serverName, cliConfig),
);
this.serverDiscoveryPromises.set(serverName, discoveryPromise);discoverMcpToolsForServerInternal is async — calling it directly returns a Promise; the synchronous portion of its body (which contains no recursion back into the public method) runs before the first await. So this is equivalent without the extra microtask hop:
const discoveryPromise = this.discoverMcpToolsForServerInternal(serverName, cliConfig);
this.serverDiscoveryPromises.set(serverName, discoveryPromise);The current form also makes the "unknown server" no-op path require an extra microtask to settle — a small but unnecessary regression.
3. (minor) stop() does not clear serverDiscoveryPromises
If stop() runs while a discovery is in flight, this.clients.clear() happens, but the in-flight code path will later call this.clients.set(serverName, client), leaking a connected client into a stopped manager. This race exists on main too (so not introduced here), but since this PR is already managing in-flight state, it would be natural to either clear the map in stop() or await outstanding promises before clearing clients. Follow-up is fine, not blocking.
4. (minor) Test mock condition is overly complex
mcp-client-manager.test.ts:248-253:
if (
replacementClients.length === 0 &&
vi.mocked(McpClient).mock.calls.length === 1
) {
return firstClient as unknown as McpClient;
}mock.calls.length === 1 already uniquely identifies the first construction; the replacementClients.length === 0 clause is redundant. The existing test on line 199 uses the more idiomatic mockReturnValueOnce(first).mockReturnValueOnce(second) chain — would read more cleanly here too.
5. (note, not a bug) Coalesced caller's cliConfig is silently dropped
The second caller's cliConfig argument never reaches client.discover() — the in-flight call uses the originator's config. Today this is fine because cliConfig rarely differs between rediscovery calls, but worth a one-line comment on the internal method to flag the assumption for future readers.
Risk
- Behavior change is scoped to the "concurrent rediscovery for the same server" window.
discoverAllMcpTools, single-server reconnect (reconnectServer), and cross-server discovery are unaffected. - New test covers the core coalescing scenario plus map cleanup (third call must do real work). Existing 6 tests still pass.
- Not covered: rejection path of the in-flight promise (currently
internalswallows connect/discover errors, so it doesn't reject in practice) and thestop()-during-discovery race noted in #3.
Recommendation
Direction and implementation are right. Suggest addressing #1 and #2 before merge (mention stopHealthCheck somewhere; drop the Promise.resolve().then wrapper). #3/#4/#5 can be follow-ups.
|
Updated this PR for the review feedback:
Local verification:
|
| return; | ||
| } | ||
|
|
||
| this.stopHealthCheck(serverName); |
There was a problem hiding this comment.
[Critical] this.stopHealthCheck(serverName) is called unconditionally at the start of discoverMcpToolsForServerInternal, but this.startHealthCheck(serverName) is only called inside the try success block (line 225 in the full file). If connect() or discover() throws, the health check timer is permanently cleared and auto-reconnect is silently disabled for this server.
This is especially dangerous because this path is triggered by reconnectServer (the health check retry loop), which is designed for infinite retry. One transient failure kills all future automatic retries with no log indicating health monitoring was stopped.
Suggested fix: move this.startHealthCheck(serverName) to the finally block so it runs regardless of success or failure:
try {
await client.connect();
await client.discover(cliConfig);
} catch (error) {
debugLogger.error(...);
} finally {
this.startHealthCheck(serverName);
this.eventEmitter?.emit('mcp-client-update', this.clients);
}— deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
Confirmed — this is a real regression and needs to be fixed before merge.
Pre-PR, the lingering old health-check timer would still fire and eventually retrigger reconnectServer, providing an (unreliable but present) retry safety net. Post-PR, stopHealthCheck runs unconditionally at the start of discoverMcpToolsForServerInternal, but startHealthCheck is only in the success branch — so any transient connect() or discover() failure leaves the server with no health monitoring and no auto-reconnect, silently regressing the very reliability this PR is meant to improve.
Suggested fix: move startHealthCheck(serverName) out of the try branch into the finally block so it runs regardless of outcome. Please also add a regression test asserting the health-check timer is restored after a failed rediscovery (e.g. mock client.connect() to reject, then assert that the next healthCheckTimers entry exists / a follow-up reconnectServer would fire).
Blocking on this before merge.
| private healthCheckTimers: Map<string, NodeJS.Timeout> = new Map(); | ||
| private consecutiveFailures: Map<string, number> = new Map(); | ||
| private isReconnecting: Map<string, boolean> = new Map(); | ||
| private serverDiscoveryPromises: Map<string, Promise<void>> = new Map(); |
There was a problem hiding this comment.
[Suggestion] The new serverDiscoveryPromises map is declared here but has inconsistent cleanup coverage across the class. stop() (line 261) clears consecutiveFailures and isReconnecting but not serverDiscoveryPromises. disconnectServer() (line 283) clears consecutiveFailures and isReconnecting for the server but not its entry in serverDiscoveryPromises. After a stop/restart cycle or server disconnect, stale promise entries could cause subsequent discoverMcpToolsForServer calls to await an already-resolved old promise and silently skip discovery.
Suggested fix:
- In
stop(): addthis.serverDiscoveryPromises.clear(); - In
disconnectServer(): addthis.serverDiscoveryPromises.delete(serverName);
— deepseek-v4-pro via Qwen Code /review
There was a problem hiding this comment.
Valid point — stop() and disconnectServer() should both clean up the new serverDiscoveryPromises map for symmetry with the other lifecycle maps.
Not blocking merge (the leak is bounded to in-flight windows and the finally in the wrapper still cleans up once the promise settles), but please add this as a follow-up:
stop():this.serverDiscoveryPromises.clear();disconnectServer()'sfinally:this.serverDiscoveryPromises.delete(serverName);- A small test for the
stop()-during-discovery race would be nice.
Either fold it into the same fix as #2, or open a follow-up PR — either way works.
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review
|
Updated this PR to address the blocking MCP rediscovery feedback:
Local verification:
|
Local verification (break-fix loop)Reproduced each bug at the version that introduced it, then confirmed the regression tests genuinely catch them:
Reproduction recipe: # Bug repro on main
git checkout main -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
# → 2 failed (coalesce, in-flight cleanup)
# bde105803 regression repro
git checkout bde105803 -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts -t "should restore health checks")
# → 1 failed (health-check restoration)
# Final verification
git checkout HEAD -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
# → 10 passedFull validation on
|
wenshao
left a comment
There was a problem hiding this comment.
Verified locally on dd73312f7: deepseek-v4-pro's #2 (startHealthCheck in finally) and #3 (serverDiscoveryPromises cleanup in stop() / disconnectServer()) are both addressed, with regression coverage for each. Break-fix loop in the comment above demonstrates the new tests genuinely catch the bugs they were written for, and the full core suite (6604 tests) is green.
Approving.
* fix(core): coalesce MCP server rediscovery * test(core): assert MCP rediscovery cleanup * fix(core): address MCP rediscovery review feedback * fix(core): preserve MCP rediscovery health checks --------- Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>
* fix(core): coalesce MCP server rediscovery * test(core): assert MCP rediscovery cleanup * fix(core): address MCP rediscovery review feedback * fix(core): preserve MCP rediscovery health checks --------- Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>
Summary
Validation
Scope / Risk
Testing Matrix
Testing matrix notes:
Linked Issues / Bugs
Fixes #3817