Skip to content

fix(core): coalesce MCP server rediscovery#3818

Merged
wenshao merged 4 commits into
QwenLM:mainfrom
cyphercodes:fix/mcp-client-discovery-race
May 5, 2026
Merged

fix(core): coalesce MCP server rediscovery#3818
wenshao merged 4 commits into
QwenLM:mainfrom
cyphercodes:fix/mcp-client-discovery-race

Conversation

@cyphercodes

@cyphercodes cyphercodes commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

  • What changed: Overlapping MCP server rediscovery requests for the same server now share the in-flight restart instead of starting independent replacement clients; rediscovery also stops the old health-check timer before replacing a server connection.
  • Why it changed: Concurrent reconnect/restart paths could each try to replace the same server connection and leave an extra MCP process running without a tracked client.
  • Reviewer focus: Please check that same-server rediscovery is coalesced while normal single-server replacement behavior remains unchanged.

Validation

  • Commands run:
    npm ci --ignore-scripts
    cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts
    cd ../..
    npm run build --workspace=packages/core
    npm run typecheck
    npx prettier --check packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
    npx eslint packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
    git diff --check HEAD~1..HEAD
  • Prompts / inputs used: N/A — internal lifecycle regression coverage.
  • Expected result: Concurrent same-server rediscovery should perform one replacement flow and create one replacement client.
  • Observed result: The focused core Vitest file passes with 8 tests, including the new concurrent rediscovery regression. Build, typecheck, format, lint, and whitespace checks pass locally.
  • Quickest reviewer verification path:
    cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts
  • Evidence: Before the fix, the new regression failed because two concurrent rediscovery calls both tried to disconnect/replace the same client. After the fix, the regression and existing manager tests pass.

Scope / Risk

  • Main risk or tradeoff: A duplicate concurrent request now waits for the in-flight rediscovery instead of forcing an immediate second restart. This is intended to avoid duplicate MCP processes.
  • Not covered / not validated: No full end-to-end subprocess demo; validation is focused on the manager lifecycle behavior.
  • Breaking changes / migration notes: None.

Testing Matrix

🍏 🪟 🐧
npm run ⚠️ ⚠️
npx ⚠️ ⚠️
Docker N/A N/A N/A
Podman N/A N/A N/A
Seatbelt N/A N/A N/A

Testing matrix notes:

  • Local validation was run on Linux only. Docker/Podman/Seatbelt were not applicable to this internal lifecycle fix.

Linked Issues / Bugs

Fixes #3817


expect(disconnectCallsBeforeResolve).toBe(1);
expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2);
expect(replacementClients).toHaveLength(1);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The concurrent coalesce test doesn't verify that the serverDiscoveryPromises map is cleaned up after completion. If the finally block cleanup were ever broken by a future refactor, the existing test wouldn't catch it — a stale resolved promise would silently block all subsequent discovery for that server.

Consider adding a third call after the concurrent pair resolves, asserting it creates a fresh client rather than being coalesced into a stale promise:

Suggested change
expect(replacementClients).toHaveLength(1);
resolveDisconnect();
await Promise.all([firstRediscovery, secondRediscovery]);
expect(disconnectCallsBeforeResolve).toBe(1);
expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2);
expect(replacementClients).toHaveLength(1);
expect(replacementClients[0].connect).toHaveBeenCalledOnce();
expect(replacementClients[0].discover).toHaveBeenCalledOnce();
// Verify map was cleaned up: a third call should do real work,
// not get coalesced into a stale promise.
const thirdRediscovery = manager.discoverMcpToolsForServer(
'test-server',
{} as unknown as Config,
);
await thirdRediscovery;
expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(3);
expect(replacementClients).toHaveLength(2);
expect(replacementClients[1].connect).toHaveBeenCalledOnce();
expect(replacementClients[1].discover).toHaveBeenCalledOnce();

— deepseek-v4-pro via Qwen Code /review

@cyphercodes

Copy link
Copy Markdown
Contributor Author

Implemented the follow-up test coverage: after the concurrent rediscovery pair completes, the test now performs a third rediscovery and asserts it creates/connects/discovers a fresh MCP client, which would fail if the discovery promise map kept a stale resolved entry.

Verification (from repo root):

npm ci --ignore-scripts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
npx prettier --check packages/core/src/tools/mcp-client-manager.test.ts
npx eslint packages/core/src/tools/mcp-client-manager.test.ts
(cd packages/core && npm run typecheck)
git diff --check

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Summary

Verified the fix locally: reverted mcp-client-manager.ts to main, the new regression test fails with expected 2 to be 1 at the disconnectCallsBeforeResolve assertion. After applying the patch, all 8 manager tests pass. The race fix lands as advertised.

Approach

Single-flight via Map<string, Promise<void>> keyed by serverName is the right pattern here — granularity allows cross-server work to remain parallel, and the pointer-equality guard in the finally block (get(...) === discoveryPromise) avoids accidentally deleting a successor's promise. The public/private split is clean.

Issues

1. (medium) Undocumented addition of stopHealthCheck(serverName) at the start of discoverMcpToolsForServerInternal

mcp-client-manager.ts:185 (post-patch) adds this.stopHealthCheck(serverName); before the existing-client cleanup. This is a separate, useful bug fix — previously, if rediscovery failed during connect(), the old health timer would keep ticking against a deleted client until the next failure pushed it into reconnectServer. After this change, the timer is stopped immediately.

Worth either calling it out in the PR description or splitting into its own commit; otherwise future git blame on this line won't tell the full story.

2. (minor) Promise.resolve().then(...) wrapper is unnecessary

const discoveryPromise = Promise.resolve().then(() =>
  this.discoverMcpToolsForServerInternal(serverName, cliConfig),
);
this.serverDiscoveryPromises.set(serverName, discoveryPromise);

discoverMcpToolsForServerInternal is async — calling it directly returns a Promise; the synchronous portion of its body (which contains no recursion back into the public method) runs before the first await. So this is equivalent without the extra microtask hop:

const discoveryPromise = this.discoverMcpToolsForServerInternal(serverName, cliConfig);
this.serverDiscoveryPromises.set(serverName, discoveryPromise);

The current form also makes the "unknown server" no-op path require an extra microtask to settle — a small but unnecessary regression.

3. (minor) stop() does not clear serverDiscoveryPromises

If stop() runs while a discovery is in flight, this.clients.clear() happens, but the in-flight code path will later call this.clients.set(serverName, client), leaking a connected client into a stopped manager. This race exists on main too (so not introduced here), but since this PR is already managing in-flight state, it would be natural to either clear the map in stop() or await outstanding promises before clearing clients. Follow-up is fine, not blocking.

4. (minor) Test mock condition is overly complex

mcp-client-manager.test.ts:248-253:

if (
  replacementClients.length === 0 &&
  vi.mocked(McpClient).mock.calls.length === 1
) {
  return firstClient as unknown as McpClient;
}

mock.calls.length === 1 already uniquely identifies the first construction; the replacementClients.length === 0 clause is redundant. The existing test on line 199 uses the more idiomatic mockReturnValueOnce(first).mockReturnValueOnce(second) chain — would read more cleanly here too.

5. (note, not a bug) Coalesced caller's cliConfig is silently dropped

The second caller's cliConfig argument never reaches client.discover() — the in-flight call uses the originator's config. Today this is fine because cliConfig rarely differs between rediscovery calls, but worth a one-line comment on the internal method to flag the assumption for future readers.

Risk

  • Behavior change is scoped to the "concurrent rediscovery for the same server" window. discoverAllMcpTools, single-server reconnect (reconnectServer), and cross-server discovery are unaffected.
  • New test covers the core coalescing scenario plus map cleanup (third call must do real work). Existing 6 tests still pass.
  • Not covered: rejection path of the in-flight promise (currently internal swallows connect/discover errors, so it doesn't reject in practice) and the stop()-during-discovery race noted in #3.

Recommendation

Direction and implementation are right. Suggest addressing #1 and #2 before merge (mention stopHealthCheck somewhere; drop the Promise.resolve().then wrapper). #3/#4/#5 can be follow-ups.

@cyphercodes

Copy link
Copy Markdown
Contributor Author

Updated this PR for the review feedback:

  • Removed the unnecessary Promise.resolve().then(...) wrapper around rediscovery.
  • Simplified the concurrent rediscovery test mock condition.
  • Updated the PR summary to explicitly mention stopping the prior health-check timer during rediscovery.

Local verification:

  • npm ci --ignore-scripts
  • (cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
  • npx prettier --check packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
  • npx eslint packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
  • npm run typecheck --workspace=packages/core
  • git diff --check

return;
}

this.stopHealthCheck(serverName);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] this.stopHealthCheck(serverName) is called unconditionally at the start of discoverMcpToolsForServerInternal, but this.startHealthCheck(serverName) is only called inside the try success block (line 225 in the full file). If connect() or discover() throws, the health check timer is permanently cleared and auto-reconnect is silently disabled for this server.

This is especially dangerous because this path is triggered by reconnectServer (the health check retry loop), which is designed for infinite retry. One transient failure kills all future automatic retries with no log indicating health monitoring was stopped.

Suggested fix: move this.startHealthCheck(serverName) to the finally block so it runs regardless of success or failure:

try {
  await client.connect();
  await client.discover(cliConfig);
} catch (error) {
  debugLogger.error(...);
} finally {
  this.startHealthCheck(serverName);
  this.eventEmitter?.emit('mcp-client-update', this.clients);
}

— deepseek-v4-pro via Qwen Code /review

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed — this is a real regression and needs to be fixed before merge.

Pre-PR, the lingering old health-check timer would still fire and eventually retrigger reconnectServer, providing an (unreliable but present) retry safety net. Post-PR, stopHealthCheck runs unconditionally at the start of discoverMcpToolsForServerInternal, but startHealthCheck is only in the success branch — so any transient connect() or discover() failure leaves the server with no health monitoring and no auto-reconnect, silently regressing the very reliability this PR is meant to improve.

Suggested fix: move startHealthCheck(serverName) out of the try branch into the finally block so it runs regardless of outcome. Please also add a regression test asserting the health-check timer is restored after a failed rediscovery (e.g. mock client.connect() to reject, then assert that the next healthCheckTimers entry exists / a follow-up reconnectServer would fire).

Blocking on this before merge.

private healthCheckTimers: Map<string, NodeJS.Timeout> = new Map();
private consecutiveFailures: Map<string, number> = new Map();
private isReconnecting: Map<string, boolean> = new Map();
private serverDiscoveryPromises: Map<string, Promise<void>> = new Map();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The new serverDiscoveryPromises map is declared here but has inconsistent cleanup coverage across the class. stop() (line 261) clears consecutiveFailures and isReconnecting but not serverDiscoveryPromises. disconnectServer() (line 283) clears consecutiveFailures and isReconnecting for the server but not its entry in serverDiscoveryPromises. After a stop/restart cycle or server disconnect, stale promise entries could cause subsequent discoverMcpToolsForServer calls to await an already-resolved old promise and silently skip discovery.

Suggested fix:

  • In stop(): add this.serverDiscoveryPromises.clear();
  • In disconnectServer(): add this.serverDiscoveryPromises.delete(serverName);

— deepseek-v4-pro via Qwen Code /review

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point — stop() and disconnectServer() should both clean up the new serverDiscoveryPromises map for symmetry with the other lifecycle maps.

Not blocking merge (the leak is bounded to in-flight windows and the finally in the wrapper still cleans up once the promise settles), but please add this as a follow-up:

  • stop(): this.serverDiscoveryPromises.clear();
  • disconnectServer()'s finally: this.serverDiscoveryPromises.delete(serverName);
  • A small test for the stop()-during-discovery race would be nice.

Either fold it into the same fix as #2, or open a follow-up PR — either way works.

wenshao
wenshao previously approved these changes May 5, 2026

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review

@cyphercodes

Copy link
Copy Markdown
Contributor Author

Updated this PR to address the blocking MCP rediscovery feedback:

  • Restores the server health-check timer in finally after single-server rediscovery, so transient connect() / discover() failures do not permanently disable auto-reconnect.
  • Clears serverDiscoveryPromises during stop() / disconnectServer() cleanup.
  • Added regression coverage for failed rediscovery health-check restoration and stop-time discovery cleanup.

Local verification:

  • npm ci --ignore-scripts
  • cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts
  • npx prettier --check packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
  • npx eslint packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
  • cd packages/core && npm run typecheck
  • git diff --check

@wenshao

wenshao commented May 5, 2026

Copy link
Copy Markdown
Collaborator

Local verification (break-fix loop)

Reproduced each bug at the version that introduced it, then confirmed the regression tests genuinely catch them:

Production code Failing regression test Failure mode
main (pre-PR) should coalesce concurrent discovery for the same server expected 2 to be 1 at disconnectCallsBeforeResolve — concurrent rediscovery each invoked disconnect() independently, matching the leaked-subprocess symptom from #3817
bde105803 (first fix attempt) should restore health checks after failed server rediscovery expected false to be trueconnect() rejection left healthCheckTimers empty, silently disabling auto-reconnect
dd73312f7 (HEAD) 10/10 manager tests pass

Reproduction recipe:

# Bug repro on main
git checkout main -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
# → 2 failed (coalesce, in-flight cleanup)

# bde105803 regression repro
git checkout bde105803 -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts -t "should restore health checks")
# → 1 failed (health-check restoration)

# Final verification
git checkout HEAD -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
# → 10 passed

Full validation on dd73312f7

  • npm ci --ignore-scripts ✅ 1215 packages
  • Focused vitest on mcp-client-manager.test.ts ✅ 10/10
  • Full core vitest ✅ 261 files / 6604 passed / 4 skipped (no regressions)
  • npm run typecheck ✅ across core, sdk, webui
  • npx prettier --check on changed files ✅
  • npx eslint on changed files ✅
  • git diff --check main...HEAD

Direction, implementation, and tests all check out. Merging.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified locally on dd73312f7: deepseek-v4-pro's #2 (startHealthCheck in finally) and #3 (serverDiscoveryPromises cleanup in stop() / disconnectServer()) are both addressed, with regression coverage for each. Break-fix loop in the comment above demonstrates the new tests genuinely catch the bugs they were written for, and the full core suite (6604 tests) is green.

Approving.

@wenshao wenshao merged commit ec51fd3 into QwenLM:main May 5, 2026
13 checks passed
DragonnZhang pushed a commit that referenced this pull request May 8, 2026
* fix(core): coalesce MCP server rediscovery

* test(core): assert MCP rediscovery cleanup

* fix(core): address MCP rediscovery review feedback

* fix(core): preserve MCP rediscovery health checks

---------

Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>
xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026
* fix(core): coalesce MCP server rediscovery

* test(core): assert MCP rediscovery cleanup

* fix(core): address MCP rediscovery review feedback

* fix(core): preserve MCP rediscovery health checks

---------

Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in McpClientManager creates duplicate MCP processes

2 participants