fix(core): coalesce MCP server rediscovery by cyphercodes · Pull Request #3818 · QwenLM/qwen-code

cyphercodes · 2026-05-04T04:17:09Z

Summary

What changed: Overlapping MCP server rediscovery requests for the same server now share the in-flight restart instead of starting independent replacement clients; rediscovery also stops the old health-check timer before replacing a server connection.
Why it changed: Concurrent reconnect/restart paths could each try to replace the same server connection and leave an extra MCP process running without a tracked client.
Reviewer focus: Please check that same-server rediscovery is coalesced while normal single-server replacement behavior remains unchanged.

Validation

Commands run:

npm ci --ignore-scripts
cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts
cd ../..
npm run build --workspace=packages/core
npm run typecheck
npx prettier --check packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
npx eslint packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts
git diff --check HEAD~1..HEAD

Prompts / inputs used: N/A — internal lifecycle regression coverage.
Expected result: Concurrent same-server rediscovery should perform one replacement flow and create one replacement client.
Observed result: The focused core Vitest file passes with 8 tests, including the new concurrent rediscovery regression. Build, typecheck, format, lint, and whitespace checks pass locally.

Quickest reviewer verification path:

cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts

Evidence: Before the fix, the new regression failed because two concurrent rediscovery calls both tried to disconnect/replace the same client. After the fix, the regression and existing manager tests pass.

Scope / Risk

Main risk or tradeoff: A duplicate concurrent request now waits for the in-flight rediscovery instead of forcing an immediate second restart. This is intended to avoid duplicate MCP processes.
Not covered / not validated: No full end-to-end subprocess demo; validation is focused on the manager lifecycle behavior.
Breaking changes / migration notes: None.

Testing Matrix

	🍏	🪟	🐧
npm run	⚠️	⚠️	✅
npx	⚠️	⚠️	✅
Docker	N/A	N/A	N/A
Podman	N/A	N/A	N/A
Seatbelt	N/A	N/A	N/A

Testing matrix notes:

Local validation was run on Linux only. Docker/Podman/Seatbelt were not applicable to this internal lifecycle fix.

Linked Issues / Bugs

Fixes #3817

wenshao · 2026-05-04T05:00:08Z

+
+    expect(disconnectCallsBeforeResolve).toBe(1);
+    expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2);
+    expect(replacementClients).toHaveLength(1);


[Suggestion] The concurrent coalesce test doesn't verify that the serverDiscoveryPromises map is cleaned up after completion. If the finally block cleanup were ever broken by a future refactor, the existing test wouldn't catch it — a stale resolved promise would silently block all subsequent discovery for that server.

Consider adding a third call after the concurrent pair resolves, asserting it creates a fresh client rather than being coalesced into a stale promise:

Suggested change

expect(replacementClients).toHaveLength(1);

resolveDisconnect();

await Promise.all([firstRediscovery, secondRediscovery]);

expect(disconnectCallsBeforeResolve).toBe(1);

expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2);

expect(replacementClients).toHaveLength(1);

expect(replacementClients[0].connect).toHaveBeenCalledOnce();

expect(replacementClients[0].discover).toHaveBeenCalledOnce();

// Verify map was cleaned up: a third call should do real work,

// not get coalesced into a stale promise.

const thirdRediscovery = manager.discoverMcpToolsForServer(

'test-server',

{} as unknown as Config,

);

await thirdRediscovery;

expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(3);

expect(replacementClients).toHaveLength(2);

expect(replacementClients[1].connect).toHaveBeenCalledOnce();

expect(replacementClients[1].discover).toHaveBeenCalledOnce();

— deepseek-v4-pro via Qwen Code /review

cyphercodes · 2026-05-04T05:19:35Z

Implemented the follow-up test coverage: after the concurrent rediscovery pair completes, the test now performs a third rediscovery and asserts it creates/connects/discovers a fresh MCP client, which would fail if the discovery promise map kept a stale resolved entry.

Verification (from repo root):

npm ci --ignore-scripts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
npx prettier --check packages/core/src/tools/mcp-client-manager.test.ts
npx eslint packages/core/src/tools/mcp-client-manager.test.ts
(cd packages/core && npm run typecheck)
git diff --check

wenshao

Review

Summary

Verified the fix locally: reverted mcp-client-manager.ts to main, the new regression test fails with expected 2 to be 1 at the disconnectCallsBeforeResolve assertion. After applying the patch, all 8 manager tests pass. The race fix lands as advertised.

Approach

Single-flight via Map<string, Promise<void>> keyed by serverName is the right pattern here — granularity allows cross-server work to remain parallel, and the pointer-equality guard in the finally block (get(...) === discoveryPromise) avoids accidentally deleting a successor's promise. The public/private split is clean.

Issues

1. (medium) Undocumented addition of stopHealthCheck(serverName) at the start of discoverMcpToolsForServerInternal

mcp-client-manager.ts:185 (post-patch) adds this.stopHealthCheck(serverName); before the existing-client cleanup. This is a separate, useful bug fix — previously, if rediscovery failed during connect(), the old health timer would keep ticking against a deleted client until the next failure pushed it into reconnectServer. After this change, the timer is stopped immediately.

Worth either calling it out in the PR description or splitting into its own commit; otherwise future git blame on this line won't tell the full story.

2. (minor) Promise.resolve().then(...) wrapper is unnecessary

const discoveryPromise = Promise.resolve().then(() =>
  this.discoverMcpToolsForServerInternal(serverName, cliConfig),
);
this.serverDiscoveryPromises.set(serverName, discoveryPromise);

discoverMcpToolsForServerInternal is async — calling it directly returns a Promise; the synchronous portion of its body (which contains no recursion back into the public method) runs before the first await. So this is equivalent without the extra microtask hop:

const discoveryPromise = this.discoverMcpToolsForServerInternal(serverName, cliConfig);
this.serverDiscoveryPromises.set(serverName, discoveryPromise);

The current form also makes the "unknown server" no-op path require an extra microtask to settle — a small but unnecessary regression.

3. (minor) stop() does not clear serverDiscoveryPromises

If stop() runs while a discovery is in flight, this.clients.clear() happens, but the in-flight code path will later call this.clients.set(serverName, client), leaking a connected client into a stopped manager. This race exists on main too (so not introduced here), but since this PR is already managing in-flight state, it would be natural to either clear the map in stop() or await outstanding promises before clearing clients. Follow-up is fine, not blocking.

4. (minor) Test mock condition is overly complex

mcp-client-manager.test.ts:248-253:

if (
  replacementClients.length === 0 &&
  vi.mocked(McpClient).mock.calls.length === 1
) {
  return firstClient as unknown as McpClient;
}

mock.calls.length === 1 already uniquely identifies the first construction; the replacementClients.length === 0 clause is redundant. The existing test on line 199 uses the more idiomatic mockReturnValueOnce(first).mockReturnValueOnce(second) chain — would read more cleanly here too.

5. (note, not a bug) Coalesced caller's cliConfig is silently dropped

The second caller's cliConfig argument never reaches client.discover() — the in-flight call uses the originator's config. Today this is fine because cliConfig rarely differs between rediscovery calls, but worth a one-line comment on the internal method to flag the assumption for future readers.

Risk

Behavior change is scoped to the "concurrent rediscovery for the same server" window. discoverAllMcpTools, single-server reconnect (reconnectServer), and cross-server discovery are unaffected.
New test covers the core coalescing scenario plus map cleanup (third call must do real work). Existing 6 tests still pass.
Not covered: rejection path of the in-flight promise (currently internal swallows connect/discover errors, so it doesn't reject in practice) and the stop()-during-discovery race noted in #3.

Recommendation

Direction and implementation are right. Suggest addressing #1 and #2 before merge (mention stopHealthCheck somewhere; drop the Promise.resolve().then wrapper). #3/#4/#5 can be follow-ups.

cyphercodes · 2026-05-05T07:33:28Z

Updated this PR for the review feedback:

Removed the unnecessary Promise.resolve().then(...) wrapper around rediscovery.
Simplified the concurrent rediscovery test mock condition.
Updated the PR summary to explicitly mention stopping the prior health-check timer during rediscovery.

Local verification:

npm ci --ignore-scripts ✅
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts) ✅
npx prettier --check packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts ✅
npx eslint packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts ✅
npm run typecheck --workspace=packages/core ✅
git diff --check ✅

wenshao · 2026-05-05T08:44:30Z

      return;
    }

+    this.stopHealthCheck(serverName);


[Critical] this.stopHealthCheck(serverName) is called unconditionally at the start of discoverMcpToolsForServerInternal, but this.startHealthCheck(serverName) is only called inside the try success block (line 225 in the full file). If connect() or discover() throws, the health check timer is permanently cleared and auto-reconnect is silently disabled for this server.

This is especially dangerous because this path is triggered by reconnectServer (the health check retry loop), which is designed for infinite retry. One transient failure kills all future automatic retries with no log indicating health monitoring was stopped.

Suggested fix: move this.startHealthCheck(serverName) to the finally block so it runs regardless of success or failure:

try { await client.connect(); await client.discover(cliConfig); } catch (error) { debugLogger.error(...); } finally { this.startHealthCheck(serverName); this.eventEmitter?.emit('mcp-client-update', this.clients); }

— deepseek-v4-pro via Qwen Code /review

Confirmed — this is a real regression and needs to be fixed before merge.

Pre-PR, the lingering old health-check timer would still fire and eventually retrigger reconnectServer, providing an (unreliable but present) retry safety net. Post-PR, stopHealthCheck runs unconditionally at the start of discoverMcpToolsForServerInternal, but startHealthCheck is only in the success branch — so any transient connect() or discover() failure leaves the server with no health monitoring and no auto-reconnect, silently regressing the very reliability this PR is meant to improve.

Suggested fix: move startHealthCheck(serverName) out of the try branch into the finally block so it runs regardless of outcome. Please also add a regression test asserting the health-check timer is restored after a failed rediscovery (e.g. mock client.connect() to reject, then assert that the next healthCheckTimers entry exists / a follow-up reconnectServer would fire).

Blocking on this before merge.

wenshao · 2026-05-05T08:44:30Z

  private healthCheckTimers: Map<string, NodeJS.Timeout> = new Map();
  private consecutiveFailures: Map<string, number> = new Map();
  private isReconnecting: Map<string, boolean> = new Map();
+  private serverDiscoveryPromises: Map<string, Promise<void>> = new Map();


[Suggestion] The new serverDiscoveryPromises map is declared here but has inconsistent cleanup coverage across the class. stop() (line 261) clears consecutiveFailures and isReconnecting but not serverDiscoveryPromises. disconnectServer() (line 283) clears consecutiveFailures and isReconnecting for the server but not its entry in serverDiscoveryPromises. After a stop/restart cycle or server disconnect, stale promise entries could cause subsequent discoverMcpToolsForServer calls to await an already-resolved old promise and silently skip discovery.

Suggested fix:

In stop(): add this.serverDiscoveryPromises.clear();

In disconnectServer(): add this.serverDiscoveryPromises.delete(serverName);

— deepseek-v4-pro via Qwen Code /review

Valid point — stop() and disconnectServer() should both clean up the new serverDiscoveryPromises map for symmetry with the other lifecycle maps.

Not blocking merge (the leak is bounded to in-flight windows and the finally in the wrapper still cleans up once the promise settles), but please add this as a follow-up:

stop(): this.serverDiscoveryPromises.clear();

disconnectServer()'s finally: this.serverDiscoveryPromises.delete(serverName);

A small test for the stop()-during-discovery race would be nice.

Either fold it into the same fix as #2, or open a follow-up PR — either way works.

wenshao

No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review

cyphercodes · 2026-05-05T09:59:30Z

Updated this PR to address the blocking MCP rediscovery feedback:

Restores the server health-check timer in finally after single-server rediscovery, so transient connect() / discover() failures do not permanently disable auto-reconnect.
Clears serverDiscoveryPromises during stop() / disconnectServer() cleanup.
Added regression coverage for failed rediscovery health-check restoration and stop-time discovery cleanup.

Local verification:

npm ci --ignore-scripts ✅
cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts ✅
npx prettier --check packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts ✅
npx eslint packages/core/src/tools/mcp-client-manager.ts packages/core/src/tools/mcp-client-manager.test.ts ✅
cd packages/core && npm run typecheck ✅
git diff --check ✅

wenshao · 2026-05-05T10:30:15Z

Local verification (break-fix loop)

Reproduced each bug at the version that introduced it, then confirmed the regression tests genuinely catch them:

Production code	Failing regression test	Failure mode
`main` (pre-PR)	`should coalesce concurrent discovery for the same server`	`expected 2 to be 1` at `disconnectCallsBeforeResolve` — concurrent rediscovery each invoked `disconnect()` independently, matching the leaked-subprocess symptom from #3817
`bde105803` (first fix attempt)	`should restore health checks after failed server rediscovery`	`expected false to be true` — `connect()` rejection left `healthCheckTimers` empty, silently disabling auto-reconnect
`dd73312f7` (HEAD)	—	10/10 manager tests pass

Reproduction recipe:

# Bug repro on main
git checkout main -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
# → 2 failed (coalesce, in-flight cleanup)

# bde105803 regression repro
git checkout bde105803 -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts -t "should restore health checks")
# → 1 failed (health-check restoration)

# Final verification
git checkout HEAD -- packages/core/src/tools/mcp-client-manager.ts
(cd packages/core && npx vitest run src/tools/mcp-client-manager.test.ts)
# → 10 passed

Full validation on `dd73312f7`

npm ci --ignore-scripts ✅ 1215 packages
Focused vitest on mcp-client-manager.test.ts ✅ 10/10
Full core vitest ✅ 261 files / 6604 passed / 4 skipped (no regressions)
npm run typecheck ✅ across core, sdk, webui
npx prettier --check on changed files ✅
npx eslint on changed files ✅
git diff --check main...HEAD ✅

Direction, implementation, and tests all check out. Merging.

wenshao

Verified locally on dd73312f7: deepseek-v4-pro's #2 (startHealthCheck in finally) and #3 (serverDiscoveryPromises cleanup in stop() / disconnectServer()) are both addressed, with regression coverage for each. Break-fix loop in the comment above demonstrates the new tests genuinely catch the bugs they were written for, and the full core suite (6604 tests) is green.

Approving.

* fix(core): coalesce MCP server rediscovery * test(core): assert MCP rediscovery cleanup * fix(core): address MCP rediscovery review feedback * fix(core): preserve MCP rediscovery health checks --------- Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>

fix(core): coalesce MCP server rediscovery

0dbb947

wenshao reviewed May 4, 2026

View reviewed changes

test(core): assert MCP rediscovery cleanup

f2e19a3

wenshao reviewed May 5, 2026

View reviewed changes

fix(core): address MCP rediscovery review feedback

bde1058

wenshao requested changes May 5, 2026

View reviewed changes

wenshao previously approved these changes May 5, 2026

View reviewed changes

fix(core): preserve MCP rediscovery health checks

dd73312

cyphercodes dismissed wenshao’s stale review via dd73312 May 5, 2026 09:59

wenshao approved these changes May 5, 2026

View reviewed changes

wenshao merged commit ec51fd3 into QwenLM:main May 5, 2026
13 checks passed

This was referenced May 5, 2026

fix(core): prevent duplicate MCP processes from concurrent discovery #3819

Closed

Daemon mode (qwen serve): proposal & open decisions #3803

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): coalesce MCP server rediscovery#3818

fix(core): coalesce MCP server rediscovery#3818
wenshao merged 4 commits into
QwenLM:mainfrom
cyphercodes:fix/mcp-client-discovery-race

cyphercodes commented May 4, 2026 •

edited

Loading

Uh oh!

wenshao May 4, 2026

Uh oh!

cyphercodes commented May 4, 2026

Uh oh!

wenshao left a comment

Uh oh!

cyphercodes commented May 5, 2026

Uh oh!

wenshao May 5, 2026

Uh oh!

wenshao May 5, 2026

Uh oh!

wenshao May 5, 2026

Uh oh!

wenshao May 5, 2026

Uh oh!

wenshao left a comment

Uh oh!

cyphercodes commented May 5, 2026

Uh oh!

wenshao commented May 5, 2026

Uh oh!

wenshao left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    expect(replacementClients).toHaveLength(1);
+    resolveDisconnect();
+    await Promise.all([firstRediscovery, secondRediscovery]);
+    expect(disconnectCallsBeforeResolve).toBe(1);
+    expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(2);
+    expect(replacementClients).toHaveLength(1);
+    expect(replacementClients[0].connect).toHaveBeenCalledOnce();
+    expect(replacementClients[0].discover).toHaveBeenCalledOnce();
+    // Verify map was cleaned up: a third call should do real work,
+    // not get coalesced into a stale promise.
+    const thirdRediscovery = manager.discoverMcpToolsForServer(
+      'test-server',
+      {} as unknown as Config,
+    );
+    await thirdRediscovery;
+    expect(vi.mocked(McpClient)).toHaveBeenCalledTimes(3);
+    expect(replacementClients).toHaveLength(2);
+    expect(replacementClients[1].connect).toHaveBeenCalledOnce();
+    expect(replacementClients[1].discover).toHaveBeenCalledOnce();

Conversation

cyphercodes commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Scope / Risk

Testing Matrix

Linked Issues / Bugs

Uh oh!

wenshao May 4, 2026

Choose a reason for hiding this comment

Uh oh!

cyphercodes commented May 4, 2026

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Review

Summary

Approach

Issues

Risk

Recommendation

Uh oh!

cyphercodes commented May 5, 2026

Uh oh!

wenshao May 5, 2026

Choose a reason for hiding this comment

Uh oh!

wenshao May 5, 2026

Choose a reason for hiding this comment

Uh oh!

wenshao May 5, 2026

Choose a reason for hiding this comment

Uh oh!

wenshao May 5, 2026

Choose a reason for hiding this comment

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

cyphercodes commented May 5, 2026

Uh oh!

wenshao commented May 5, 2026

Local verification (break-fix loop)

Full validation on dd73312f7

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cyphercodes commented May 4, 2026 •

edited

Loading

Full validation on `dd73312f7`