Skip to content

Commit 99d18b4

Browse files
zhangxin840claude
andcommitted
fix(feishu): gate onConfirmedConnect on readyState OPEN (P2 nonce-delay fix)
The SDK's reconnect nonce delay can be up to 30 s. During this window `lastConnectTime` is already non-zero (set by the failed initial attempt) and stable, so FEISHU_WS_STABLE_TICKS_REQUIRED (2 ticks ≈ 20 s) is reached before the nonce expires. This triggered `onConfirmedConnect` falsely, setting `hadSuccessfulConnect = true` even though no connection was established — causing boot-time failures to reset `supervisorAttempt` to 0 and preventing exponential backoff escalation. Fix: gate `onConfirmedConnect` on the same `getWsReadyState()` check added in the previous commit. If `readyState === 1` (OPEN) or is not observable (null, for tests/future SDK changes — backward-compatible fallback), fire the callback. If readyState is known but not OPEN (e.g. 0=CONNECTING during nonce wait), defer confirmation until the socket is actually open. `getWsReadyState()` is extracted as a shared helper used by both the confirmed-connect path and the post-reconnect stall-clock clear path. Regression test added: "does not confirm connection during nonce delay" — verifies that readyState CONNECTING blocks onConfirmedConnect and allows supervisorAttempt to increment on subsequent stall. Addresses Codex P2 in #57978 (2026-04-06 review comment on line 267 of monitor.transport.ts). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent b05a832 commit 99d18b4

2 files changed

Lines changed: 112 additions & 31 deletions

File tree

extensions/feishu/src/monitor.cleanup.test.ts

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -324,6 +324,71 @@ describe("feishu websocket cleanup", () => {
324324
vi.useRealTimers();
325325
});
326326

327+
it("does not confirm connection during nonce delay (P2 regression — readyState gate)", async () => {
328+
// Scenario: initial connect attempt fires (lastConnectTime advances once),
329+
// then the SDK enters its reconnect nonce delay (up to 30 s). During that
330+
// window lastConnectTime is stable and non-zero, but the socket is not OPEN
331+
// (readyState = 0 CONNECTING). onConfirmedConnect must NOT fire, so
332+
// hadSuccessfulConnect stays false and supervisorAttempt increments on stall.
333+
vi.useFakeTimers();
334+
335+
const accountId = "alpha";
336+
337+
let connectTime = 0;
338+
// readyState starts at 0 (CONNECTING) — simulates the SDK in nonce delay.
339+
const mockWsInstance = { readyState: 0 /* CONNECTING */ };
340+
const firstClient: MockWsClient = {
341+
start: vi.fn(),
342+
close: vi.fn(),
343+
getReconnectInfo: vi.fn(() => ({ lastConnectTime: connectTime })),
344+
wsConfig: { getWSInstance: vi.fn(() => mockWsInstance) },
345+
};
346+
347+
const abortController = new AbortController();
348+
const secondClient = createWsClient();
349+
secondClient.getReconnectInfo.mockReturnValue({ lastConnectTime: 0 });
350+
351+
createFeishuWSClientMock.mockReturnValueOnce(firstClient).mockReturnValueOnce(secondClient);
352+
353+
const monitorPromise = monitorWebSocket({
354+
account: createAccount(accountId),
355+
accountId,
356+
runtime: { log: vi.fn(), error: vi.fn(), exit: vi.fn() },
357+
abortSignal: abortController.signal,
358+
eventDispatcher: {} as never,
359+
});
360+
361+
// Tick 0: lastConnectTime = 0, no-op.
362+
await vi.advanceTimersByTimeAsync(10_000);
363+
364+
// Initial connect attempt fires — lastConnectTime advances.
365+
connectTime = Date.now();
366+
await vi.advanceTimersByTimeAsync(10_000); // connectChangeCount = 1, stable = 0
367+
368+
// Nonce delay: lastConnectTime stable for 2 ticks, but readyState = CONNECTING.
369+
await vi.advanceTimersByTimeAsync(10_000); // stable = 1
370+
await vi.advanceTimersByTimeAsync(10_000); // stable = 2 → readyState CONNECTING → no confirm
371+
372+
// Nonce expires; SDK fires reconnect attempt.
373+
connectTime = Date.now() + 1;
374+
await vi.advanceTimersByTimeAsync(10_000); // connectChangeCount = 2, staleSinceMs set
375+
376+
// Freeze — no more attempts. Advance 90 s to trip the stall.
377+
await vi.advanceTimersByTimeAsync(90_000);
378+
379+
// onConfirmedConnect did not fire (readyState was CONNECTING, not OPEN),
380+
// so hadSuccessfulConnect = false and supervisorAttempt incremented.
381+
expect(computeBackoffMock).toHaveBeenCalled();
382+
const firstCall = computeBackoffMock.mock.calls[0] as unknown as [unknown, number] | undefined;
383+
expect(firstCall?.[1] ?? 0).toBeGreaterThanOrEqual(1);
384+
385+
abortController.abort();
386+
await vi.runAllTimersAsync();
387+
await monitorPromise;
388+
389+
vi.useRealTimers();
390+
});
391+
327392
it("clears stall clock after successful reconnect — no false eviction (P1 regression)", async () => {
328393
// Scenario: initial connect → reconnect attempt (stall clock starts) →
329394
// reconnect succeeds (WS readyState OPEN + lastConnectTime stable for ≥2 ticks)

extensions/feishu/src/monitor.transport.ts

Lines changed: 47 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -201,6 +201,20 @@ function runFeishuWSClientUntilDead(params: {
201201
}): Promise<"aborted" | "stalled"> {
202202
const { wsClient, eventDispatcher, accountId, log, abortSignal, onConfirmedConnect } = params;
203203

204+
// Read the underlying WebSocket's readyState via the SDK's private wsConfig
205+
// shape. Returns the numeric state (0–3) when available, or null when the
206+
// accessor is absent (e.g. in tests without a full mock, or after a future
207+
// SDK refactor). Callers treat null as "unknown — fall back to stable ticks."
208+
function getWsReadyState(): number | null {
209+
try {
210+
// eslint-disable-next-line @typescript-eslint/no-explicit-any
211+
const wsInstance = (wsClient as any).wsConfig?.getWSInstance?.();
212+
return wsInstance != null ? (wsInstance.readyState as number) : null;
213+
} catch {
214+
return null;
215+
}
216+
}
217+
204218
return new Promise<"aborted" | "stalled">((resolve) => {
205219
if (abortSignal?.aborted) {
206220
resolve("aborted");
@@ -257,14 +271,25 @@ function runFeishuWSClientUntilDead(params: {
257271

258272
// lastConnectTime is unchanged this tick.
259273

260-
// Accumulate stable ticks only when the SDK has actually recorded a
261-
// connect attempt (lastConnectTime non-zero) and the stall clock has not
262-
// yet fired (connection looks healthy on the initial connect path).
274+
// Accumulate stable ticks on the initial-connect path (staleSinceMs null).
275+
// Fire onConfirmedConnect once we have enough stable evidence AND the
276+
// WebSocket is actually open.
277+
//
278+
// The readyState gate prevents a false positive during the SDK's reconnect
279+
// nonce delay (up to 30 s): after a failed initial connect, lastConnectTime
280+
// is already non-zero and stable during the nonce wait, but the socket is
281+
// not OPEN yet. If readyState is not observable (null), we fall back to
282+
// stable-tick-only confirmation to preserve backward compatibility.
263283
if (!confirmedConnectFired && lastConnectTime !== 0 && staleSinceMs === null) {
264284
connectedAndStableCount += 1;
265285
if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
266-
confirmedConnectFired = true;
267-
onConfirmedConnect?.();
286+
const readyState = getWsReadyState();
287+
if (readyState === null || readyState === 1 /* WebSocket.OPEN */) {
288+
confirmedConnectFired = true;
289+
onConfirmedConnect?.();
290+
}
291+
// readyState is known but not OPEN (e.g. CONNECTING during nonce wait) —
292+
// do not fire; check again next tick without resetting the counter.
268293
}
269294
}
270295

@@ -274,38 +299,29 @@ function runFeishuWSClientUntilDead(params: {
274299
}
275300

276301
// A reconnect attempt was recorded (staleSinceMs set). Before declaring
277-
// a stall, check whether the underlying WebSocket is actually open — this
278-
// is the only signal that reliably distinguishes a fast successful
279-
// reconnect (readyState === OPEN → clear stall clock) from SDK retry
280-
// exhaustion (CLOSED/CONNECTING → keep the clock).
302+
// a stall, check whether the underlying WebSocket is actually open — the
303+
// only signal that reliably distinguishes a fast successful reconnect
304+
// (readyState OPEN → clear stall clock) from SDK retry exhaustion
305+
// (CLOSED/CONNECTING → keep the clock running).
281306
//
282307
// We wait for FEISHU_WS_STABLE_TICKS_REQUIRED consecutive stable ticks
283-
// before checking, to avoid a false positive while the SDK is still
284-
// mid-handshake after the reconnect attempt.
285-
//
286-
// wsConfig is accessed via the SDK's runtime shape (not the public type).
287-
// The whole block is guarded so that any future SDK refactor degrades
288-
// gracefully to the existing stall-detection behaviour instead of throwing.
308+
// before checking, to avoid clearing prematurely while a handshake is
309+
// still in progress. If readyState is not observable we leave the stall
310+
// clock running (conservative: prefer a false restart over a silent death).
289311
if (lastConnectTime !== 0) {
290312
connectedAndStableCount += 1;
291313
if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
292-
try {
293-
// eslint-disable-next-line @typescript-eslint/no-explicit-any
294-
const wsInstance = (wsClient as any).wsConfig?.getWSInstance?.();
295-
if (wsInstance != null && wsInstance.readyState === 1 /* WebSocket.OPEN */) {
296-
// WebSocket is open — reconnect confirmed. Clear the stall clock
297-
// so a healthy recovered session is not falsely evicted after 90 s.
298-
log(
299-
`feishu[${accountId}]: WebSocket reconnect confirmed (readyState OPEN); ` +
300-
`clearing stall clock`,
301-
);
302-
staleSinceMs = null;
303-
connectedAndStableCount = 0;
304-
return;
305-
}
306-
} catch {
307-
// wsConfig private API unavailable; fall through to normal stall check.
314+
const readyState = getWsReadyState();
315+
if (readyState === 1 /* WebSocket.OPEN */) {
316+
log(
317+
`feishu[${accountId}]: WebSocket reconnect confirmed (readyState OPEN); ` +
318+
`clearing stall clock`,
319+
);
320+
staleSinceMs = null;
321+
connectedAndStableCount = 0;
322+
return;
308323
}
324+
// readyState is CLOSED/CONNECTING/null — stall clock continues.
309325
}
310326
}
311327

0 commit comments

Comments
 (0)