fix(test): restore abort-and-lifecycle stdin-close test to pre-#3723 version#3777
Conversation
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
Additional findings that could not be mapped to a stable diff anchor:
[Suggestion] secondResultPromise has no timeout or rejection path. If the second SDK result never arrives, this test falls back to the global Vitest timeout instead of failing locally with a useful diagnostic. Please add the same bounded timeout pattern used for the other synchronization promises.
[Suggestion] loop() starts the async consumer but its returned promise is never awaited or caught. If iteration over q throws or exits unexpectedly, the failure can become an unhandled rejection while the main test keeps waiting on side-channel promises. Please keep const loopPromise = loop(); and await or race it with the synchronization promises so iterator errors fail the test directly.
— gpt-5.5 via Qwen Code /review
| const canUseToolCalledPromise = new Promise<void>((resolve, reject) => { | ||
| canUseToolCalledResolve = resolve; | ||
| setTimeout(() => { | ||
| reject(new Error('canUseTool callback not called')); |
There was a problem hiding this comment.
[Critical] The targeted E2E test failed in verification because canUseTool was never called before this new timeout rejected on every retry. That means this change is not verified green and still depends on reaching a model/tool-call path that may not happen in the test environment. Please make the test deterministic enough to reliably reach the intended permission callback path, or adjust the lifecycle assertion so it does not rely on an environment-coupled model tool call.
— gpt-5.5 via Qwen Code /review
| @@ -367,8 +378,10 @@ describe('AbortController and Process Lifecycle (E2E)', () => { | |||
| permissionMode: 'default', | |||
| coreTools: ['read_file', 'write_file'], | |||
| canUseTool: async (toolName, input) => { | |||
There was a problem hiding this comment.
[Critical] This callback resolves the synchronization promises for any tool call, but coreTools includes both read_file and write_file. If the model calls read_file, or calls write_file with an unexpected path/content, the file can remain original content and the test still passes without exercising the intended delayed write_file permission-response path. Please assert toolName === 'write_file' and validate the expected target/content before resolving these promises.
— gpt-5.5 via Qwen Code /review
| await helper.createFile('test.txt', 'original content'); | ||
|
|
||
| let canUseToolCalled = false; | ||
| let canUseToolCalledResolve: () => void = () => {}; |
There was a problem hiding this comment.
[Suggestion] These new timeout handles are not cleared after the promises resolve. On successful runs the real timers can remain live until they fire, adding latency or open-handle noise to the integration test worker. Please store each timeout handle and clear it in the corresponding resolve path, or use a helper that clears on settle.
— gpt-5.5 via Qwen Code /review
| @@ -353,10 +362,12 @@ describe('AbortController and Process Lifecycle (E2E)', () => { | |||
| session_id: sessionId, | |||
| message: { | |||
| role: 'user', | |||
There was a problem hiding this comment.
[Suggestion] The prompt now uses relative test.txt instead of the absolute helper path. If the model/tool resolves that name against a different working directory or target, the helper-managed file can remain unchanged and this assertion can pass without proving the intended file's pending write was rejected after stdin closed. Please restore const testFilePath = await helper.getPath('test.txt') and prompt with that absolute path.
— gpt-5.5 via Qwen Code /review
|
Addressed both review suggestions in
|
…version #3723 rewrote `should handle control responses when stdin closes before replies` in a way that flipped its semantics: - Old: canUseTool sleeps 1s before allowing; asyncGenerator awaits `inputStreamDonePromise` so stdin closes WHILE the control reply is still in flight; expects `original content` (the in-flight tool must NOT execute). Tests CLI robustness when stdin closes before replies — matching the test name. - New: canUseTool returns `allow` immediately; stdin stays open until the second result arrives; expects `updated`. Requires the LLM to actually call write_file → receive tool result → reply 'done'. The test name still says "stdin closes before replies", but it no longer tests that. The new version times out (testTimeout 5min, retry x2 = 900s) on both macOS and Linux on every push since #3723, because it depends on LLM tool-calling behavior that isn't deterministic on the CI endpoint. CI history shows the pre-#3723 version was stable across 30+ runs. This restores only the test file. The shared permissionFlow, coreToolScheduler/Session wiring, and e2e workflow `npm run bundle` step from #3723 are kept intact.
Address review feedback on the restored test: - firstResultPromise / secondResultPromise now have a 30s setTimeout reject path, matching the pattern used by canUseToolCalledPromise and inputStreamDonePromise (15s). Without these, a hang in the result stream falls back to the global Vitest testTimeout (5min) with no useful diagnostic. - loop() is now retained as `loopPromise` and joined into the await chain via `Promise.race`. If the iterator throws or the consumer exits unexpectedly, the failure surfaces directly to the test instead of becoming an unhandled rejection while the test waits on side-channel promises.
f73cba3 to
b4e4dd6
Compare
| @@ -395,18 +416,19 @@ describe('AbortController and Process Lifecycle (E2E)', () => { | |||
| } | |||
There was a problem hiding this comment.
[Critical] Promise.race([expectedSequence, loopPromise]) lets the test proceed successfully if the SDK iterator ends before canUseToolCalledPromise or secondResultPromise resolves. In that case the final file-content assertion can pass without exercising the intended stdin-close/control-response path, so the lifecycle regression this test is meant to catch can be masked.
Please do not treat normal loopPromise completion as a success condition here. Await the required milestones directly, or make loop() reject if it exits before the expected second result/control callback is observed.
— gpt-5.5 via Qwen Code /review
| for await (const _message of q) { | ||
| console.log(JSON.stringify(_message, null, 2)); | ||
| // Consume messages until completion. | ||
| if (isSDKResultMessage(_message)) { |
There was a problem hiding this comment.
[Critical] The loop treats any second SDKResultMessage as success, while the final assertion only checks that the file stayed unchanged. If the second turn ends with subtype: 'error_during_execution' because the delayed permission/control response is dropped after stdin closes, the file remains unchanged and the test still passes.
Please capture the second result and assert the intended outcome explicitly, for example subtype === 'success' plus a result/tool observation that proves the delayed control response was handled, instead of using unchanged file content alone as the success condition.
— gpt-5.5 via Qwen Code /review
| for await (const message of q) { | ||
| if (isSDKResultMessage(message)) { | ||
| for await (const _message of q) { | ||
| console.log(JSON.stringify(_message, null, 2)); |
There was a problem hiding this comment.
[Suggestion] This unconditional console.log(JSON.stringify(_message, null, 2)) logs every SDK message. On CI retries this can produce noisy output and make the relevant lifecycle failure harder to scan.
Please remove the unconditional log, or gate compact diagnostics behind an explicit debug flag.
— gpt-5.5 via Qwen Code /review
Address review feedback. Each change maps to a specific finding:
- Guard canUseTool by toolName === 'write_file' AND file_path against
the target absolute path. The model may issue read_file or call
write_file with an unexpected path; those must not satisfy the
permission-control timing harness, otherwise the test could pass
without exercising the intended path.
- Capture the second SDK result and assert it's defined, so the
Promise.race below can no longer short-circuit silently.
- Replace `Promise.race([..., loopPromise])` with a rejection-only
loopError partner. Loop completion alone (e.g. iterator ends before
canUseTool is invoked) must not short-circuit the awaited
milestones; only loop errors should fail the test.
- Restore absolute path via `helper.getPath('test.txt')` and embed it
in the prompt, so the file the test asserts on is unambiguously
the same one the model is asked to write.
- Wrap timing promises in a `boundedPromise` helper that clears its
timeout on resolve, eliminating dangling timers on success runs.
- Drop the unconditional `console.log(JSON.stringify(...))` in the
consumer loop to reduce CI retry noise.
Out of scope (acknowledged but deferred): the test still requires
the model to actually emit a write_file tool call; with the new
15s/30s bounded timeouts, an LLM that fails to call write_file now
fails fast with a labeled error ("canUseTool callback not called
timeout after 15000ms") instead of hanging to the global 5-min
testTimeout. Making the test fully model-independent would require
a control-only path that doesn't go through tool dispatch — out of
scope for this regression fix.
|
Addressed all 7 findings in
|
wenshao
left a comment
There was a problem hiding this comment.
Re-reviewed with glm-5.1 after PR update to 432bbf7. The new commits address all previously reported Critical and most Suggestion findings: loopError race pattern prevents false positives, isTargetCall filters the canUseTool callback, boundedPromise clears timers, and unused-parameter naming is fixed. One minor suggestion remains inline.
— glm-5.1 via Qwen Code /review
| }); | ||
| return { promise, resolve: () => resolveFn() }; | ||
| }; | ||
|
|
There was a problem hiding this comment.
[Suggestion] canUseToolCalled and inputStreamDone have 15 s budgets that start counting at creation, but they are logically awaited only after firstResult (30 s budget). On a slow CI where the first LLM round-trip exceeds 15 s, these promises will have already rejected before the sequential chain reaches them, producing a misleading "canUseTool callback not called timeout" error when the real cause was first-result latency.
Consider matching the 15 s budgets to 30 s, or deferring promise creation until just before the relevant phase starts.
— glm-5.1 via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: self-PR. — bailian/deepseek-v4-pro via Qwen Code /review
Address review suggestion: the 15s budgets on canUseToolCalled and inputStreamDone started counting at promise creation, but those phases only begin after firstResult (30s budget) resolves. On a slow CI run where the first LLM round-trip exceeds 15s, those timers would reject before their phase even starts, surfacing a misleading "canUseTool callback not called" error when the actual cause was first-result latency. Add an explicit `startTimer()` to boundedPromise and arm each timer only when its phase actually begins: - firstResult: armed immediately (begins with the query). - canUseToolCalled / inputStreamDone / secondResult: armed inside createPrompt right after firstResult resolves, so first-turn latency cannot eat into their budgets. This also makes timeout errors point at the correct phase if any of them does fire.
|
Addressed review feedback in Suggestion (line 336): 15s budgets on Fix: Added
This also makes any timeout error reliably point at the correct phase. |
yiliang114
left a comment
There was a problem hiding this comment.
Thanks for iterating on this. The latest revision keeps the regression fix focused on the stdin-close lifecycle test, closes the earlier timing and pseudo-pass gaps, and the full CI matrix is green. This looks good to merge.
…version (#3777) * fix(test): restore abort-and-lifecycle stdin-close test to pre-#3723 version #3723 rewrote `should handle control responses when stdin closes before replies` in a way that flipped its semantics: - Old: canUseTool sleeps 1s before allowing; asyncGenerator awaits `inputStreamDonePromise` so stdin closes WHILE the control reply is still in flight; expects `original content` (the in-flight tool must NOT execute). Tests CLI robustness when stdin closes before replies — matching the test name. - New: canUseTool returns `allow` immediately; stdin stays open until the second result arrives; expects `updated`. Requires the LLM to actually call write_file → receive tool result → reply 'done'. The test name still says "stdin closes before replies", but it no longer tests that. The new version times out (testTimeout 5min, retry x2 = 900s) on both macOS and Linux on every push since #3723, because it depends on LLM tool-calling behavior that isn't deterministic on the CI endpoint. CI history shows the pre-#3723 version was stable across 30+ runs. This restores only the test file. The shared permissionFlow, coreToolScheduler/Session wiring, and e2e workflow `npm run bundle` step from #3723 are kept intact. * test(integration): add timeout and unify loop into race chain Address review feedback on the restored test: - firstResultPromise / secondResultPromise now have a 30s setTimeout reject path, matching the pattern used by canUseToolCalledPromise and inputStreamDonePromise (15s). Without these, a hang in the result stream falls back to the global Vitest testTimeout (5min) with no useful diagnostic. - loop() is now retained as `loopPromise` and joined into the await chain via `Promise.race`. If the iterator throws or the consumer exits unexpectedly, the failure surfaces directly to the test instead of becoming an unhandled rejection while the test waits on side-channel promises. * test(integration): close pseudo-pass paths in stdin-close lifecycle test Address review feedback. Each change maps to a specific finding: - Guard canUseTool by toolName === 'write_file' AND file_path against the target absolute path. The model may issue read_file or call write_file with an unexpected path; those must not satisfy the permission-control timing harness, otherwise the test could pass without exercising the intended path. - Capture the second SDK result and assert it's defined, so the Promise.race below can no longer short-circuit silently. - Replace `Promise.race([..., loopPromise])` with a rejection-only loopError partner. Loop completion alone (e.g. iterator ends before canUseTool is invoked) must not short-circuit the awaited milestones; only loop errors should fail the test. - Restore absolute path via `helper.getPath('test.txt')` and embed it in the prompt, so the file the test asserts on is unambiguously the same one the model is asked to write. - Wrap timing promises in a `boundedPromise` helper that clears its timeout on resolve, eliminating dangling timers on success runs. - Drop the unconditional `console.log(JSON.stringify(...))` in the consumer loop to reduce CI retry noise. Out of scope (acknowledged but deferred): the test still requires the model to actually emit a write_file tool call; with the new 15s/30s bounded timeouts, an LLM that fails to call write_file now fails fast with a labeled error ("canUseTool callback not called timeout after 15000ms") instead of hanging to the global 5-min testTimeout. Making the test fully model-independent would require a control-only path that doesn't go through tool dispatch — out of scope for this regression fix. * test(integration): defer phase timers in stdin-close lifecycle test Address review suggestion: the 15s budgets on canUseToolCalled and inputStreamDone started counting at promise creation, but those phases only begin after firstResult (30s budget) resolves. On a slow CI run where the first LLM round-trip exceeds 15s, those timers would reject before their phase even starts, surfacing a misleading "canUseTool callback not called" error when the actual cause was first-result latency. Add an explicit `startTimer()` to boundedPromise and arm each timer only when its phase actually begins: - firstResult: armed immediately (begins with the query). - canUseToolCalled / inputStreamDone / secondResult: armed inside createPrompt right after firstResult resolves, so first-turn latency cannot eat into their budgets. This also makes timeout errors point at the correct phase if any of them does fire.
…#3723 version (QwenLM#3777) * fix(test): restore abort-and-lifecycle stdin-close test to pre-QwenLM#3723 version QwenLM#3723 rewrote `should handle control responses when stdin closes before replies` in a way that flipped its semantics: - Old: canUseTool sleeps 1s before allowing; asyncGenerator awaits `inputStreamDonePromise` so stdin closes WHILE the control reply is still in flight; expects `original content` (the in-flight tool must NOT execute). Tests CLI robustness when stdin closes before replies — matching the test name. - New: canUseTool returns `allow` immediately; stdin stays open until the second result arrives; expects `updated`. Requires the LLM to actually call write_file → receive tool result → reply 'done'. The test name still says "stdin closes before replies", but it no longer tests that. The new version times out (testTimeout 5min, retry x2 = 900s) on both macOS and Linux on every push since QwenLM#3723, because it depends on LLM tool-calling behavior that isn't deterministic on the CI endpoint. CI history shows the pre-QwenLM#3723 version was stable across 30+ runs. This restores only the test file. The shared permissionFlow, coreToolScheduler/Session wiring, and e2e workflow `npm run bundle` step from QwenLM#3723 are kept intact. * test(integration): add timeout and unify loop into race chain Address review feedback on the restored test: - firstResultPromise / secondResultPromise now have a 30s setTimeout reject path, matching the pattern used by canUseToolCalledPromise and inputStreamDonePromise (15s). Without these, a hang in the result stream falls back to the global Vitest testTimeout (5min) with no useful diagnostic. - loop() is now retained as `loopPromise` and joined into the await chain via `Promise.race`. If the iterator throws or the consumer exits unexpectedly, the failure surfaces directly to the test instead of becoming an unhandled rejection while the test waits on side-channel promises. * test(integration): close pseudo-pass paths in stdin-close lifecycle test Address review feedback. Each change maps to a specific finding: - Guard canUseTool by toolName === 'write_file' AND file_path against the target absolute path. The model may issue read_file or call write_file with an unexpected path; those must not satisfy the permission-control timing harness, otherwise the test could pass without exercising the intended path. - Capture the second SDK result and assert it's defined, so the Promise.race below can no longer short-circuit silently. - Replace `Promise.race([..., loopPromise])` with a rejection-only loopError partner. Loop completion alone (e.g. iterator ends before canUseTool is invoked) must not short-circuit the awaited milestones; only loop errors should fail the test. - Restore absolute path via `helper.getPath('test.txt')` and embed it in the prompt, so the file the test asserts on is unambiguously the same one the model is asked to write. - Wrap timing promises in a `boundedPromise` helper that clears its timeout on resolve, eliminating dangling timers on success runs. - Drop the unconditional `console.log(JSON.stringify(...))` in the consumer loop to reduce CI retry noise. Out of scope (acknowledged but deferred): the test still requires the model to actually emit a write_file tool call; with the new 15s/30s bounded timeouts, an LLM that fails to call write_file now fails fast with a labeled error ("canUseTool callback not called timeout after 15000ms") instead of hanging to the global 5-min testTimeout. Making the test fully model-independent would require a control-only path that doesn't go through tool dispatch — out of scope for this regression fix. * test(integration): defer phase timers in stdin-close lifecycle test Address review suggestion: the 15s budgets on canUseToolCalled and inputStreamDone started counting at promise creation, but those phases only begin after firstResult (30s budget) resolves. On a slow CI run where the first LLM round-trip exceeds 15s, those timers would reject before their phase even starts, surfacing a misleading "canUseTool callback not called" error when the actual cause was first-result latency. Add an explicit `startTimer()` to boundedPromise and arm each timer only when its phase actually begins: - firstResult: armed immediately (begins with the query). - canUseToolCalled / inputStreamDone / secondResult: armed inside createPrompt right after firstResult resolves, so first-turn latency cannot eat into their budgets. This also makes timeout errors point at the correct phase if any of them does fire.
Summary
The E2E test
should handle control responses when stdin closes before replies(inintegration-tests/sdk-typescript/abort-and-lifecycle.test.ts) has been failing on every push to main since #3723 was merged, blocking the E2E Tests workflow on both macOS and Linux.This PR restores only that test file to the version immediately before #3723. All of #3723's actual core changes —
permissionFlow.ts,coreToolScheduler.ts/Session.tsrewiring, and thenpm run bundlestep in the E2E workflow — are kept intact.Why the test broke
#3723 included a "fix" commit that rewrote the test in a way that flipped its semantics:
canUseToolcallbackinputStreamDoneResolve()→ sleep 1s → resolveallowallowimmediatelyawait inputStreamDonePromise→ close stdin while control reply in flightq.endInput()aftersecondResultPromiseexpect(content).toBe('original content')— verifies in-flight tool is not executed when stdin closesexpect(content).toBe('updated')— requires LLM →write_filetool call → tool exec → 'done' replyThe post-#3723 version requires the CI LLM endpoint to deterministically:
write_filetool call,'done',resultmessage.This isn't deterministic on the CI's LLM endpoint —
secondResultPromisenever resolves, the test hits the 5-mintestTimeout, retries 2× more, and the run takes 900s before failing.Evidence this is the cause
4cd9f0cb(feat(core): add shared permission flow for tool execution unification #3723 itself): fail — same test timed out 900s on macOS + Linux.8b6b0d64f(fix(cli): restore SubAgent shortcut focus #3771, next push): fail — same test timed out 900s on macOS + Linux.The test name still says "stdin closes before replies" — the pre-#3723 version actually exercises that scenario; the post-#3723 version doesn't.
Test plan
npm run build— succeedsnpm run bundle— succeeds (newpermissionFlowexports resolve)tsc --noEmitonabort-and-lifecycle.test.ts— 0 errorsnpm run lint— 0 errorsOut of scope
interactive/cron-interactive.test.ts > loop fires inline in conversationfirst observed on fix(cli): restore SubAgent shortcut focus #3771 — not related to feat(core): add shared permission flow for tool execution unification #3723 and not deterministic across platforms; suggest tracking separately.