fix(test): count result messages instead of assistant messages in multi-model E2E test#4341
Conversation
…ti-model E2E test Each turn produces multiple assistant messages (thinking + text), but the test was counting each one as a separate turn. This caused the third setModel to fire before the third turn actually started, so only 2 system messages were captured instead of the expected 3. Resolve turn-completion promises on isSDKResultMessage (one per turn) instead of isSDKAssistantMessage.
📋 Review SummaryThis PR fixes a flaky E2E test by correcting how turn completion is counted. The root cause analysis is excellent — the test was counting every assistant message (thinking + text per turn) instead of result messages (one per turn), causing the turn counter to inflate and assertions to fail. The fix is minimal, surgical, and directly addresses the identified problem. 🔍 General Feedback
🎯 Specific FeedbackNo specific issues identified in this review.The fix is correct, well-documented, and appropriately scoped. The approach of resolving turn-completion promises on ✅ Highlights
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: self-PR. LGTM! ✅
— DeepSeek/deepseek-v4-pro via Qwen Code /review
pomelo-nwu
left a comment
There was a problem hiding this comment.
LGTM. Accurate root-cause: reasoning models emit two assistant messages per turn (thinking + text), so counting assistant messages overshoots the turn counter and desyncs the setModel sequence. Switching the turn-completion signal to isSDKResultMessage (exactly one per turn) is the correct, deterministic fix. Doesn't affect the adjacent streaming test (boolean dedup). Test-only change, CI green on Linux/macOS/Windows.
Problem
The E2E test
should handle multiple model changes in sequenceinsystem-control.test.tsfails consistently on both Linux and macOS CI:Root Cause
The test counts every
isSDKAssistantMessageas one completed turn, but each turn actually produces multiple assistant messages — athinkingmessage followed by atextmessage. This inflates the turn counter:resolvers[0]thensetModel(qwen3-turbo)resolvers[1]thensetModel(qwen3-vl-plus)resolvers[2]then checksystemMessages.lengthAt the assertion point only 2 turns have completed (2 system messages), but the test expects 3. The third turn never even starts.
Fix
Resolve turn-completion promises on
isSDKResultMessage(exactly one per turn) instead ofisSDKAssistantMessage(multiple per turn). TheresultWaiter.notifyResult()was already called on result messages — we now also advance the turn counter there.Reviewer Test Plan
should handle multiple model changes in sequencepasses in thesdk-typescript/system-control.test.tsE2E suite.should change model dynamically during streaming inputstill passes (it was already using a different dedup strategy and is unaffected).