-
Notifications
You must be signed in to change notification settings - Fork 679
test: attempt to stabilize socket mode unit tests #2198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found. Additional details and impacted files@@ Coverage Diff @@
## main #2198 +/- ##
=======================================
Coverage 91.99% 91.99%
=======================================
Files 38 38
Lines 10386 10386
Branches 657 657
=======================================
Hits 9555 9555
Misses 819 819
Partials 12 12
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
zimeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WilliamBergamin This is such a promising change! Right now this test flakes >8% of the time... 🚀
I'm excited for it, but left one comment about adding an extra assert with this change to avoid possible regressions of retries without waiting.
That's not so blocking, but I think it's worth a quick discussion since the exacts of this test have been somewhat confusing to me in this meantime 👾
| const start_time = Date.now(); | ||
| let retries = 0; | ||
| do { | ||
| await sleep(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤖 A shorter sleep seems great for more accurate timings!
| do { | ||
| await sleep(2); | ||
| retries = closed; | ||
| } while (retries < 2 && Date.now() - start_time < 50); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if 2 retries are detected within 50ms then we can consider the unit test to be successful.
🤔 I'm not sure if breaking on both of these conditions still tests what we're hoping for without other changes, but please let me know if I'm thinking about this odd!
IMO the break makes sense to avoid stalled tests, but we might also want to check that the end time is within a certain range in later asserts? Something between 40ms to 50ms makes sense to me as a start?
But it fails to account for the overhead of setting up the time out and executing retry requests, if this overhead takes more then 10ms a race condition it created between the unit test the socketModeClient and the mock server.
AFAICT the exact timing of this test before changes were made was a big cause of flakes. I think with the above assert we can remain confident that time between retires is still respected without increasing the test duration 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this proposition 💯 this was something I originally considered and figured it may not be needed
But I believe we need to assert against a range of 25ms to 50ms, on my local machine the time elapse for this task seems to be between 29 and 35ms, which is not what I expected 🤔
zimeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for increasing our test confidences around this!
The 25ms lower bound is so interesting but also makes sense with the other setups happening within client.start after the timer starts.
Finding a time below 40ms is most interesting though! I wonder if you've uncovered the cause of flakes 👁️🗨️
In either case, I'm hoping this resolves flakes of this test. If not, we should revisit this later!
Co-authored-by: Eden Zimbelman <eden.zimbelman@salesforce.com>
Summary
Aims to fix #2138
Since #2187 I've notices that flaky tests error with the following log
1) Integration tests with a WebSocket server unexpected socket messages sent to client should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409: unexpected number of times `close` event was raised during reconnection! + expected - actual -3 +2 at Context.<anonymous> (test/integration.spec.js:217:14) at processTicksAndRejections (node:internal/process/task_queues:95:5)should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409 is a real time system unit test, this means that real time elapse dictate the behavior of the unit test. Javascript in combination with event driven design patterns can be used for real time system applications but the implementation of the socket mode client may not be ideal for this.
Regardless, the tests sets clientPingTimeout to
20msand then waits50msto expect 2 retry attempts. But it fails to account for the overhead of setting up the time out and executing retry requests, if this overhead takes more then10msa race condition it created between the unit test the socketModeClient and the mock server.In order to stabilize the test we could increase the
clientPingTimeoutand let the test wait longer, but this increases the time elapsed to run the test. To avoid this, this PR aims to poll the number of retries, if 2 retries are detected within50msthen we can consider the unit test to be successful.If we see that this change does not improve the stability of the test in our CI pipeline we can revisit it, see this PR as an experiment.
Requirements (place an
xin each[ ])