Skip to content

Conversation

@WilliamBergamin
Copy link
Contributor

@WilliamBergamin WilliamBergamin commented Mar 21, 2025

Summary

Aims to fix #2138

Since #2187 I've notices that flaky tests error with the following log

  1) Integration tests with a WebSocket server
       unexpected socket messages sent to client
         should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409:

      unexpected number of times `close` event was raised during reconnection!
      + expected - actual

      -3
      +2
      
      at Context.<anonymous> (test/integration.spec.js:217:14)
      at processTicksAndRejections (node:internal/process/task_queues:95:5)

should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409 is a real time system unit test, this means that real time elapse dictate the behavior of the unit test. Javascript in combination with event driven design patterns can be used for real time system applications but the implementation of the socket mode client may not be ideal for this.

Regardless, the tests sets clientPingTimeout to 20ms and then waits 50ms to expect 2 retry attempts. But it fails to account for the overhead of setting up the time out and executing retry requests, if this overhead takes more then 10ms a race condition it created between the unit test the socketModeClient and the mock server.

In order to stabilize the test we could increase the clientPingTimeout and let the test wait longer, but this increases the time elapsed to run the test. To avoid this, this PR aims to poll the number of retries, if 2 retries are detected within 50ms then we can consider the unit test to be successful.

If we see that this change does not improve the stability of the test in our CI pipeline we can revisit it, see this PR as an experiment.

Requirements (place an x in each [ ])

@WilliamBergamin WilliamBergamin added tests M-T: Testing work only pkg:socket-mode applies to `@slack/socket-mode` labels Mar 21, 2025
@WilliamBergamin WilliamBergamin requested a review from zimeg March 21, 2025 14:16
@WilliamBergamin WilliamBergamin self-assigned this Mar 21, 2025
@codecov
Copy link

codecov bot commented Mar 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.99%. Comparing base (0101781) to head (26f0b6e).
Report is 1 commits behind head on main.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2198   +/-   ##
=======================================
  Coverage   91.99%   91.99%           
=======================================
  Files          38       38           
  Lines       10386    10386           
  Branches      657      657           
=======================================
  Hits         9555     9555           
  Misses        819      819           
  Partials       12       12           
Flag Coverage Δ
cli-hooks 95.23% <ø> (ø)
cli-test 94.76% <ø> (ø)
oauth 77.39% <ø> (ø)
socket-mode 61.82% <ø> (ø)
web-api 96.93% <ø> (ø)
webhook 96.65% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Member

@zimeg zimeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WilliamBergamin This is such a promising change! Right now this test flakes >8% of the time... 🚀

I'm excited for it, but left one comment about adding an extra assert with this change to avoid possible regressions of retries without waiting.

That's not so blocking, but I think it's worth a quick discussion since the exacts of this test have been somewhat confusing to me in this meantime 👾

const start_time = Date.now();
let retries = 0;
do {
await sleep(2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 A shorter sleep seems great for more accurate timings!

do {
await sleep(2);
retries = closed;
} while (retries < 2 && Date.now() - start_time < 50);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if 2 retries are detected within 50ms then we can consider the unit test to be successful.

🤔 I'm not sure if breaking on both of these conditions still tests what we're hoping for without other changes, but please let me know if I'm thinking about this odd!

IMO the break makes sense to avoid stalled tests, but we might also want to check that the end time is within a certain range in later asserts? Something between 40ms to 50ms makes sense to me as a start?

But it fails to account for the overhead of setting up the time out and executing retry requests, if this overhead takes more then 10ms a race condition it created between the unit test the socketModeClient and the mock server.

AFAICT the exact timing of this test before changes were made was a big cause of flakes. I think with the above assert we can remain confident that time between retires is still respected without increasing the test duration 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this proposition 💯 this was something I originally considered and figured it may not be needed

But I believe we need to assert against a range of 25ms to 50ms, on my local machine the time elapse for this task seems to be between 29 and 35ms, which is not what I expected 🤔

@WilliamBergamin WilliamBergamin requested a review from zimeg March 24, 2025 19:36
Copy link
Member

@zimeg zimeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for increasing our test confidences around this!

The 25ms lower bound is so interesting but also makes sense with the other setups happening within client.start after the timer starts.

Finding a time below 40ms is most interesting though! I wonder if you've uncovered the cause of flakes 👁️‍🗨️

In either case, I'm hoping this resolves flakes of this test. If not, we should revisit this later!

@WilliamBergamin WilliamBergamin merged commit d3fe084 into main Mar 24, 2025
57 checks passed
@WilliamBergamin WilliamBergamin deleted the attempt-to-stabalize-unit-tests branch March 24, 2025 19:49
@zimeg zimeg mentioned this pull request Mar 26, 2025
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pkg:socket-mode applies to `@slack/socket-mode` tests M-T: Testing work only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

health: flaky socket mode unit tests

3 participants