-
Notifications
You must be signed in to change notification settings - Fork 5k
test(test-runner): test round-robin sharding with stable-test-runner #33032
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test(test-runner): test round-robin sharding with stable-test-runner #33032
Conversation
Test results for "tests 1"3 flaky35869 passed, 620 skipped Merge workflow run. |
|
Do I understand it right that if the tests had beforeAll / worker fixture setups (as in real e2e tests), we would observe an opposite effect? |
|
I'm referring to this part: By default, we recommend using fullyParallel. This means that for the tests with non-trivial hooks we would do much more work in case of round-robin as we'll be running those hooks 3x more times. Making sure we are on the same page wrt the idea behind your change. |
|
I think beforeAll kinda breaks fully-parallel as it causes tests in a spec file to be executed in the same test group, but in general you are right, round-robin can cause worker scoped hooks and fixtures to be executed more times than the current sharding algorithm. And setup / teardown times of worker scoped fixtures will probably not be included in test run times, so the duration-round-robin algorithm can not account for those yet… In our use-case, we are not making use of beforeAll, but instead have a heavy global setup. With global setup there is no difference between the sharding modes, as it's executed simply once per shard (not worker) regardless of the tests which will be run per shard. There clearly is no silver bullet. However, I think that this solution gives the playwright user more flexibility to try-out and choose whatever works best for them. |
Going back to my original sentiment, that's exactly why we are hesitant landing the PR. We are looking for the ways to cover more use cases (with hooks) with a smaller api surface / maintenance cost. For example, allow test lists for shards so that users could use third party solutions to tune their exact configuration. Or allow a callback that would take over scheduling tests. Developing those to be easy in maintenance requires consideration and time that we can't currently allocate to the problem. But we are very open to keeping this communication in case a nice proposal comes up. |
|
Would you be open to an approach that allows users to implement their own test filter? …basically similar to grep, but instead of RegEx use a function that takes a list of test infos and returns a filtered list... That would actually allow users to implement their own sharding logic. |
I think it is worth exploring. The other option is a test list file. We know that people want test lists, but we are struggling with committing to a persistent test id that would be used in those (it becomes a part of our contract). Lists are also sub-optimal for those interested in sharding as shards have different lists. But many more customers are interested in custom failure retries where test lists are very useful, so it might be that committing it worth it. Callback approach is potentially simpler as we can use the reporter data types there, but it might hit some rocks as we would allow users specifying the order of the tests and that might result in suboptimal worker process alignment. Just thinking out loud here. In either case, I like having lower-level primitives that allow for greater flexibility for power users to tune Playwright to their definition of perfection. Much more than having a handful of suboptimal presets that will only work for a couple of customers. |
|
Okay, sounds good. I just looked into filtering and implemented a first version: #33049 |
This PR contains the changes of #30962 applied as patch to the stable-test-runner.
Here's a comparison of the run times for the tests in CI…
The times above are from these test runs…