Conversation
@hendrikmakait is this still a draft PR, or do we want to get this one going?
@ncclementi: I still have to check why the tests don't run on CI; likely the pytest worker gets OOM-killed.
@ncclementi: As it turns out, one issue with the tests was that the partition size was very small (~8 MiB). I've adjusted the tests to use larger partitions; let's see what CI says.
I've also dropped the 10x join since it would mean spilling 10-20x the cluster memory to disk, which would slow tests down significantly. We could (and should) add the occasional BIG data integration test run.
This makes sense. Hopefully, once things are fixed we should see a nice drop in memory usage.
```python
# Control cardinality on column to join - this produces cardinality ~ to len(df)
df2_big["x2"] = df2_big["x"] * 1e9
df2_big = df2_big.astype({"x2": "int"})
df2_big["predicate"] = df2_big["0"] * 1e9
```
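A standalone sketch of the cardinality-control trick discussed here (the frame and values below are hypothetical, not the PR's actual test data): multiplying a float column by 1e9 and casting to int yields a join key with roughly one distinct value per row.

```python
import pandas as pd

# Hypothetical small frame; in the real tests this column is much larger.
df2_big = pd.DataFrame({"x": [0.1, 0.25, 0.5, 0.75]})

# Scale and truncate to int so nearly every row gets a distinct join key,
# i.e. cardinality ~ len(df2_big).
df2_big["predicate"] = (df2_big["x"] * 1e9).astype("int")

assert df2_big["predicate"].nunique() == len(df2_big)
```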
Out of curiosity, why did you choose the name "predicate"?
That's me coming from a DB background. I wanted a name that's more descriptive in this context than x2 and since it's the column used in the join predicate (i.e., the expression used to merge the tables/dataframes), that's what I ended up with. This could also be merge_col or something like that if you find that easier to understand.
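To illustrate the naming (with made-up frames, not the PR's code): "predicate" is simply the column referenced in the join predicate, i.e. the key both dataframes are merged on.

```python
import pandas as pd

# Hypothetical frames; the join predicate is left.predicate == right.predicate.
left = pd.DataFrame({"predicate": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"predicate": [2, 3, 4], "b": [20, 30, 40]})

# Inner join keeps only the rows where the predicate column matches (keys 2 and 3).
joined = left.merge(right, on="predicate", how="inner")
```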
I see. That's good, no need to change it; it was more out of curiosity, and I just learned something new :)
I'm merging this in.
Thanks @hendrikmakait, let's open a separate issue to track integration tests for the bigger case.
p2pintest_join.py#641