Conversation
@hendrikmakait is this still a draft PR, or do we want to get this one going?
@ncclementi: I still have to check why the tests don't run on CI; likely the pytest worker gets OOM-killed.
@ncclementi: As it turns out, one issue with the tests was that the partition size was very small (~8 MiB). I've adjusted the tests to use larger partitions; let's see what CI says.
I've also dropped the 10x join since it would mean spilling 10-20x the cluster memory to disk, which would slow tests down significantly. We could (and should) add the occasional BIG data integration test run.
This makes sense. Hopefully, once things are fixed we should see a nice drop in memory usage.
```python
# Control cardinality on column to join - this produces cardinality ~ to len(df)
df2_big["x2"] = df2_big["x"] * 1e9
df2_big = df2_big.astype({"x2": "int"})
df2_big["predicate"] = df2_big["0"] * 1e9
```
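A standalone sketch of the cardinality-control trick discussed here (the frame and values below are hypothetical, not the PR's actual test data): multiplying a float column by 1e9 and casting to int yields a join key with roughly one distinct value per row.

```python
import pandas as pd

# Hypothetical small frame; in the real tests this column is much larger.
df2_big = pd.DataFrame({"x": [0.1, 0.25, 0.5, 0.75]})

# Scale and truncate to int so nearly every row gets a distinct join key,
# i.e. cardinality ~ len(df2_big).
df2_big["predicate"] = (df2_big["x"] * 1e9).astype("int")

assert df2_big["predicate"].nunique() == len(df2_big)
```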
Out of curiosity, why did you choose the name "predicate"?
That's me coming from a DB background. I wanted a name that's more descriptive in this context than x2 and since it's the column used in the join predicate (i.e., the expression used to merge the tables/dataframes), that's what I ended up with. This could also be merge_col or something like that if you find that easier to understand.
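To illustrate the naming (with made-up frames, not the PR's code): "predicate" is simply the column referenced in the join predicate, i.e. the key both dataframes are merged on.

```python
import pandas as pd

# Hypothetical frames; the join predicate is left.predicate == right.predicate.
left = pd.DataFrame({"predicate": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"predicate": [2, 3, 4], "b": [20, 30, 40]})

# Inner join keeps only the rows where the predicate column matches (keys 2 and 3).
joined = left.merge(right, on="predicate", how="inner")
```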
I see. That's good, no need to change it; it was more out of curiosity, and I just learned something new :)
I'm merging this in.
Thanks @hendrikmakait, let's open a separate issue to track integration tests for the bigger case.
p2pintest_join.py#641