[C++] Repartitioning a large dataset with 3 variables uses all my RAM and kills the process #40224
Description
Describe the bug, including details regarding any error messages, version, and platform.
I'm trying to repartition a ~10 GB dataset based on a new variable, but I can't work out whether this is a bug or expected behaviour given how things are implemented internally. Here's the R code I've been running:
open_dataset("data/pums/person") |>
  mutate(
    age_group = case_when(
      AGEP < 25 ~ "Under 25",
      AGEP < 35 ~ "25-34",
      AGEP < 45 ~ "35-44",
      AGEP < 55 ~ "45-54",
      AGEP < 65 ~ "55-64",
      TRUE ~ "65+"
    )
  ) |>
  write_dataset(
    path = "./data/pums/person-age-partitions",
    partitioning = c("year", "location", "age_group")
  )

The data is in Parquet format and is already partitioned by "year" and "location". When I try to run this, it gradually uses more and more of my RAM until it crashes.
If I run it with the debugger attached, it all looks fine, but eventually dies with the message Program terminated with signal SIGKILL, Killed.
This is using Arrow C++ library version 14.0.2, and R package version 14.0.2.1.
When I try this with a different variable that already exists in the dataset, it uses a lot of RAM, but seems to back off before it gets too high, e.g.
open_dataset("data/pums/person") |>
  write_dataset("data/pums/person-cow-partition", partitioning = c("year", "COW"))

I think I'm missing something here in terms of what's going on, rather than this being a bug?
Component(s)
C++