[C++] Repartitioning a large dataset with 3 variables uses all my RAM and kills the process #40224

@thisisnic

Description

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to repartition a ~10 GB dataset based on a new variable, but I can't work out whether this is a bug or expected behaviour given how things are implemented internally. Here's the R code I've been running:

open_dataset("data/pums/person") |>
  mutate(
    age_group = case_when(
      AGEP < 25 ~ "Under 25",
      AGEP < 35 ~ "25-34",
      AGEP < 45 ~ "35-44",
      AGEP < 55 ~ "45-54",
      AGEP < 65 ~ "55-64",
      TRUE ~ "65+"
    )
  ) |>
  write_dataset(
    path = "./data/pums/person-age-partitions",
    partitioning = c("year", "location", "age_group")
  )

The data is in Parquet format and is already partitioned by "year" and "location". When I try to run this, it gradually uses more and more of my RAM until it crashes.

If I run it with the debugger attached, everything looks fine until it eventually dies with the message "Program terminated with signal SIGKILL, Killed".

This is using Arrow C++ library version 14.0.2 and R package version 14.0.2.1.

When I try this with a different variable that already exists in the dataset, it also uses a lot of RAM, but seems to back off before usage gets too high, e.g.

open_dataset("data/pums/person") |>
  write_dataset("data/pums/person-cow-partition", partitioning = c("year", "COW"))

Am I missing something here about how this works internally, or is this a bug?
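For context, this is the kind of workaround I've been considering (a sketch only, untested at this scale; it assumes the dataset has the `year` column used in the code above): repartitioning one year at a time, so each `write_dataset()` call only has to scan a slice of the full dataset rather than holding everything in memory at once.

```r
library(arrow)
library(dplyr)

# Find the distinct years in the existing dataset.
years <- open_dataset("data/pums/person") |>
  distinct(year) |>
  collect() |>
  pull(year)

# Repartition one year at a time to bound peak memory use.
for (y in years) {
  open_dataset("data/pums/person") |>
    filter(year == y) |>
    mutate(
      age_group = case_when(
        AGEP < 25 ~ "Under 25",
        AGEP < 35 ~ "25-34",
        AGEP < 45 ~ "35-44",
        AGEP < 55 ~ "45-54",
        AGEP < 65 ~ "55-64",
        TRUE ~ "65+"
      )
    ) |>
    write_dataset(
      path = "./data/pums/person-age-partitions",
      partitioning = c("year", "location", "age_group")
    )
}
```

Even if this works, though, I'd still like to understand why the single-pass version runs out of memory.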

Component(s)

C++

Labels

Component: C++ · Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) · Type: usage (issue is a user question)
