[C++] Repartitioning a large dataset with 3 variables uses all my RAM and kills the process #40224

@thisisnic

Description

Describe the bug, including details regarding any error messages, version, and platform.

I'm trying to repartition a ~10 GB dataset based on a new variable, but I can't work out whether this is a bug or expected behaviour given how things are implemented internally. Here's the R code I've been running:

open_dataset("data/pums/person") |>
  mutate(
    age_group = case_when(
      AGEP < 25 ~ "Under 25",
      AGEP < 35 ~ "25-34",
      AGEP < 45 ~ "35-44",
      AGEP < 55 ~ "45-54",
      AGEP < 65 ~ "55-64",
      TRUE ~ "65+"
    )
  ) |>
  write_dataset(
    path = "./data/pums/person-age-partitions",
    partitioning = c("year", "location", "age_group")
  )

The data is in Parquet format and is already partitioned by "year" and "location". When I try to run this, it gradually uses more and more of my RAM until it crashes.

If I run it with the debugger attached, everything looks fine until it eventually dies with the message "Program terminated with signal SIGKILL, Killed".

This is using Arrow C++ library version 14.0.2 and R package version 14.0.2.1.

When I try this with a different variable that already exists in the dataset, it also uses a lot of RAM, but seems to back off before usage gets too high, e.g.

open_dataset("data/pums/person") |>
  write_dataset("data/pums/person-cow-partition", partitioning = c("year", "COW"))

Am I missing something here about how this works internally, or is this a bug?
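For context, this is the kind of workaround I've been considering (a sketch only, untested at this scale; it assumes the dataset has the `year` column used in the code above): repartitioning one year at a time, so each `write_dataset()` call only has to scan a slice of the full dataset rather than holding everything in memory at once.

```r
library(arrow)
library(dplyr)

# Find the distinct years in the existing dataset.
years <- open_dataset("data/pums/person") |>
  distinct(year) |>
  collect() |>
  pull(year)

# Repartition one year at a time to bound peak memory use.
for (y in years) {
  open_dataset("data/pums/person") |>
    filter(year == y) |>
    mutate(
      age_group = case_when(
        AGEP < 25 ~ "Under 25",
        AGEP < 35 ~ "25-34",
        AGEP < 45 ~ "35-44",
        AGEP < 55 ~ "45-54",
        AGEP < 65 ~ "55-64",
        TRUE ~ "65+"
      )
    ) |>
    write_dataset(
      path = "./data/pums/person-age-partitions",
      partitioning = c("year", "location", "age_group")
    )
}
```

Even if this works, though, I'd still like to understand why the single-pass version runs out of memory.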

Component(s)

C++

Labels

Component: C++ · Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) · Type: usage (issue is a user question)
