
groupby performance degrades rapidly as column cardinality increases #9292

@MrPowers

Description

Dask can perform groupby operations relatively well for columns with low cardinality, but performance seems to degrade significantly for columns with more distinct values.

Let's look at the h2o groupby benchmarks for some examples on the 1e9 dataset, which contains one billion rows and is 50 GB on disk when stored as an uncompressed CSV file.

Let's query this dataset with a 16-node i3.large cluster (244 GB of cluster RAM total). Here's a query that returns 10,000 distinct row groups (i.e. the result has 10,000 rows) and takes 37 seconds to execute. Here's the notebook with complete details on the software environment, the cluster configuration, all the code, etc.

ddf = dd.read_parquet(
    "s3://coiled-datasets/h2o-benchmark/N_1e9_K_1e2_parquet",
    storage_options={"anon": True, "use_ssl": True},
    columns=["id1", "id2", "v1"],
    engine="pyarrow",
)

ddf.groupby(["id1", "id2"], dropna=False, observed=True).agg({"v1": "sum"}).compute()

Here's a different groupby query on the same dataset that returns a million row groups. I ran this query for about half an hour before it errored out with a KilledWorker exception. Here's the notebook.

ddf = dd.read_parquet(
    "s3://coiled-datasets/h2o-benchmark/N_1e9_K_1e2_parquet",
    storage_options={"anon": True, "use_ssl": True},
    columns=["id3", "v1", "v3"],
    engine="pyarrow",
)

ddf.groupby("id3", dropna=False, observed=True).agg(
    {"v1": "sum", "v3": "mean"}
).compute()

Let's try to get this running by using the string[pyarrow] data type and by repartitioning first. The string[pyarrow] dtype massively shrinks the memory footprint of the DataFrame, from 224 GB to 37 GB. This query still errors out with a KilledWorker exception.

ddf["id3"] = ddf["id3"].astype("string[pyarrow]")

ddf.repartition(200).groupby("id3", dropna=False, observed=True).agg(
    {"v1": "sum", "v3": "mean"}
).compute()

It's the same dataset and the same cluster, so it seems like the issue is caused by the cardinality of the groupby column. I'm interested in hearing what other folks think the root cause is. Let me know if there's any additional info I can provide for more context. I'm happy to grab dashboard screenshots, look at logs, etc.
