
groupby performance degrades rapidly as column cardinality increases #9292

@MrPowers

Description

Dask can perform groupby operations relatively well for columns with low cardinality, but performance seems to degrade significantly for columns with more distinct values.

Let's look at the h2o groupby benchmarks for some examples on the 1e9 dataset, which contains one billion rows and is 50 GB on disk when stored as an uncompressed CSV file.

Let's query this dataset with a 16-node i3.large cluster (244 GB of cluster RAM total). Here's a query that returns 10,000 distinct row groups (i.e. the result has 10,000 rows) and takes 37 seconds to execute. Here's the notebook with complete details on the software environment, the cluster configuration, all the code, etc.

ddf = dd.read_parquet(
    "s3://coiled-datasets/h2o-benchmark/N_1e9_K_1e2_parquet",
    storage_options={"anon": True, "use_ssl": True},
    columns=["id1", "id2", "v1"],
    engine="pyarrow",
)

ddf.groupby(["id1", "id2"], dropna=False, observed=True).agg({"v1": "sum"}).compute()

Here's a different groupby query on the same dataset that returns a million row groups. I ran this query for about half an hour before it errored out with a KilledWorker exception. Here's the notebook.

ddf = dd.read_parquet(
    "s3://coiled-datasets/h2o-benchmark/N_1e9_K_1e2_parquet",
    storage_options={"anon": True, "use_ssl": True},
    columns=["id3", "v1", "v3"],
    engine="pyarrow",
)

ddf.groupby("id3", dropna=False, observed=True).agg(
    {"v1": "sum", "v3": "mean"}
).compute()

Let's try to get this running by using the string[pyarrow] data type and by repartitioning first. The string[pyarrow] dtype massively shrinks the memory footprint of the DataFrame, from 224 GB to 37 GB. This query still errors out with a KilledWorker exception.

ddf["id3"] = ddf["id3"].astype("string[pyarrow]")

ddf.repartition(200).groupby("id3", dropna=False, observed=True).agg(
    {"v1": "sum", "v3": "mean"}
).compute()

It's the same dataset and the same cluster, so it seems like the issue is caused by the cardinality of the groupby column. I'm interested in hearing what other folks think the root cause is. Let me know if there's any additional info I can provide for more context. I'm happy to grab dashboard screenshots, look at logs, etc.
