Enable automatic column projection for groupby aggregations#9442
jrbourbeau merged 3 commits into dask:main
Conversation
jrbourbeau
left a comment
Thanks @rjzamora! Automating column projection where possible would be great 👍
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
```python
# Make sure dict-based aggregation specs result in an
# explicit `getitem` layer to improve column projection
if isinstance(spec, dict):
    assert hlg_layer(result1.dask, "getitem")
```
I like that this assert is generic, but I do wonder if it's too generic and could still continue to pass if we somehow lose blockwise + getitem fusion in the future. Thoughts on how we might be able to more explicitly test that the column projection we're after is indeed happening?
This is a good point. I did consider adding another getitem-fusion test, but it seemed to me like we already have decent coverage for the getitem optimization. I decided that the "important" change here is that the getitem layer is created, but I'm open to improving the coverage.
And what happens when we group by on the index column @rjzamora? Thanks! It seems that this is done:

And it raises a KeyError.

Good catch @odovad - thanks for pointing this out!
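For context, here is a minimal sketch of the failure mode discussed above, illustrated with pandas (whose column semantics Dask-DataFrame mirrors; the data is invented for illustration). Grouping by the index works by level name, but projecting the groupby key as if it were a column raises a `KeyError`:

```python
import pandas as pd

# Hypothetical frame where the groupby key "id" is the index, not a column
pdf = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0]},
    index=pd.Index([1, 1, 2], name="id"),
)

# Grouping by "id" works even though it is the index level, not a column:
result = pdf.groupby("id").agg({"x": "mean"})

# ...but naively projecting the key as a column fails, because "id"
# is not among the columns -- this is the KeyError reported above:
try:
    pdf[["id", "x"]]
    raised = False
except KeyError:
    raised = True
```

This is why an automatic projection step must check whether the groupby key is a column before selecting it.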
This PR corresponds to the Dask-cudf version of dask/dask#9442, which was found to improve the performance of many groupby-based workflows. After this PR:

```python
import dask_cudf

path = "/criteo-dataset/day_0.parquet"
ddf = dask_cudf.read_parquet(path, split_row_groups=10)

# The following takes <2s with this PR, and fails with
# an OOM error on main (using a 32GB GPU):
ddf.groupby("C1").agg({"C2": "mean"}).compute()
```

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12124
While looking at the h2o benchmark queries in coiled-runtime, I noticed that every query includes an explicit column selection operation right before the groupby-aggregation. There is nothing wrong with this - explicit column selection is GOOD Dask-DataFrame practice. Doing this produces a `getitem`-based `Blockwise` layer in the high-level graph that can often be projected into the IO layer at graph-optimization time. The "problem" I see here is that most Dask users are unlikely to know about optimization opportunities like this.

This PR proposes that Dask automatically add explicit column-selection operations for groupby aggregations with dict-based aggregation specs. This change is relatively simple, but results in a significant performance boost for naive user code. For example:

This PR: `Wall time: 7.58 s`
main: `Wall time: 18.8 s`

Note that this PR will effectively change `filtered.groupby('id').agg({'x':'mean'})` to `filtered[['id', 'x']].groupby('id').agg({'x':'mean'})`

- [x] `pre-commit run --all-files`
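The rewrite above relies on the two forms being equivalent. A minimal sketch of that equivalence, using pandas for illustration (Dask-DataFrame follows the same column semantics; the data here is invented):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "x": [1.0, 2.0, 3.0, 4.0],
        "unused": ["a", "b", "c", "d"],  # never touched by the aggregation
    }
)

# Naive form: the aggregation spec only mentions "x", but every
# column is carried through the graph up to the aggregation step.
naive = df.groupby("id").agg({"x": "mean"})

# Explicit form: selecting only the needed columns up front is what
# lets Dask project the selection into the IO layer at optimization time.
explicit = df[["id", "x"]].groupby("id").agg({"x": "mean"})

assert naive.equals(explicit)
```

In Dask, the explicit form produces the `getitem`-based `Blockwise` layer that the optimizer can push into `read_parquet`, so only the referenced columns are ever read from disk.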