
HLG optimization for read_parquet + iloc #6345

Closed
gforsyth wants to merge 3 commits into dask:master from gforsyth:iloc_hlg

Conversation

@gforsyth
Contributor

@gforsyth gforsyth commented Jun 25, 2020

  • Tests added / passed
  • Passes black dask / flake8 dask

Starting to address one of the points raised by @martindurant in #6264.

Adds a high level graph optimization to handle a read_parquet followed by a full-column iloc by updating the ParquetSubgraph.

Very much a draft and very much not completely working.

Need to:

  • make it pass tests
  • ~~not assume that all calls to iloc are for full column selections~~ (dask only supports full column selection, so nm for now)
  • Think about how / if to combine parts of this with the existing getitem optimizations as there are a bunch of similarities (and a number of differences)

But, initial simple benchmark:
timeseries parquet file created with dask.dataframe.demo with 4 columns and 34447680 rows

on master:

%%time
len(df)

CPU times: user 8.04 s, sys: 4.2 s, total: 12.2 s
Wall time: 6.21 s

with optimization:

%%time
len(df)

running getitem optimization
running iloc optimization
Updating iloc subgraph!
CPU times: user 4.62 s, sys: 734 ms, total: 5.35 s
Wall time: 2.14 s

@TomAugspurger
Member

Question: In my head I was thinking of an approach like

  1. Allow iloc in addition to getitem at
    if list(block.dsk.values())[0][0] != operator.getitem:
    # ... where this value is __getitem__...
    return dsk
  2. For iloc, use meta[block.indices[1][0]] at
    block_columns = block.indices[1][0]

Do you know if something like that has a chance of working?

@gforsyth
Contributor Author

  1. Allow iloc in addition to getitem at

yeah, so I've found a better check, which is verifying that block.output.startswith("iloc-") -- there's no clean operator check since the Blockwise looks like

> block.dsk
{'iloc-677f4093af82134a39609fb71cf9586a': (<function apply at 0x7fb6ff4808b0>, <function apply_and_enforce at 0x7fb701cfc940>, ['_0', '_1'], (<class 'dict'>, [['_func', <function iloc at 0x7fb701ccd790>], ['_meta', Empty DataFrame
Columns: [name, x]
Index: []]]))}
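That layer-name check can be sketched in isolation (an illustration, not the dask code): dask names a layer's output `<operation>-<token>`, so a string prefix test stands in for an operator identity check.

```python
# Sketch only: dask layer outputs are named "<operation>-<token>",
# e.g. "iloc-677f4093af82134a39609fb71cf9586a", so a prefix check can
# identify an iloc layer when no clean operator check is available.
def is_iloc_layer(output_name):
    return output_name.startswith("iloc-")

assert is_iloc_layer("iloc-677f4093af82134a39609fb71cf9586a")
assert not is_iloc_layer("read-parquet-a1b2c3")
```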

re 2: I think I got too fancy trying to handle slices -- might make more sense to take a first cut with single columns.

@gforsyth
Contributor Author

Ok, I think that's working (still as a separate optimization function). I'll spend some time tomorrow to see what an overlay would look like. Most of the code is duplicated, as you noted, just some special handling around column names vs. indices, I think.

@gforsyth gforsyth changed the title from [WIP] HLG optimization for read_parquet + iloc to HLG optimization for read_parquet + iloc Jun 26, 2020
@gforsyth gforsyth marked this pull request as ready for review June 26, 2020 15:27
This extends the existing `read_parquet -> getitem` optimization to also cover
`read_parquet -> iloc`.

The way it's currently set up, it:
* Only supports single-column selection (this could be changed)
* Doesn't support dataframes with duplicate column names
@gforsyth
Contributor Author

Ok, this is ready for another look. It was more straightforward than I thought to combine them (woo!).
Really the only bits where there is divergence with the existing getitem stuff is that the block indices are indices instead of column names and the check for an iloc doesn't have a nice operator check available.

-        if list(block.dsk.values())[0][0] != operator.getitem:
+        if list(block.dsk.values())[0][
+            0
+        ] != operator.getitem and not block.output.startswith("iloc"):
Contributor Author

I'm open to suggestions if there's a better way to check for a read_parquet -> iloc

Member

To be sure: does the current optimization work with loc, which is really identical to column selection if the other selector is :?
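For context, the equivalence being asked about holds in plain pandas; a minimal sketch (not the dask optimization path):

```python
import pandas as pd

# .loc with a full ":" row selector is identical to plain column selection
df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
assert df.loc[:, ["B", "A"]].equals(df[["B", "A"]])
assert df.loc[:, "B"].equals(df["B"])
```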

Member

I doubt that it works with .loc.

Member

I'm open to suggestions if there's a better way to check for a read_parquet -> iloc

Probably out of scope for this PR, but I think we'll want to specialize the kind of Block that we return from an iloc operation. This feels similar to #6261 where I added things like BlockwiseGetitem to avoid having to dive into Blockwise objects to figure out what they are.
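A rough sketch of that kind of specialization (hypothetical class names; only BlockwiseGetitem from #6261 is real):

```python
# Hypothetical sketch: a marker subclass lets the optimizer recognize an
# iloc layer via isinstance() instead of digging through the task graph.
class Blockwise:  # stand-in for dask.blockwise.Blockwise
    def __init__(self, output):
        self.output = output

class BlockwiseIloc(Blockwise):
    """Layer produced by a positional column selection (iloc)."""

layer = BlockwiseIloc("iloc-abc123")
assert isinstance(layer, BlockwiseIloc)  # no graph inspection needed
```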

Member

@TomAugspurger TomAugspurger left a comment

Just to be sure, can you do something like

# column order is A B C
df = dd.read_parquet(...)[['B', 'A']]  # swap the columns
result = df.iloc[:, 0]
assert result.name == "B"
assert result.compute().name == "B"
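The same invariant can be checked in plain pandas, which is the behaviour the optimized graph has to reproduce (sketch only):

```python
import pandas as pd

# column order is A B C
pdf = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
swapped = pdf[["B", "A"]]            # swap the columns
assert swapped.iloc[:, 0].name == "B"
```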


block_columns = block.indices[1][0]
if isinstance(block_columns, slice):
    # only single-column iloc is currently optimized
Member

What prevents us from doing the same slice on old.meta.columns[block_columns] here?

Contributor Author

Nothing prevents it here, but I noticed that the set operation below scrambles the column order with some regularity
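A small illustration of the scrambling (not the actual dask code): a set intersection carries no ordering guarantee, while a comprehension over the original column order preserves it.

```python
all_columns = ["name", "id", "x", "y"]   # order as stored in the file
requested = ["x", "name"]

# set intersection: membership is right, but order is undefined
assert set(all_columns) & set(requested) == {"name", "x"}

# order-preserving intersection
ordered = [c for c in all_columns if c in requested]
assert ordered == ["name", "x"]
```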

@gforsyth
Contributor Author

gforsyth commented Jun 26, 2020

# column order is A B C
df = dd.read_parquet(...)[['B', 'A']]  # swap the columns
result = df.iloc[:, 0]
assert result.name == "B"
assert result.compute().name == "B"

That's a great check, added as another test.
Also yes, that works 🎉

@mrocklin
Member

Quick thought: what if instead we changed the definition of df.iloc[:, ...] to call df[...] in common, simple situations?

I understand that there are some tricky cases of iloc when we have duplicate column names, but my guess is that those are very rare. By trying to centralize many API routes down to a few common operations we might be able to more easily reason about optimizations in the future.

@martindurant
Member

what if instead we changed the definition of df.iloc[:, ...] to call df[...] in common, simple situations?

I think that's what I originally had conceived of. In that case, though, we would have to check for and forgo the optimisation in the case of duplicate labels. Note that parquet does not allow duplicates, but pyarrow will apply pandas names to columns on load, so duplicates are technically still possible.
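A pandas sketch of why duplicate labels make iloc and getitem diverge (illustration only):

```python
import pandas as pd

# With duplicate labels, label selection returns every matching column,
# while iloc can still address a single position.
df = pd.DataFrame([[1, 2]], columns=["a", "a"])
assert df["a"].shape == (1, 2)       # getitem: both "a" columns
assert df.iloc[:, 0].shape == (1,)   # iloc: just the first one
```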

@mrocklin
Member

Yeah, duplicate labels seem uncommon to me though. I guess this comes down to figuring out the cost of maintaining an alternate iloc optimization code path vs the value of how often this code path will be used. I don't have a good sense of that.

@gforsyth
Contributor Author

Quick thought: what if instead we changed the definition of df.iloc[:, ...] to call df[...] in common case simple situations?

I'm game to try that out and see what it looks like.
I can think of two ways to handle this:
a) tweak the HLG and replace iloc with getitem there
b) change the iloc method on dask.dataframe.core to do a column name lookup and then dispatch to __getitem__

Any strong feelings on approach?
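Option (b) could look roughly like this in plain pandas terms (iloc_via_getitem is a hypothetical helper, not dask code):

```python
import pandas as pd

def iloc_via_getitem(df, col_index):
    # translate the positional index to a label, then defer to getitem;
    # this breaks down when column labels are duplicated
    label = df.columns[col_index]
    return df[label]

pdf = pd.DataFrame({"A": [1], "B": [2]})
assert iloc_via_getitem(pdf, 1).equals(pdf["B"])
```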

@mrocklin
Member

mrocklin commented Jun 29, 2020 via email

@gforsyth gforsyth mentioned this pull request Jun 29, 2020
@TomAugspurger
Member

So this may not be needed after #6355?

@gforsyth
Contributor Author

Yep, closing in favor of #6355

@gforsyth gforsyth closed this Jun 30, 2020