HLG optimization for read_parquet + iloc #6345
Conversation
Question: In my head I was thinking of an approach like
Do you know if something like that has a chance of working?
Yeah, so I've found a better check. Re 2: I think I got too fancy trying to handle slices -- might make more sense to take a first cut with single columns.
Ok, I think that's working (still as a separate optimization function). I'll spend some time tomorrow to see what an overlay would look like. Most of the code is duplicated, as you noted, just some special handling around column names vs. indices, I think.
This extends the existing `read_parquet -> getitem` optimization to also cover `read_parquet -> iloc`. As it's currently set up it:

* Only supports single-column selection (this could be changed)
* Doesn't support dataframes with duplicate column names
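As a dask-free illustration of what this kind of pushdown does, here is a toy sketch: the task-tuple format, keys, and function names below are invented for this example and are not dask's real graph representation.

```python
# Toy task graph: a read task that loads every column, followed by a
# single-column selection (hypothetical format, for illustration only).
graph = {
    "read": ("read_parquet", "data.parquet", None),  # columns=None -> all
    "select": ("getitem", "read", "B"),
}


def push_column_selection(dsk):
    """Fuse read -> getitem pairs so the read only loads the needed column."""
    out = dict(dsk)
    for key, task in dsk.items():
        if task[0] == "getitem" and dsk[task[1]][0] == "read_parquet":
            _, read_key, column = task
            _, path, _ = dsk[read_key]
            out[key] = ("read_parquet", path, column)  # one fused task
            del out[read_key]
    return out
```

After the rewrite the graph contains a single task that reads only column `"B"`, which is the effect the real optimization achieves by updating the `ParquetSubgraph`.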
Ok, this is ready for another look. It was more straightforward than I thought to combine them (woo!).
```diff
-if list(block.dsk.values())[0][0] != operator.getitem:
+if list(block.dsk.values())[0][
+    0
+] != operator.getitem and not block.output.startswith("iloc"):
```
I'm open to suggestions if there's a better way to check for a `read_parquet -> iloc`
To be sure: does the current optimization work with `loc`, which is really identical to column selection if the other selector is `:`?
I doubt that it works with .loc.
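For reference, at the pandas level `.loc` with a full row slice is indeed equivalent to plain label-based column selection, which is the equivalence the question relies on:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# .loc with ":" as the row selector is just label-based column selection
assert df.loc[:, ["A"]].equals(df[["A"]])
assert df.loc[:, "A"].equals(df["A"])
```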
> I'm open to suggestions if there's a better way to check for a `read_parquet -> iloc`
Probably out of scope for this PR, but I think we'll want to specialize the kind of `Block` that we return from an `iloc` operation. This feels similar to #6261 where I added things like `BlockwiseGetitem` to avoid having to dive into `Blockwise` objects to figure out what they are.
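A hypothetical sketch of that idea (the `Blockwise` stand-in and the `BlockwiseIloc` name are illustrative, not dask's actual classes): a marker subclass would let the optimizer use an `isinstance` check instead of string-matching layer output names.

```python
class Blockwise:
    """Minimal stand-in for a blockwise graph layer."""

    def __init__(self, output, dsk, indices):
        self.output = output
        self.dsk = dsk
        self.indices = indices


class BlockwiseIloc(Blockwise):
    """Marker subclass identifying a positional column-selection layer."""


def is_iloc_layer(layer):
    # Robust type check, instead of layer.output.startswith("iloc")
    return isinstance(layer, BlockwiseIloc)
```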
TomAugspurger left a comment
Just to be sure, can you do something like

```python
# column order is A B C
df = dd.read_parquet(...)[['B', 'A']]  # swap the columns
result = df.iloc[:, 0]
assert result.name == "B"
assert result.compute().name == "B"
```
```python
block_columns = block.indices[1][0]
if isinstance(block_columns, slice):
    # only single-column iloc is currently optimized
```
What prevents us from doing the same slice on `old.meta.columns[block_columns]` here?
Nothing prevents it here, but I noticed that the set operation below scrambles the column order with some regularity.
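For illustration, `set` operations don't preserve insertion order, so intersecting column names through a set can reorder them; an order-preserving comprehension avoids that (the column names here are made up):

```python
cols = ["B", "A", "C"]  # original column order
wanted = {"A", "B"}

scrambled = list(set(cols) & wanted)       # order is arbitrary
stable = [c for c in cols if c in wanted]  # always ["B", "A"]
```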
That's a great check; added as another test.
Quick thought: what if instead we changed the definition of `df.iloc[:, ...]` to call `df[...]` in common-case simple situations? I understand that there are some tricky cases of iloc when we have duplicate column names, but my guess is that those are very rare. By trying to centralize many API routes down to a few common operations we might be able to more easily reason about optimizations in the future.
I think that's what I originally had conceived of. In that case, though, we would have to check for and forgo the optimisation in the case of duplicate labels. Note that parquet does not allow duplicates, but pyarrow will apply pandas names to columns on load, so duplicates are technically still possible.

Yeah, duplicate labels seem uncommon to me though. I guess this comes down to figuring out the cost of maintaining an alternate iloc optimization code path vs the value of how often this code path will be used. I don't have a good sense of that.

I'm game to try that out and see what it looks like. Any strong feelings on approach?
I would choose (b) personally. I think that it's likely to be easier and have broader-reaching effects.

> On Mon, Jun 29, 2020 at 9:17 AM Gil Forsyth wrote:
>
> Quick thought: what if instead we changed the definition of df.iloc[:, ...] to call df[...] in common case simple situations?
>
> I'm game to try that out and see what it looks like.
> I can think of two ways to handle this:
> a) tweak the HLG and replace iloc with getitem there
> b) change the iloc method on dask.dataframe.core to do a column name lookup and then dispatch to __getitem__
> Any strong feelings on approach?
So this may not be needed after #6355?

Yep, closing in favor of #6355
Starting to address one of the points raised by @martindurant in #6264.
Adds a high level graph optimization to handle a `read_parquet` followed by a full-column `iloc` by updating the `ParquetSubgraph`. Very much a draft and very much not completely working.
Need to:

* [ ] ~~not assume that all calls to `iloc` are for full column selections~~ dask only supports full column selection, so nm for now
* [ ] consolidate with the existing `getitem` optimizations as there are a bunch of similarities (and a number of differences)

But, initial simple benchmark:
timeseries parquet file created with `dask.dataframe.demo` with 4 columns and 34447680 rows

on master:

with optimization: