Skip to content

Add optional information about originating function call in DataFrameIOLayer#8453

Merged
rjzamora merged 3 commits intodask:mainfrom
rjzamora:manual-parquet-hlg-filters
Jan 22, 2022
Merged

Add optional information about originating function call in DataFrameIOLayer#8453
rjzamora merged 3 commits intodask:mainfrom
rjzamora:manual-parquet-hlg-filters

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Dec 3, 2021

This PR does two simple things:

  1. It adds a new (optional) creation_info argument to DataFrameIOLayer, which simply stores the callable (in "func"), positional arguments (in "args"), and key-word arguments (in "kwargs") used to create the original DataFrame collection. The motivation for this PR is to make basic predicate-pushdown optimizations much simpler in both dask.dataframe.optimize, and in dask-sql. In dask-sql, where it should be possible to extract a DNF-formatted filters argument from Calcite, the creation_info attribute should be all that is needed for filters to be pushed into the underlying read_parquet call.
  2. It removes column_projection from DataFrameLayer, since that method only makes sense in DataFrameIOLayer. Note that I would be happy to move this small change into a standalone PR. However, I am including it here since I accidentally used the same branch for the creation_info change :)

@rjzamora rjzamora added io highlevelgraph Issues relating to HighLevelGraphs. parquet labels Dec 3, 2021
Comment on lines +2514 to +2519
# However, we can use `creation_info` to regenerate
# the same collection with `filters` defined
info = ddf2.dask.layers[ddf2._name].creation_info
kwargs = info.get("kwargs", {})
kwargs["filters"] = filters
ddf3 = info["func"](*info.get("args", []), **kwargs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @quasiben @charlesbluca @jdye64

I am hoping that having access to the original function and arguments like this will make predicate-pushdown relatively straightforward once we have extracted the filters.

Unlike column projection, we cannot just modify the io_func object living within the original DataFrameIOLayer, because the filtering happens up front (before the graph is even generated).

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Jan 3, 2022
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand optimization well enough to comment on the overall strategy here. But this seems reasonable enough to me.

dask/layers.py Outdated
@@ -345,20 +345,7 @@ def fractional_slice(task, axes):
class DataFrameLayer(Layer):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this class even exist anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, it acts only as a "label" that we could use for complex HLG optimizations (but that we are not actually using at all yet). So, we could certainly remove it, but will need to add something like it for multi-layer column projection and/or multi-layer predicate pushdown.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this base class will eventually have some things defined on it (for column projection / predicate pushdown)? Or will it always just be an empty sentinel?

As an empty sentinel, it feels a little weird to me. Like I assume it's supposed to indicate that its subclasses all support some type of interface. I think it would be nice for that interface to be defined (even as an abstract method) on the base class, just for readability.

warnings.warn(
f"{type(new)} return type for project_columns is deprecated. "
f"This method should now return the projected Layer only.",
FutureWarning,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is going through a deprecation cycle even necessary here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not - Just makes me nervous to silently allow deprecated behavior. However, we could also just let non-conforming code fail.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's just a question of whether DataFrameLayer subclasses are a public API right now. Based on the discussion we recently had with @ian-r-rose, I feel like basically the answer is no—people shouldn't be developing against HLG layers yet.

This little deprecation is simple and fine, but it's even simpler to not have to remember to fully remove it at some point. If internal Dask code is the only thing using this interface, seems fine to me.

@gjoseph92
Copy link
Collaborator

Responding to @rjzamora's comment on #8487 (comment):

My interest in leaving DataFrameLayer around for now (1), is that I do intend to use it to store necessary information for multi-layer column projection (and evenutally predicate pushdown). For example, we may use that class to define a base required_columns method designed to return the input columns required to produce a specific set of output columns (for that specific Layer). This would be similar to project_columns, but we would be returning a set of column names, rahter than a new Layer.

I would be +0 on still removing DataFrameLayer for now, and adding it back in when we do add these column projection features. Seems like a pretty small diff to make in both cases, but leaves the code in a cleaner state in the interim, so that anyone else reading it who didn't happen to see this conversation on GitHub doesn't have to wonder why that dead code is there.

Regarding (2): I have no problem with removing the deprecation warning and just letting old behavior fail. However, I seem to remember @ian-r-rose telling me that there is at least one DataFrameIOLayer definition in down-stream code.

If there's downstream use, that would be really good to know about. I didn't realize there was any. I'd still be tempted to not do the deprecation cycle though, and just coordinate with that downstream code to make the change?

@ian-r-rose
Copy link
Collaborator

Regarding (2): I have no problem with removing the deprecation warning and just letting old behavior fail. However, I seem to remember @ian-r-rose telling me that there is at least one DataFrameIOLayer definition in down-stream code.

If there's downstream use, that would be really good to know about. I didn't realize there was any. I'd still be tempted to not do the deprecation cycle though, and just coordinate with that downstream code to make the change?

It's at least used in dask-bigquery and dask-snowflake. I seem to recall at least one other place where there were some experiments with it, but I'm having a hard time coming up with it right now.

@ian-r-rose
Copy link
Collaborator

Ah! It was with GDAL in dask-geopandas! https://github.com/geopandas/dask-geopandas/pull/123/files

@ian-r-rose
Copy link
Collaborator

I should note, all of those I listed are using DataFrameIOLayer, so I think you're probably in the clear

@rjzamora
Copy link
Member Author

rjzamora commented Jan 21, 2022

Okay - I removed DataFrameLayer and the FutureWarning in optimize_dataframe_getitem. We can always add back DataFrameLayer when it actually has a purpose, and the project_columns change is very unlikely to cause problems for downstream users of DataFrameIOLayer (because they are most likely to implement project_columns in the IO function wrapper (where behavior has not changed).

Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@rjzamora rjzamora merged commit 92140f0 into dask:main Jan 22, 2022
@rjzamora rjzamora deleted the manual-parquet-hlg-filters branch January 22, 2022 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataframe highlevelgraph Issues relating to HighLevelGraphs. io needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. parquet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants