Add optional information about originating function call in DataFrameIOLayer#8453
Add optional information about originating function call in DataFrameIOLayer#8453
Conversation
…alled on a DataFrameIOLayer
| # However, we can use `creation_info` to regenerate | ||
| # the same collection with `filters` defined | ||
| info = ddf2.dask.layers[ddf2._name].creation_info | ||
| kwargs = info.get("kwargs", {}) | ||
| kwargs["filters"] = filters | ||
| ddf3 = info["func"](*info.get("args", []), **kwargs) |
There was a problem hiding this comment.
cc @quasiben @charlesbluca @jdye64
I am hoping that having access to the original function and arguments like this will make predicate-pushdown relatively straightforward once we have extracted the filters.
Unlike column projection, we cannot just modify the io_func object living within the original DataFrameIOLayer, because the filtering happens up front (before the graph is even generated).
gjoseph92
left a comment
There was a problem hiding this comment.
I don't really understand optimization well enough to comment on the overall strategy here. But this seems reasonable enough to me.
dask/layers.py
Outdated
| @@ -345,20 +345,7 @@ def fractional_slice(task, axes): | |||
| class DataFrameLayer(Layer): | |||
There was a problem hiding this comment.
Should this class even exist anymore?
There was a problem hiding this comment.
At this point, it acts only as a "label" that we could use for complex HLG optimizations (but that we are not actually using at all yet). So, we could certainly remove it, but will need to add something like it for multi-layer column projection and/or multi-layer predicate pushdown.
There was a problem hiding this comment.
Do you think this base class will eventually have some things defined on it (for column projection / predicate pushdown)? Or will it always just be an empty sentinel?
As an empty sentinel, it feels a little weird to me. Like I assume it's supposed to indicate that its subclasses all support some type of interface. I think it would be nice for that interface to be defined (even as an abstract method) on the base class, just for readability.
dask/dataframe/optimize.py
Outdated
| warnings.warn( | ||
| f"{type(new)} return type for project_columns is deprecated. " | ||
| f"This method should now return the projected Layer only.", | ||
| FutureWarning, |
There was a problem hiding this comment.
Is going through a deprecation cycle even necessary here?
There was a problem hiding this comment.
Probably not - Just makes me nervous to silently allow deprecated behavior. However, we could also just let non-conforming code fail.
There was a problem hiding this comment.
I guess it's just a question of whether DataFrameLayer subclasses are a public API right now. Based on the discussion we recently had with @ian-r-rose, I feel like basically the answer is no—people shouldn't be developing against HLG layers yet.
This little deprecation is simple and fine, but it's even simpler to not have to remember to fully remove it at some point. If internal Dask code is the only thing using this interface, seems fine to me.
|
Responding to @rjzamora's comment on #8487 (comment):
I would be +0 on still removing
If there's downstream use, that would be really good to know about. I didn't realize there was any. I'd still be tempted to not do the deprecation cycle though, and just coordinate with that downstream code to make the change? |
It's at least used in |
|
Ah! It was with GDAL in dask-geopandas! https://github.com/geopandas/dask-geopandas/pull/123/files |
|
I should note, all of those I listed are using |
|
Okay - I removed |
This PR does two simple things:
creation_infoargument toDataFrameIOLayer, which simply stores the callable (in "func"), positional arguments (in "args"), and key-word arguments (in "kwargs") used to create the original DataFrame collection. The motivation for this PR is to make basic predicate-pushdown optimizations much simpler in both dask.dataframe.optimize, and in dask-sql. In dask-sql, where it should be possible to extract a DNF-formattedfiltersargument from Calcite, thecreation_infoattribute should be all that is needed for filters to be pushed into the underlyingread_parquetcall.column_projectionfromDataFrameLayer, since that method only makes sense inDataFrameIOLayer. Note that I would be happy to move this small change into a standalone PR. However, I am including it here since I accidentally used the same branch for thecreation_infochange :)