Conversation
jsignell left a comment
I like the idea of putting it in collection_annotations
Nice. One of my primary concerns is that these annotations are printed in the Layer html repr, and printing every partition length can become pretty verbose. Of course, the answer to that problem may just be to provide a mechanism to specify annotations that shouldn't be written.
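As a rough illustration of that idea, here is a minimal sketch of filtering verbose annotations out of a repr. All names here (`collection_annotations`, `REPR_EXCLUDE`, `repr_annotations`) are invented for the example, not dask's actual API:

```python
# Hypothetical sketch: skip verbose annotations when rendering a repr,
# rather than printing every per-partition value.
collection_annotations = {
    "npartitions": 3,
    "partition_sizes": [100_000, 98_500, 101_200],  # too verbose to print
}

# Keys the repr should skip; a real mechanism might let layers opt out.
REPR_EXCLUDE = {"partition_sizes"}

def repr_annotations(annotations, exclude=REPR_EXCLUDE):
    """Return only the annotations that are compact enough to display."""
    return {k: v for k, v in annotations.items() if k not in exclude}
```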
Are we planning to track this information through other operations, or is it only useful for `len(dd.read_parquet(...))`?
If the answer is yes, we plan to track it, then I think we need a plan for that. I think that high level expressions are that plan. If the answer is "no, this is only useful for the very specific case of `len(dd.read_parquet(...))`", then I'm curious about how often that comes up.
I'm still uncertain if length information is worth tracking through other operations. However, I can say for sure that the `len(dd.read_parquet(...))` case is quite important for NVTabular, where we currently need to process the parquet metadata separately (and redundantly) to get the dataset size without calling `len(ddf)`. The motivation comes from the fact that many DL dataloading APIs (for out-of-core data) require the user to specify the total length of the dataset. I am pretty annoyed by this requirement, but it's something we need to provide for now.
I have not yet found the discussion where we go over the idea of storing dataframe-level attributes/metadata (which predated high-level graphs).
This is a good point. The current implementation is using a
Understood - two slightly different things (parsing thrift footers versus constructing the statistics structure)
I just came across a version of this issue that has a really nice comment from Tom. Might be worth considering some of these questions:
ref: #5633 (comment)
Thanks for digging this up @jsignell! I totally agree with @TomAugspurger.
This is exactly what I am hoping to decide on here. My original solution was to store the information at the Layer level in the `collection_annotations`.
If we store the partition lengths in a `lens` attribute on the collection, setting the length information this way becomes trivial in IO functions like `read_parquet`.
Methods in the dask.dataframe API should be free to use this length information when it is available. Note that there are probably other methods, besides `__len__`, that could take advantage of known partition lengths.
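To make the idea concrete, here is a toy sketch (all names hypothetical, not dask's real classes) of how a collection that carries known per-partition lengths can answer `len()` from metadata alone:

```python
class ToyFrame:
    """Toy stand-in for a partitioned collection (not dask's actual API)."""

    def __init__(self, partitions, lens=None):
        self.partitions = partitions  # plain lists standing in for partitions
        self._lens = lens             # per-partition row counts, if known

    def __len__(self):
        if self._lens is not None:
            # An IO layer like read_parquet could record these counts from
            # the file metadata, so the total is a cheap sum - no compute.
            return sum(self._lens)
        # Fallback: touch every partition (the expensive path avoided here).
        return sum(len(p) for p in self.partitions)
```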
I think the primary concern is that we can propagate known lengths through other operations.
dask/dataframe/core.py (Outdated)

```python
    meta=no_default,
    enforce_metadata=True,
    transform_divisions=True,
    length_preserving=False,
```
Nitpick, but the other kwargs are `{verb}_{noun}`; how would you feel about `preserve_length`?
Makes sense. I have no idea what to call this, so your opinion is useful, and not just a nitpick :)
dask/dataframe/core.py (Outdated)

```python
    )
    self._meta = meta
    self.divisions = tuple(divisions)
    self._lens = lens
```
Should there be any validation? For example, does it need to satisfy `len(lens) == len(divisions) - 1`?
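For reference, such a check could look like the sketch below (`validate_lens` is a hypothetical helper, not existing dask code). With n partitions there are n + 1 division boundaries, hence the `- 1`:

```python
def validate_lens(lens, divisions):
    """Hypothetical validation: require one known length per partition.

    A collection with n partitions carries n + 1 division boundaries,
    so a known-lengths sequence must satisfy len(lens) == len(divisions) - 1.
    """
    if lens is not None and len(lens) != len(divisions) - 1:
        raise ValueError(
            f"expected {len(divisions) - 1} partition lengths, got {len(lens)}"
        )
    return lens
```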
mrocklin left a comment
Thank you for experimenting with this @rjzamora. I'm currently -1 on this approach. I think that it elevates a relatively small optimization to too high a level in the abstraction. For example, someone could easily say "well, I want to compute max" and then we have to add maxes everywhere. Same with uniqueness, emptiness, min-ness, etc. I think that we can still solve what you want to solve, but that we should find another way.
dask/dataframe/core.py (Outdated)

```diff
-    def __init__(self, dsk, name, meta, divisions=None):
         # divisions is ignored, only present to be compatible with other
         # objects.
+    def __init__(self, dsk, name, meta, divisions=None, lens=None):
```
If we're going to add this, let's use the full word, "lengths", which I think will be clearer.
dask/dataframe/core.py (Outdated)

```diff
     """

-    def __init__(self, dsk, name, meta, divisions):
+    def __init__(self, dsk, name, meta, divisions, lens=None):
```
I am currently -1 on putting this in the DataFrame constructor. I think that this is too niche of a topic to be on the same level as meta/divisions. I think that we can find another way.
In general I think that you should expect any change to the core metadata of dataframes on top of meta/divisions/graph will be met with extreme levels of scrutiny :)
Thanks for taking a look @mrocklin! I am not happy with this "proposal" yet, so your comments are helpful.
No worries. I absolutely expected a -1 from you :) With that said, I am fairly certain that we do need to come up with a way to store/track this kind of information. So, I am doing my best to experiment with different solutions in an open-minded way. Note that I locally implemented a Layer-centered solution that works fine, but I switched to the current approach. To be clear, I am very hesitant to change the `DataFrame` constructor.
I also had the same "slippery-slope" thought that users may want to track other "collection statistics" to optimize similar operations. Perhaps a reasonable compromise here is to (1) change
Just as a note: Array has been doing a bit of this kind of pattern with
@jsignell - I decided to experiment with your `cached_property` idea. Please feel welcome to advise :)
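For readers unfamiliar with the pattern, a minimal sketch of the `cached_property` idea follows. The class and attribute names are hypothetical; the point is only that the total length is computed at most once per instance and then served from the cache:

```python
from functools import cached_property

class LazyLength:
    """Hypothetical sketch: cache the total length after first access."""

    def __init__(self, partitions, lens=None):
        self.partitions = partitions
        self._lens = lens
        self.compute_calls = 0  # counter, just to demonstrate the caching

    @cached_property
    def total_length(self):
        self.compute_calls += 1
        if self._lens is not None:
            return sum(self._lens)  # cheap: metadata only
        # Expensive path: inspect every partition (done at most once).
        return sum(len(p) for p in self.partitions)
```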
Is this PR still active? Are there plans to get this merged in anytime soon? @rjzamora |
I suspect that this got superseded by the discussions around a high-level expression system for encapsulating all the dataframe metadata: #7933
Dask does not currently save/leverage the parquet-metadata statistics to calculate the length of a DataFrame collection. This PR modifies `read_parquet` to save the size of each partition (when this information is available and correct), and then uses it in `DataFrame.__len__`.

I am confident that we definitely want to do something like this. However, I will mark this PR as a draft until we can agree on "where" the partition-size metadata should be stored. For now, I am adding it to `collection_annotations`, but it may also make sense to make it a `DataFrameIOLayer` attribute.