
Use parquet metadata to calculate len #7912

Closed
rjzamora wants to merge 8 commits into dask:main from rjzamora:metadata-len

Conversation

@rjzamora
Member

Dask does not currently save or leverage parquet-metadata statistics to calculate the length of a DataFrame collection. This PR modifies read_parquet to save the size of each partition (when this information is available and correct), and then uses it in DataFrame.__len__.

I am confident that we want to do something like this. However, I will mark this PR as a draft until we can agree on "where" the partition-size metadata should be stored. For now, I am adding it to collection_annotations, but it may also make sense to make it a DataFrameIOLayer attribute.

Member

@jsignell jsignell left a comment


I like the idea of putting it in collection_annotations

@rjzamora
Member Author

I like the idea of putting it in collection_annotations

Nice. One of my primary concerns is that these annotations are printed in the Layer html repr, and printing every partition length can become pretty verbose. Of course, the answer to that problem may just be to provide a mechanism for specifying annotations that shouldn't be displayed.

@mrocklin
Member

mrocklin commented Jul 19, 2021 via email

@rjzamora
Member Author

rjzamora commented Jul 19, 2021

Are we planning to track this information through other operations, or is
it only useful for len(dd.read_parquet(...)).

I'm still uncertain if length information is worth tracking through other operations. However, I can say for sure that the len(dd.read_parquet(...)) case is quite important for NVTabular, where we currently need to process the parquet metadata separately (and redundantly) to get the dataset size without calling len(ddf). The motivation comes from the fact that many DL dataloading APIs (for out-of-core data) require the user to specify the total length of the dataset. I am pretty annoyed by this requirement, but it's something we need to provide for now.

@martindurant
Member

  • I would certainly recommend tracking the length information! Some operations, like column select/assign or map, are guaranteed to preserve the row count. The total count would be nice to include in the graphical or text output (since it's free).
  • I think you are only using parquet statistics here? The row count per row group and globally (if there is _metadata) does not require column statistics. Statistics are optional, but row counts are mandatory. If you don't have _metadata, then of course you need to open the constituent files to get their counts (which is still faster than loading the data; this might be loaded and cached upon len).

I have not yet found the discussion where we go over the idea of storing dataframe-level attributes/metadata (which predated high-level graphs).

@rjzamora
Member Author

I think you are only using parquet statistics here? The row count per row group and globally (if there is _metadata) does not require column statistics. Statistics are optional, but row counts are mandatory.

This is a good point. The current implementation is using a statistics structure to collect the partition-wise row counts, but the row count is not actually a real "statistic". If there are no filters or index columns to deal with, the statistics list will not actually include any column-chunk information ("statistics") at all. With that said, we avoid parsing/organizing any of this metadata-based information when the user explicitly defines gather_statistics=False.

@martindurant
Member

we avoid parsing/organizing any of this metadata-based information when the user explicitly defines gather_statistics=False.

Understood - two slightly different things (parsing thrift footers versus constructing the statistics structure).

@jsignell
Member

I just came across a version of this issue that has a really nice comment from Tom. Might be worth considering some of these questions:

FWIW, I think that this is exactly the case where we'd first introduce length-aware DataFrame partitions. But the details are a bit tricky. If you're interested in pursuing this, I think we would need

  1. A proposal on how this will be stored on DataFrame / Series / Index (dask.array.Array stores ._chunks as a tuple)
  2. A way to pass through this information when creating these objects
  3. A proposal on how methods can use this information (e.g. __len__ can be an example). This will likely need to describe how things fall back to the common unknown-length case (is that automatic? Can that be configured to error?).
  4. A proposal for how methods can indicate that they're shape-preserving. e.g. DataFrame.rename(columns=...) will not affect the rows, so the lengths are preserved. But DataFrame[mask] would have to invalidate the rows.

So not insurmountable, but a good amount of work.

ref: #5633 (comment)

@jsignell jsignell linked an issue Jul 21, 2021 that may be closed by this pull request
@rjzamora
Member Author

rjzamora commented Jul 21, 2021

Thanks for digging this up, @jsignell! I totally agree with @TomAugspurger

  1. A proposal on how this will be stored on DataFrame / Series / Index (dask.array.Array stores ._chunks as a tuple)

This is exactly what I am hoping to decide on here. My original solution was to store the information at the Layer level in the collection_annotations attribute. I also suggested the possibility of attaching this information to the Layer object as a dedicated _partition_lens attribute. However, I am starting to believe that this particular type of information deserves to live as an optional _Frame attribute, e.g. _Frame._lens.

  2. A way to pass through this information when creating these objects

If we store the partition lengths in a _lens attribute, we need to update the _Frame, DataFrame, and Series constructors to accept this optional attribute. We would also need to add it to new_dd_object (since it is typically the “canonical” API for constructing a new _Frame collection in dask.dataframe).

Setting the length information this way becomes trivial in IO functions like read_parquet and from_pandas (you just pass the list into new_dd_object when it is available). For simple length-preserving operations, like assign, we can also modify elemwise to propagate these lengths (if/when they are known).

  3. A proposal on how methods can use this information (e.g. __len__ can be an example). This will likely need to describe how things fall back to the common unknown-length case (is that automatic? Can that be configured to error?).

Methods in the dask.dataframe API should be free to use _lens when the attribute is not set to None. For now, there are not many cases where we need this information (just __len__), and there are many cases where we do not want to require a dask collection to keep track of the length of every partition (shuffling, filtering, merging, etc.). So, we do not want to raise an error when the information is lost.

Note that there are probably other methods, besides __len__, that could make use of _lens. I could also imagine that many users would find a (fast) drop_empty_partitions method to be very useful.

  4. A proposal for how methods can indicate that they're shape-preserving. e.g. DataFrame.rename(columns=...) will not affect the rows, so the lengths are preserved. But DataFrame[mask] would have to invalidate the rows.

I think the primary concern is that we can propagate known lengths through elemwise methods, and some map_partitions operations. Therefore, I suspect that we will only need to add something like a preserve_partition_lens kwarg to map_partitions (which will indicate that _lens should be passed into the final new_dd_object call).
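A toy sketch of this flow, with stand-in names for _Frame / new_dd_object / elemwise (none of this is Dask's actual API): the lengths are passed in at construction when known, summed by the length query, and carried through row-count-preserving operations.

```python
# Hypothetical stand-ins for new_dd_object / __len__ / elemwise.
def new_collection(partitions, lens=None):
    return {"partitions": partitions, "_lens": lens}

def collection_len(coll):
    if coll["_lens"] is not None:
        return sum(coll["_lens"])                    # metadata path: free
    return sum(len(p) for p in coll["partitions"])   # fallback: compute

def elemwise(func, coll):
    # Elementwise operations preserve the row count, so the known
    # per-partition lengths carry over unchanged (or None stays None).
    parts = [[func(x) for x in p] for p in coll["partitions"]]
    return new_collection(parts, lens=coll["_lens"])

ddf = new_collection([[1, 2], [3, 4, 5]], lens=[2, 3])
assert collection_len(elemwise(lambda x: x + 1, ddf)) == 5
```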

meta=no_default,
enforce_metadata=True,
transform_divisions=True,
length_preserving=False,
Member

Nitpick, but the other kwargs are {verb}_{noun}; how would you feel about preserve_length?

Member Author

Makes sense. I have no idea what to call this - So, your opinion is useful, and not just a nitpick :)

)
self._meta = meta
self.divisions = tuple(divisions)
self._lens = lens
Member

Should there be any validation? Like, does it need to have length len(divisions) - 1?
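A minimal sketch of the validation being suggested (the function name is hypothetical): lens, if given, should have one entry per partition, i.e. one fewer than the number of division boundaries.

```python
def validate_lens(lens, divisions):
    """Check that lens has one entry per partition, or is None."""
    if lens is not None:
        npartitions = len(divisions) - 1
        if len(lens) != npartitions:
            raise ValueError(
                f"expected {npartitions} partition lengths, got {len(lens)}"
            )
    return lens

validate_lens([10, 20], (0, 10, 30))  # ok: two partitions, two lengths
```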

Member

@mrocklin mrocklin left a comment

Thank you for experimenting with this @rjzamora . I'm currently -1 on this approach. I think that it elevates a relatively small optimization to too high a level in the abstraction. For example, someone could easily say "well, I want to compute max" and then we have to add maxes everywhere. Same with uniqueness, emptiness, min-ness, etc. I think that we can still solve what you want to solve, but that we should find another way.

def __init__(self, dsk, name, meta, divisions=None):
# divisions is ignored, only present to be compatible with other
# objects.
def __init__(self, dsk, name, meta, divisions=None, lens=None):
Member

If we're going to add this, let's use the full word, "lengths", which I think will be clearer.

"""

def __init__(self, dsk, name, meta, divisions):
def __init__(self, dsk, name, meta, divisions, lens=None):
Member

I am currently -1 on putting this in the DataFrame constructor. I think that this is too niche a topic to be on the same level as meta/divisions. I think that we can find another way.

@mrocklin
Member

In general, I think that you should expect any change to the core metadata of dataframes on top of meta/divisions/graph to be met with extreme levels of scrutiny :)

@rjzamora
Member Author

Thanks for taking a look, @mrocklin! I am not happy with this "proposal" yet, so your comments are helpful.

In general, I think that you should expect any change to the core metadata of dataframes on top of meta/divisions/graph to be met with extreme levels of scrutiny :)

No worries. I absolutely expected a -1 from you :)

With that said, I am fairly certain that we do need to come up with a way to store/track this kind of information. So, I am doing my best to experiment with different solutions in an open-minded way.

Note that I locally implemented a Layer-centered solution that works fine, but I switched to the _Frame attribute when I realized this feature is very similar to Array._chunks, and would be much more useful at the dask.dataframe API level than at the HLG Layer level. I also realized that, even after finishing the required groundwork to ensure that all DataFrame-specific Layers are based on DataFrameLayer, we would probably be uncomfortable with partition-length information being tracked there as well. Overall, the partition-length information does not feel to me like it should be attached to the graph.

To be clear, I am very hesitant to change the _Frame/DataFrame/Series APIs. I think it is a bit obvious that collection-specific information needs to be attached to the "collection" object itself in some way, but I don't want it to live at the same level as meta/divisions/graph.

For example, someone could easy "well, I want to compute max" and then we have to add maxes everywhere. Same with uniqueness, emptiness, min-ness, etc.. I think that we can still solve what you want to solve, but that we should find another way.

I also had the same "slippery-slope" thought that users may want to track other "collection statistics" to optimize similar operations.

Perhaps a reasonable compromise here is to (1) change _lens into a more-general _partition_statistics attribute, and to (2) treat this attribute as a “second-class citizen”. That is, we could add explicit set/get methods to attach/access these optional statistics, and leave the optional attribute/information out of _Frame/DataFrame/Series initialization. This way, we would avoid any “public” API changes, but provide a formal mechanism/location for this type of information to be stored/propagated. I realize that this idea is likely to get a -1 as well, but maybe I’m getting a bit closer?

@jsignell
Member

Just as a note: Array has been doing a bit of this kind of pattern with cached_property. I personally would prefer to have _len as a cached_property of dataframe objects rather than having a grab bag of _partition_statistics.
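A minimal sketch of that pattern (class and attribute names are illustrative): the first len() call computes the count, later calls hit the cache, and an IO function that already knows the answer could pre-seed it.

```python
from functools import cached_property

class Frame:
    def __init__(self, partitions):
        self.partitions = partitions

    @cached_property
    def _len(self):
        # Fallback: count the rows (the expensive compute path).
        return sum(len(p) for p in self.partitions)

    def __len__(self):
        return self._len

# Normal path: first len() computes, subsequent calls hit the cache.
ddf = Frame([[1, 2], [3, 4, 5]])
assert len(ddf) == 5

# IO path: read_parquet already knows the counts from the footers and
# could pre-seed the cache, so len() never computes anything.
seeded = Frame([[1, 2], [3, 4, 5]])
seeded.__dict__["_len"] = 5
assert len(seeded) == 5
```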

@rjzamora
Member Author

@jsignell - I decided to experiment with your cached_property idea. Please feel welcome to advise :)

@pyrito

pyrito commented Dec 17, 2021

Is this PR still active? Are there plans to get this merged in anytime soon? @rjzamora

@jsignell
Member

I suspect that this got superseded by the discussions around a high-level expression system for encapsulating all the dataframe metadata (#7933).



Development

Successfully merging this pull request may close these issues.

Use parquet metadata to get length

5 participants