
[POC] Introduce partition_metadata attribute to DataFrame#9473

Draft
rjzamora wants to merge 34 commits intodask:mainfrom
rjzamora:partition-stats

Conversation

@rjzamora (Member) commented Sep 8, 2022

This is a rough POC intended to illustrate how new PartitionMetadata and PartitionStatistics classes can be used to accomplish a few impactful goals in Dask-DataFrame. The goals:

  1. Isolate DataFrame-collection metadata (meta and divisions) in one place (simplifying future expansion and possible movement into HLG/Layer)
  2. Add a mechanism to track column partitioning beyond the conventional divisions. Motivation: Track whether DataFrame partitions are sorted, even if divisions are unknown? #9425
  3. Add a mechanism to track partition-wise statistics (e.g. partition lengths and min/max statistics for each column). Motivation: Add partition lengths to DataFrame metadata. #5633, Use parquet metadata to get length #6387, Use parquet metadata to calculate len #7912

If we ultimately decide that a change like this makes sense, I suggest that we break the work into three distinct stages:

  1. Move meta and divisions management into a new/distinct "partition_metadata" attribute using a new PartitionMetadata class (specific attribute and class names are up for debate)
  2. Add partitioned_by functionality to PartitionMetadata (and therefore DataFrame/Series)
  3. Add mechanism to track partition statistics to PartitionMetadata

TODO: Add a clearer breakdown of the proposed design changes, and include some explicit user-code examples.
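To make the proposal concrete, here is a minimal sketch of what a consolidated partition-metadata container might look like. All names (`PartitionMetadata`, `partitioned_by`, `statistics`) are assumptions based on the stages listed above, not the actual PR implementation:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PartitionMetadata:
    # Hypothetical container consolidating DataFrame-collection metadata.
    # `meta` would normally be an empty pandas object; it is left untyped
    # here to keep the sketch dependency-free.
    meta: object = None
    # Division boundaries, or a tuple of Nones when unknown
    divisions: tuple = (None, None)
    # Columns each partition is known to be partitioned by (stage 2)
    partitioned_by: tuple = ()
    # Per-partition statistics, e.g. {"num_rows": [100, 98, 102]} (stage 3)
    statistics: dict = field(default_factory=dict)

    @property
    def npartitions(self) -> int:
        # Number of partitions is always one less than the division count
        return len(self.divisions) - 1


pm = PartitionMetadata(divisions=(0, 50, 100), statistics={"num_rows": [50, 50]})
```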

@rjzamora rjzamora added dataframe enhancement Improve existing functionality or make things work better labels Sep 8, 2022
@github-actions github-actions bot added the io label Sep 8, 2022
@ian-r-rose (Collaborator) left a comment

Sorry to take so long to read through this @rjzamora! I'm very much in favor of the general approach you've taken here. I think that the relatively simple (one could say impoverished) set of metadata tracked on dask dataframes prevents a number of possible improvements, and it's due for a redesign.

Most of my comments are in the vein of trying to make an API that is difficult to get wrong, while retaining the richness that we want. I realize this is a POC, so excuse me if my comments are overly detailed, or are towards things that you deliberately decided to defer until later.

# it is also partitioned by ("A", "B", ...)
if _by[: len(group)] == group:
return self.partition_metadata.partitioning[group]
return False
Collaborator:

Rather than returning False for a non-string, non-list, or non-tuple type, I think it would make more sense to raise a TypeError and document the reasonable input types.
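A sketch of what that suggestion could look like, as a hypothetical standalone helper (the function name and normalization to a tuple are assumptions, not code from the PR):

```python
def validate_partitioning_key(by):
    # Normalize a partitioning key to a tuple of column names,
    # raising TypeError for unsupported input instead of returning False.
    if isinstance(by, str):
        return (by,)
    if isinstance(by, (list, tuple)):
        return tuple(by)
    raise TypeError(
        f"Expected str, list, or tuple of column names; got {type(by).__name__}"
    )
```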

Comment on lines +321 to +322
- <column-name>: DataFrame of stats for the column
- Required DataFrame columns: "min" and "max"
Collaborator:

Why use a dataframe? For something this small, I'd imagine a namedtuple or dataclass might be easier. Though perhaps I'm still using the old way of thinking where we try to avoid using numpy/pandas for graph logic.
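The dataclass alternative being suggested might look something like this (a sketch; the `ColumnStats` name and layout are assumptions):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    # Hypothetical per-partition min/max record for one column,
    # as a lighter-weight alternative to a per-column DataFrame of stats.
    min: object
    max: object


# One ColumnStats per partition, keyed by column name
stats = {"col0": [ColumnStats(0, 9), ColumnStats(10, 19)]}
```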

Member Author:

Yeah, I was certainly unsure about this. My only insight from parsing statistics in the parquet world is that it is much faster to do things with the column statistics when they are represented in an array-like format. For example, Parquet effectively gives you a distinct {"col0": {"min": <val>, "max": <val>}} dictionary for each row-group, and this makes it extremely slow to calculate divisions or apply filters for many-partition datasets. In order to aggregate multiple row-groups into a dask partition, we typically start by converting to a pandas DataFrame.
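The aggregation pattern being described can be sketched with toy data: convert the per-row-group dictionaries into a DataFrame, then reduce groups of row-groups into partitions with vectorized operations (the grouping of two row-groups per partition is an illustrative assumption):

```python
import pandas as pd

# One {"min": ..., "max": ...} dict per parquet row-group (toy data)
row_group_stats = [
    {"min": 0, "max": 4},
    {"min": 5, "max": 9},
    {"min": 10, "max": 14},
    {"min": 15, "max": 19},
]

# Tabular form enables vectorized reductions instead of a Python loop
df = pd.DataFrame(row_group_stats)

# Aggregate pairs of row-groups into dask partitions
groups = df.index // 2
partition_stats = df.groupby(groups).agg({"min": "min", "max": "max"})
```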

I should note that I'm not particularly fond of the current design here just yet. I was certainly aiming for rough changes that someone like you would help me improve :)


def __init__(
self,
statistics: dict | None = None,
Collaborator:

This interface smells a bit off to me. Rather than accept a single dictionary with very specific structure, why not just use constructor arguments? It's unfortunate for greenfield API design to already have magic values ("__num_rows__").
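The explicit-argument constructor being suggested might look like the following sketch (argument names `num_rows` and `column_stats` are hypothetical):

```python
class PartitionStatistics:
    # Sketch of an explicit-argument constructor, avoiding a single
    # dict with magic keys like "__num_rows__".
    def __init__(self, num_rows=None, column_stats=None):
        # Optional list of per-partition row counts
        self.num_rows = num_rows
        # Optional mapping of column name -> per-partition stats
        self.column_stats = column_stats or {}


ps = PartitionStatistics(num_rows=[3, 4])
```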

Member Author:

Yes - The "__num_rows__" magic key needs to go!

What I am trying to do here is come up with a way to both separate and couple the various types of partition statistics at the same time. This is because we want to be able to access both column and num-rows statistics individually, but we also want to be able to use the same callback (maybe delayed?) function to load multiple statistics "lazily" at the same time.

I am still struggling a bit to come up with an elegant solution for this.
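One possible shape for the "separate but coupled" requirement is a single loader callback that materializes every statistic at once, cached so that individual accesses share one load. This is a sketch of the idea under discussion, not the PR's implementation:

```python
class LazyStatistics:
    # Sketch: one callback loads several statistics together, but each
    # statistic remains individually accessible afterwards.
    def __init__(self, loader):
        self._loader = loader  # () -> dict of all statistics
        self._cache = None

    def get(self, key):
        if self._cache is None:
            # Load everything lazily, exactly once
            self._cache = self._loader()
        return self._cache[key]


calls = []


def load_all():
    # Stand-in for e.g. a parquet-metadata scan
    calls.append(1)
    return {"num_rows": [10, 20], "col0": {"min": 0, "max": 5}}


stats = LazyStatistics(load_all)
```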

):
self._statistics = statistics or {}

def copy(self, keys: set | None = None) -> PartitionStatistics:
Collaborator:

Perhaps columns instead of keys? key is already a bit overloaded.

@property
def available_stats(self) -> set:
"""Return all available partition-statistic keys"""
return self.known_stats | self.lazy_stats
Collaborator:

Isn't this the same as set(self._statistics.keys())?

def _divisions(self):
# _divisions Compatability
raise FutureWarning(
"_Frame._divisions is depracated. " "Please use _Frame.divisions"
Collaborator:

This future warning seems wrong to me?

Comment on lines +4748 to +4754
part_sizes = self.partition_metadata.get_stats({"__num_rows__"})[
"__num_rows__"
]
if part_sizes:
return sum(part_sizes)
except KeyError:
pass
Collaborator:

I think this should just be an optional top-level attribute on the metadata, rather than keying into a dictionary with a magic name.
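The top-level-attribute alternative might look like this sketch (the `partition_lens` name is an assumption):

```python
class PartitionMetadata:
    # Sketch: expose row counts as an optional attribute instead of
    # a magic "__num_rows__" dictionary key.
    def __init__(self, partition_lens=None):
        self.partition_lens = partition_lens  # list[int] | None


def dataframe_len(metadata, fallback):
    # Cheap path when per-partition lengths are known; otherwise
    # fall back to an actual computation over the data.
    if metadata.partition_lens is not None:
        return sum(metadata.partition_lens)
    return fallback()
```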


# Use partition statistics to check if new index is already sorted
if not pre_sorted and divisions is None:
try:
Collaborator:

This is very exciting to see.

From an API standpoint, I'd like to have a way to determine if a partition is sorted without possibly raising a KeyError in normal usage. That is to say, I think this is a totally reasonable request of the metadata, so it shouldn't raise a KeyError.
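A non-raising query API could use a three-valued answer: True/False when the statistics can decide, None when the information is unavailable. This is a hypothetical sketch of that contract:

```python
def is_sorted_by(statistics, column):
    # Sketch: answer the sortedness question without raising KeyError;
    # "unknown" is a normal outcome, not an error.
    col = statistics.get(column)
    if col is None:
        return None  # statistics unavailable for this column
    mins, maxs = col["min"], col["max"]
    # Sorted across partitions if each partition's max <= next partition's min
    return all(maxs[i] <= mins[i + 1] for i in range(len(mins) - 1))


stats = {"A": {"min": [0, 10], "max": [9, 19]}}
```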

meta=no_default,
out=None,
transform_divisions=True,
partition_metadata=None,
Collaborator:

Who is responsible for making sure that meta and partition_metadata are in sync? For more internal APIs such as this, I'd probably want to just replace meta with partition_metadata.
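One way to remove the synchronization question is to make partition_metadata the single source of truth and have internal APIs read `meta` through it, so the two can never disagree. A sketch of that direction (names hypothetical):

```python
class PartitionMetadata:
    # Sketch: `meta` lives only inside partition_metadata, so callers
    # cannot pass a stale meta alongside it.
    def __init__(self, meta):
        self._meta = meta

    @property
    def meta(self):
        return self._meta


def map_partitions_sketch(func, partition_metadata):
    # Internal APIs accept only partition_metadata; meta is derived,
    # never passed as a separate (possibly out-of-sync) argument.
    return func(partition_metadata.meta)
```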



-def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
+def new_dd_object(dsk, name, meta, divisions, parent_meta=None, **metadata_kwargs):
Collaborator:

This looks like a great opportunity for consolidation of kwargs into a single metadata object (I suspect that's on your mind here, but out-of-scope for this POC):

def new_dd_object(dsk, name, partition_metadata):

@github-actions github-actions bot added the dispatch Related to `Dispatch` extension objects label Oct 26, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Oct 27, 2022
Removes unnecessary code from `dask_cudf.core._Frame` that is already handled in the super-class (`dask.dataframe.core._Frame`). By removing the unnecessary `__init__` logic from `dask_cudf`, we can avoid breakages from upstream changes like dask/dask#9473.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12001
