Preserve ordering of user-provided path list in `read_parquet` by rjzamora · Pull Request #10061 · dask/dask

rjzamora · 2023-03-14T15:57:48Z

The current read_parquet implementation includes several hard-coded path-sorting operations, and multiple users have expressed an interest if making it possible to override this behavior. ~~The solution in this PR is to add a new sort_key option~~ (as suggested by @alexgorban). [EDIT: The revised approach is to preserve the order of user-provided path-list input, but to use a "natural" sort for input containing a directory or glob-string pattern]

Closes Custom sort key for dataframe.read_parquet with multiple paths as input #10045
Closes dask DataFrame.read_parquet doesn't preserve the paths order #8829
Tests added / passed
Passes pre-commit run --all-files

jrbourbeau

Thanks @rjzamora! I'm curious what's stopping us from just using the same file order if a list of paths are provided? This seems niche enough that adding a new public keyword to read_parquet feels like too much.

rjzamora · 2023-03-14T16:38:30Z

I'm curious what's stopping us from just using the same file order if a list of paths are provided?

The only reason is that it would technically be a "breaking" change. The convention to use natural sort order pre-dates my involvement with Dask, and may even pre-date pandas' ability to handle a list of files. That said, I do think it would be reasonable to eventually change the default behavior when the user provides a list of paths.

jrbourbeau · 2023-03-14T16:50:07Z

I personally wouldn't worry too much about the change in behavior when a list of files is provided (we should continue to use natural sort when globbing directories). TIL pandas.read_parquet supports a list of files! It also looks like pandas uses the given file order when handed a list of file paths:

In [7]: pd.read_parquet(["foo.parquet/part.0.parquet", "foo.parquet/part.1.parquet"]).head()
Out[7]:
                       name    id         x         y
timestamp
2000-01-01 00:00:00  George   908  0.031730 -0.647870
2000-01-01 00:00:01   Sarah   961  0.481720 -0.162815
2000-01-01 00:00:02  Victor  1024 -0.050353  0.729972
2000-01-01 00:00:03   Laura   969 -0.122235  0.652741
2000-01-01 00:00:04     Ray   995 -0.081051 -0.496389

In [8]: pd.read_parquet(["foo.parquet/part.1.parquet", "foo.parquet/part.0.parquet"]).head()
Out[8]:
                       name    id         x         y
timestamp
2000-01-02 00:00:00   Kevin   983  0.620474  0.586924
2000-01-02 00:00:01  Ursula   979 -0.186794 -0.714156
2000-01-02 00:00:02     Ray  1015 -0.919861 -0.421275
2000-01-02 00:00:03   Jerry   957  0.215248 -0.702655
2000-01-02 00:00:04  Ingrid   999  0.367679 -0.838501

rjzamora · 2023-03-14T17:44:27Z

You make a good point that, although "breaking", it is pretty-easy to motivate the decision to avoid sorting a user-provided list of paths ("pandas doesn't do it").

rjzamora · 2023-03-15T14:36:26Z

dask/dataframe/io/parquet/utils.py

+def _sort_glob_paths(paths, urlpath):
+    # Always use "natural" order to sort a list of
+    # paths that originated from a "glob" pattern.
+    # We assume this is the case if the `paths` list
+    # is longer than the user-provided `urlpath`.
+    urlpath = urlpath if isinstance(urlpath, list) else [urlpath]
+    if len(paths) > len(urlpath):
+        return _maybe_sort_paths(paths)
+    return paths


@jrbourbeau - The plan to align with pandas feels pretty clean for the case of a user-provided path list.

One remaining complication is what to do about a glob pattern (e.g. "/data/*.parquet" or ["/data/month=10/*.parquet", "/data/month=9/*.parquet"]). Since pandas does not support glob patterns, we cannot use pd.read_parquet as a guide for these cases.

In the latest commit (7af2e61), I am assuming that we should still use a natural sort on the "expanded" path list if there is a glob pattern anywhere in the user-provided path string or list. Do you think this makes sense? Or, should we start using "glob ordering" when users provide a glob pattern? (my intuition is that this would be a bit too much of a "breaking" change)

Another note on the ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] case: It would be difficult to preserve the user-provided list order while also using natural ordering to expand each of the glob patterns. This is because we lean on fsspec to expand the nested paths into a flat list.

(though we can make it happen if a "hybrid" approach is clearly best - which may be the case)

Apologies for the delayed response.

I am assuming that we should still use a natural sort on the "expanded" path list if there is a glob pattern anywhere in the user-provided path string or list

Yeah, I think I agree with that. Continuing to use natural ordering for a single glob string (e.g. "/data/month=10/*.parquet") makes sense. My guess is we do that today so that files like

foo.parquet ├── part.0.parquet ├── part.1.parquet ... └── part.9.parquet

have the number in the filepath correspond to the partition in the DataFrame.

For the ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] case, what are your thoughts on just using our existing natural sorting in a loop so we maintain list order relative to each expanded glob string, but natural ordering within each glob string. For example, if we expand ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] we get something like:

[ "/data/month=10/00.parquet", "/data/month=10/01.parquet", "/data/month=10/02.parquet", ..., "/data/month=9/00.parquet", "/data/month=9/01.parquet", "/data/month=9/02.parquet", ... ]

what are your thoughts on just using our existing natural sorting in a loop so we maintain list order relative to each expanded glob string, but natural ordering within each glob string.

Right. This was my initial thought as well, but I held off on implementing it after I realized we rely entirely on fsspec to expand paths, and would need to modify fsspec and/or re-implement custom utilities in dask.

Never mind. This shouldn't be too difficult.

Okay - Latest commit should now "properly" handle the list-of-globs case discussed here (preserve list order while sorting expanded glob patterns for each item in the user-provided list).

rjzamora · 2023-03-28T17:34:51Z

Any other thoughts on this one @jrbourbeau ? I like your idea to preserve list order, but not sure if you agree with glob-pattern reordering.

rjzamora · 2023-04-07T15:53:45Z

pyproject.toml

    "click >= 7.0",
    "cloudpickle >= 1.1.1",
-    "fsspec >= 0.6.0",
+    "fsspec >= 0.7.4",


Need fsspec>=~~0.7.4~~0.8.0 (just under three-years old) to use url_to_fs. Not sure if I found all the places this pin needs to be changed.

rjzamora · 2023-04-10T21:21:56Z

I'm assuming codecov is wrong, and CI is currently passing here. @jrbourbeau - Let me know if you have any concerns with changing the fsspec mindep.

jrbourbeau

I'm assuming codecov is wrong

Yeah, me too. xref #10165

Let me know if you have any concerns with changing the fsspec mindep

Nope! In fact, @charlesbluca is bumping it to fsspec>=2021.9.0 in #10161

dask/dataframe/io/parquet/core.py

…into sort_paths-option

rjzamora added 3 commits March 14, 2023 08:34

add sort_key option

5c8008d

doc-string tweak

c3d7e2c

Merge remote-tracking branch 'upstream/main' into sort_paths-option

1e3ee54

rjzamora added dataframe parquet enhancement Improve existing functionality or make things work better labels Mar 14, 2023

github-actions bot added the io label Mar 14, 2023

jrbourbeau reviewed Mar 14, 2023

View reviewed changes

avoid sorting user-provided path lists in all cases

7af2e61

rjzamora commented Mar 15, 2023

View reviewed changes

update docstring

cb1a27e

rjzamora changed the title ~~Add sort_key option to read_parquet~~ Preserve ordering of user-provided path list in read_parquet Mar 20, 2023

rjzamora added 2 commits March 29, 2023 08:21

handle list of globs

770bf6f

Merge remote-tracking branch 'upstream/main' into sort_paths-option

9559ea7

rjzamora mentioned this pull request Mar 29, 2023

Custom sort key for dataframe.read_parquet with multiple paths as input #10045

Open

rjzamora and others added 5 commits April 4, 2023 09:20

fix grammar

6aa278b

remove extra parentheses

a2959c7

Merge remote-tracking branch 'upstream/main' into sort_paths-option

07b31b0

avoid dtype issue in test

1cb13eb

update minimum fsspec version

4608eaa

rjzamora commented Apr 7, 2023

View reviewed changes

use fsspec-0.8.0

a160eec

jrbourbeau reviewed Apr 10, 2023

View reviewed changes

dask/dataframe/io/parquet/core.py Outdated Show resolved Hide resolved

tweak docstring

e679634

rjzamora commented Apr 10, 2023

View reviewed changes

dask/dataframe/io/parquet/core.py Outdated Show resolved Hide resolved

rjzamora and others added 2 commits April 10, 2023 17:34

Update dask/dataframe/io/parquet/core.py

8f07d04

Merge branch 'sort_paths-option' of https://github.com/rjzamora/dask …

d68cdc6

…into sort_paths-option

github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Feb 26, 2024

TomAugspurger mentioned this pull request Apr 3, 2025

April, 2025 Community Meeting dask/community#417

Closed

Uh oh!

Conversation

rjzamora commented Mar 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jrbourbeau left a comment

Choose a reason for hiding this comment

Uh oh!

rjzamora commented Mar 14, 2023

Uh oh!

jrbourbeau commented Mar 14, 2023

Uh oh!

rjzamora commented Mar 14, 2023

Uh oh!

rjzamora Mar 15, 2023

Choose a reason for hiding this comment

Uh oh!

rjzamora Mar 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rjzamora Mar 15, 2023

Choose a reason for hiding this comment

Uh oh!

jrbourbeau Mar 28, 2023

Choose a reason for hiding this comment

Uh oh!

rjzamora Mar 29, 2023

Choose a reason for hiding this comment

Uh oh!

rjzamora Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rjzamora Mar 29, 2023

Choose a reason for hiding this comment

Uh oh!

rjzamora commented Mar 28, 2023

Uh oh!

rjzamora Apr 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rjzamora commented Apr 10, 2023

Uh oh!

jrbourbeau left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rjzamora commented Mar 14, 2023 •

edited

Loading

rjzamora Mar 15, 2023 •

edited

Loading

rjzamora Mar 29, 2023 •

edited

Loading

rjzamora Apr 7, 2023 •

edited

Loading