Skip to content

Preserve ordering of user-provided path list in read_parquet#10061

Open
rjzamora wants to merge 16 commits intodask:mainfrom
rjzamora:sort_paths-option
Open

Preserve ordering of user-provided path list in read_parquet#10061
rjzamora wants to merge 16 commits intodask:mainfrom
rjzamora:sort_paths-option

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Mar 14, 2023

The current read_parquet implementation includes several hard-coded path-sorting operations, and multiple users have expressed an interest if making it possible to override this behavior. The solution in this PR is to add a new sort_key option (as suggested by @alexgorban). [EDIT: The revised approach is to preserve the order of user-provided path-list input, but to use a "natural" sort for input containing a directory or glob-string pattern]

@rjzamora rjzamora added dataframe parquet enhancement Improve existing functionality or make things work better labels Mar 14, 2023
@github-actions github-actions bot added the io label Mar 14, 2023
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora! I'm curious what's stopping us from just using the same file order if a list of paths are provided? This seems niche enough that adding a new public keyword to read_parquet feels like too much.

@rjzamora
Copy link
Member Author

I'm curious what's stopping us from just using the same file order if a list of paths are provided?

The only reason is that it would technically be a "breaking" change. The convention to use natural sort order pre-dates my involvement with Dask, and may even pre-date pandas' ability to handle a list of files. That said, I do think it would be reasonable to eventually change the default behavior when the user provides a list of paths.

@jrbourbeau
Copy link
Member

I personally wouldn't worry too much about the change in behavior when a list of files is provided (we should continue to use natural sort when globbing directories). TIL pandas.read_parquet supports a list of files! It also looks like pandas uses the given file order when handed a list of file paths:

In [7]: pd.read_parquet(["foo.parquet/part.0.parquet", "foo.parquet/part.1.parquet"]).head()
Out[7]:
                       name    id         x         y
timestamp
2000-01-01 00:00:00  George   908  0.031730 -0.647870
2000-01-01 00:00:01   Sarah   961  0.481720 -0.162815
2000-01-01 00:00:02  Victor  1024 -0.050353  0.729972
2000-01-01 00:00:03   Laura   969 -0.122235  0.652741
2000-01-01 00:00:04     Ray   995 -0.081051 -0.496389

In [8]: pd.read_parquet(["foo.parquet/part.1.parquet", "foo.parquet/part.0.parquet"]).head()
Out[8]:
                       name    id         x         y
timestamp
2000-01-02 00:00:00   Kevin   983  0.620474  0.586924
2000-01-02 00:00:01  Ursula   979 -0.186794 -0.714156
2000-01-02 00:00:02     Ray  1015 -0.919861 -0.421275
2000-01-02 00:00:03   Jerry   957  0.215248 -0.702655
2000-01-02 00:00:04  Ingrid   999  0.367679 -0.838501

@rjzamora
Copy link
Member Author

You make a good point that, although "breaking", it is pretty-easy to motivate the decision to avoid sorting a user-provided list of paths ("pandas doesn't do it").

Comment on lines +520 to +528
def _sort_glob_paths(paths, urlpath):
# Always use "natural" order to sort a list of
# paths that originated from a "glob" pattern.
# We assume this is the case if the `paths` list
# is longer than the user-provided `urlpath`.
urlpath = urlpath if isinstance(urlpath, list) else [urlpath]
if len(paths) > len(urlpath):
return _maybe_sort_paths(paths)
return paths
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrbourbeau - The plan to align with pandas feels pretty clean for the case of a user-provided path list.

One remaining complication is what to do about a glob pattern (e.g. "/data/*.parquet" or ["/data/month=10/*.parquet", "/data/month=9/*.parquet"]). Since pandas does not support glob patterns, we cannot use pd.read_parquet as a guide for these cases.

In the latest commit (7af2e61), I am assuming that we should still use a natural sort on the "expanded" path list if there is a glob pattern anywhere in the user-provided path string or list. Do you think this makes sense? Or, should we start using "glob ordering" when users provide a glob pattern? (my intuition is that this would be a bit too much of a "breaking" change)

Copy link
Member Author

@rjzamora rjzamora Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note on the ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] case: It would be difficult to preserve the user-provided list order while also using natural ordering to expand each of the glob patterns. This is because we lean on fsspec to expand the nested paths into a flat list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though we can make it happen if a "hybrid" approach is clearly best - which may be the case)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delayed response.

I am assuming that we should still use a natural sort on the "expanded" path list if there is a glob pattern anywhere in the user-provided path string or list

Yeah, I think I agree with that. Continuing to use natural ordering for a single glob string (e.g. "/data/month=10/*.parquet") makes sense. My guess is we do that today so that files like

foo.parquet
├── part.0.parquet
├── part.1.parquet
...
└── part.9.parquet

have the number in the filepath correspond to the partition in the DataFrame.

For the ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] case, what are your thoughts on just using our existing natural sorting in a loop so we maintain list order relative to each expanded glob string, but natural ordering within each glob string. For example, if we expand ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] we get something like:

[
"/data/month=10/00.parquet",
"/data/month=10/01.parquet",
"/data/month=10/02.parquet",
...,
"/data/month=9/00.parquet",
"/data/month=9/01.parquet",
"/data/month=9/02.parquet",
...
]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are your thoughts on just using our existing natural sorting in a loop so we maintain list order relative to each expanded glob string, but natural ordering within each glob string.

Right. This was my initial thought as well, but I held off on implementing it after I realized we rely entirely on fsspec to expand paths, and would need to modify fsspec and/or re-implement custom utilities in dask.

Copy link
Member Author

@rjzamora rjzamora Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. This shouldn't be too difficult.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - Latest commit should now "properly" handle the list-of-globs case discussed here (preserve list order while sorting expanded glob patterns for each item in the user-provided list).

@rjzamora rjzamora changed the title Add sort_key option to read_parquet Preserve ordering of user-provided path list in read_parquet Mar 20, 2023
@rjzamora
Copy link
Member Author

Any other thoughts on this one @jrbourbeau ? I like your idea to preserve list order, but not sure if you agree with glob-pattern reordering.

pyproject.toml Outdated
"click >= 7.0",
"cloudpickle >= 1.1.1",
"fsspec >= 0.6.0",
"fsspec >= 0.7.4",
Copy link
Member Author

@rjzamora rjzamora Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need fsspec>=0.7.40.8.0 (just under three-years old) to use url_to_fs. Not sure if I found all the places this pin needs to be changed.

@rjzamora
Copy link
Member Author

I'm assuming codecov is wrong, and CI is currently passing here. @jrbourbeau - Let me know if you have any concerns with changing the fsspec mindep.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming codecov is wrong

Yeah, me too. xref #10165

Let me know if you have any concerns with changing the fsspec mindep

Nope! In fact, @charlesbluca is bumping it to fsspec>=2021.9.0 in #10161

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataframe enhancement Improve existing functionality or make things work better io needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. parquet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Custom sort key for dataframe.read_parquet with multiple paths as input dask DataFrame.read_parquet doesn't preserve the paths order

2 participants