Preserve ordering of user-provided path list in read_parquet#10061
Preserve ordering of user-provided path list in read_parquet#10061
read_parquet#10061Conversation
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @rjzamora! I'm curious what's stopping us from just using the same file order if a list of paths are provided? This seems niche enough that adding a new public keyword to read_parquet feels like too much.
The only reason is that it would technically be a "breaking" change. The convention to use natural sort order pre-dates my involvement with Dask, and may even pre-date pandas' ability to handle a list of files. That said, I do think it would be reasonable to eventually change the default behavior when the user provides a list of paths. |
|
I personally wouldn't worry too much about the change in behavior when a list of files is provided (we should continue to use natural sort when globbing directories). TIL In [7]: pd.read_parquet(["foo.parquet/part.0.parquet", "foo.parquet/part.1.parquet"]).head()
Out[7]:
name id x y
timestamp
2000-01-01 00:00:00 George 908 0.031730 -0.647870
2000-01-01 00:00:01 Sarah 961 0.481720 -0.162815
2000-01-01 00:00:02 Victor 1024 -0.050353 0.729972
2000-01-01 00:00:03 Laura 969 -0.122235 0.652741
2000-01-01 00:00:04 Ray 995 -0.081051 -0.496389
In [8]: pd.read_parquet(["foo.parquet/part.1.parquet", "foo.parquet/part.0.parquet"]).head()
Out[8]:
name id x y
timestamp
2000-01-02 00:00:00 Kevin 983 0.620474 0.586924
2000-01-02 00:00:01 Ursula 979 -0.186794 -0.714156
2000-01-02 00:00:02 Ray 1015 -0.919861 -0.421275
2000-01-02 00:00:03 Jerry 957 0.215248 -0.702655
2000-01-02 00:00:04 Ingrid 999 0.367679 -0.838501 |
|
You make a good point that, although "breaking", it is pretty-easy to motivate the decision to avoid sorting a user-provided list of paths ("pandas doesn't do it"). |
dask/dataframe/io/parquet/utils.py
Outdated
| def _sort_glob_paths(paths, urlpath): | ||
| # Always use "natural" order to sort a list of | ||
| # paths that originated from a "glob" pattern. | ||
| # We assume this is the case if the `paths` list | ||
| # is longer than the user-provided `urlpath`. | ||
| urlpath = urlpath if isinstance(urlpath, list) else [urlpath] | ||
| if len(paths) > len(urlpath): | ||
| return _maybe_sort_paths(paths) | ||
| return paths |
There was a problem hiding this comment.
@jrbourbeau - The plan to align with pandas feels pretty clean for the case of a user-provided path list.
One remaining complication is what to do about a glob pattern (e.g. "/data/*.parquet" or ["/data/month=10/*.parquet", "/data/month=9/*.parquet"]). Since pandas does not support glob patterns, we cannot use pd.read_parquet as a guide for these cases.
In the latest commit (7af2e61), I am assuming that we should still use a natural sort on the "expanded" path list if there is a glob pattern anywhere in the user-provided path string or list. Do you think this makes sense? Or, should we start using "glob ordering" when users provide a glob pattern? (my intuition is that this would be a bit too much of a "breaking" change)
There was a problem hiding this comment.
Another note on the ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] case: It would be difficult to preserve the user-provided list order while also using natural ordering to expand each of the glob patterns. This is because we lean on fsspec to expand the nested paths into a flat list.
There was a problem hiding this comment.
(though we can make it happen if a "hybrid" approach is clearly best - which may be the case)
There was a problem hiding this comment.
Apologies for the delayed response.
I am assuming that we should still use a natural sort on the "expanded" path list if there is a glob pattern anywhere in the user-provided path string or list
Yeah, I think I agree with that. Continuing to use natural ordering for a single glob string (e.g. "/data/month=10/*.parquet") makes sense. My guess is we do that today so that files like
foo.parquet
├── part.0.parquet
├── part.1.parquet
...
└── part.9.parquet
have the number in the filepath correspond to the partition in the DataFrame.
For the ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] case, what are your thoughts on just using our existing natural sorting in a loop so we maintain list order relative to each expanded glob string, but natural ordering within each glob string. For example, if we expand ["/data/month=10/*.parquet", "/data/month=9/*.parquet"] we get something like:
[
"/data/month=10/00.parquet",
"/data/month=10/01.parquet",
"/data/month=10/02.parquet",
...,
"/data/month=9/00.parquet",
"/data/month=9/01.parquet",
"/data/month=9/02.parquet",
...
]
There was a problem hiding this comment.
what are your thoughts on just using our existing natural sorting in a loop so we maintain list order relative to each expanded glob string, but natural ordering within each glob string.
Right. This was my initial thought as well, but I held off on implementing it after I realized we rely entirely on fsspec to expand paths, and would need to modify fsspec and/or re-implement custom utilities in dask.
There was a problem hiding this comment.
Never mind. This shouldn't be too difficult.
There was a problem hiding this comment.
Okay - Latest commit should now "properly" handle the list-of-globs case discussed here (preserve list order while sorting expanded glob patterns for each item in the user-provided list).
sort_key option to read_parquetread_parquet
|
Any other thoughts on this one @jrbourbeau ? I like your idea to preserve list order, but not sure if you agree with glob-pattern reordering. |
pyproject.toml
Outdated
| "click >= 7.0", | ||
| "cloudpickle >= 1.1.1", | ||
| "fsspec >= 0.6.0", | ||
| "fsspec >= 0.7.4", |
There was a problem hiding this comment.
Need fsspec>=0.7.40.8.0 (just under three-years old) to use url_to_fs. Not sure if I found all the places this pin needs to be changed.
|
I'm assuming codecov is wrong, and CI is currently passing here. @jrbourbeau - Let me know if you have any concerns with changing the |
jrbourbeau
left a comment
There was a problem hiding this comment.
I'm assuming codecov is wrong
Yeah, me too. xref #10165
Let me know if you have any concerns with changing the fsspec mindep
Nope! In fact, @charlesbluca is bumping it to fsspec>=2021.9.0 in #10161
The current
read_parquetimplementation includes several hard-coded path-sorting operations, and multiple users have expressed an interest if making it possible to override this behavior.The solution in this PR is to add a new(as suggested by @alexgorban). [EDIT: The revised approach is to preserve the order of user-provided path-list input, but to use a "natural" sort for input containing a directory or glob-string pattern]sort_keyoptionpre-commit run --all-files