Ensure natural sort order in parquet part paths#7249
Conversation
dask/dataframe/io/parquet/utils.py
Outdated
|
|
||
| # numeric rather than glob ordering | ||
| out_list = sorted(out_list, key=natural_sort_key) | ||
| file_list = sorted(file_list, key=natural_sort_key) |
There was a problem hiding this comment.
Shouldn't this go up before line 455 where the path_parts_list is created and then the out_list should not be sorted at all. The way this is I am a little worried that the file_list and the out_list will be ordered differently.
There was a problem hiding this comment.
good point. The changes to this file are all gone now though, based on @rjzamora's suggestion below. Files should be sorted by callers pre-_analyze_paths.
Maybe a second sort in _analyze_paths + warning-log makes sense, to check that callers aren't making this mistake?
| return ( | ||
| file_list, | ||
| "/".join(basepath), | ||
| out_list, | ||
| ) # use '/'.join() instead of _join_path to be consistent with split('/') |
There was a problem hiding this comment.
I realize that _analyze_paths is supposed to be "private", but there are other libraries (like nvtabular) that are actually using this utility (and expecting two elements). It is probably best for those libraries to move away from using any private APIs asap, but I'd like to avoid changing the API (at leasts for default usage) in the short term.
Perhaps you could either (1) Introduce an optional sort= kwarg that will result in the sorted paths being returned, or (2) Introduce a simple _sort_and_analyze_paths utility to wrap the existing _analyze_paths (sorting input paths before, and out_list after). Sorry - I know your current solution is a bit cleaner :/
I also agree with @jsignell that it may be best to sort the input paths at the begining of (or before) _analyze_paths.
There was a problem hiding this comment.
I simplified things by just sorting at every _analyze_paths call-site, prior to _analyze_paths.
Re-reading your comment now, I guess that wasn't actually one of your suggested fixes 😁
Making a _sort_and_analyze_paths and using it in this code-base (and encouraging any ecosystem users to switch to it) sounds good. I guess a sort=False param to _analyze_paths that changes the arity of the returned tuple works too.
Let me know what you prefer.
There was a problem hiding this comment.
alright I went ahead and moved to _sort_and_analyze_paths; I'll push it here once CI passes on the current commit, so we have two ✅s to choose from
|
Looks like there was a failure in |
rjzamora
left a comment
There was a problem hiding this comment.
I think I am good with the addition of _sort_and_analyze_paths here. I like that the solution is simple. With that said, it may still be useful to have a sort= argument to _analyze_paths instead, in case we decide later to allow path sorting to be dissabled. For example, I can imagine a use case where the disired order of partitions does not correspond to the sorted order (but to the explicit order of the input).
| paths = sorted(paths, key=natural_sort_key) # numeric rather than glob ordering | ||
| ap_kwargs = {"root": root_dir} if root_dir else {} | ||
| root_dir, fns = _analyze_paths(paths, fs, **ap_kwargs) | ||
| paths, root_dir, fns = _sort_and_analyze_paths(paths, fs, **ap_kwargs) |
There was a problem hiding this comment.
I take it we don't need the paths = sorted(paths, key=natural_sort_key) line here if the sorting will be done in _sort_and_analyze_paths anyway?
There was a problem hiding this comment.
yes, also an oversight, sorry about that, fixed
dask/dataframe/io/parquet/arrow.py
Outdated
| @@ -23,6 +23,7 @@ | |||
| _analyze_paths, | |||
There was a problem hiding this comment.
Do we still need this import? It seems that the remaining use of _analyze_paths could also use the new api.
There was a problem hiding this comment.
good pt, fixed
dask/dataframe/io/parquet/arrow.py
Outdated
| allpaths = sorted(allpaths, key=natural_sort_key) | ||
| _, base, fns = _sort_and_analyze_paths(allpaths, fs) |
There was a problem hiding this comment.
How is this different from allpaths, base, fns = _sort_and_analyze_paths(allpaths, fs) ?
There was a problem hiding this comment.
good catch, fixed
highlight situations where with >10 partitions, statistics are not gathered due to incorrect numeric sort
|
Sorry for those oversights @rjzamora, and thanks for the close review. I think I addressed everything. I stopped short of adding the I hesitated bc it's hard for me to imagine ever switching the default to The new Open to your thoughts though. |
rjzamora
left a comment
There was a problem hiding this comment.
Thanks for the work here @ryan-williams !
|
Yeah thanks for sticking with this @ryan-williams ! |
black dask/flake8 daskdivisions)