Skip to content

Ensure natural sort order in parquet part paths#7249

Merged
jsignell merged 5 commits intodask:mainfrom
celsiustx:parquet
Mar 25, 2021
Merged

Ensure natural sort order in parquet part paths#7249
jsignell merged 5 commits intodask:mainfrom
celsiustx:parquet

Conversation

@ryan-williams
Copy link
Contributor


  • First commit adds some tests that verify the existing behavior (incl. missing divisions)
  • Second commit fixes bug + updates tests

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this. Ping @rjzamora for visibility.


# numeric rather than glob ordering
out_list = sorted(out_list, key=natural_sort_key)
file_list = sorted(file_list, key=natural_sort_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this go up before line 455 where the path_parts_list is created and then the out_list should not be sorted at all. The way this is I am a little worried that the file_list and the out_list will be ordered differently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. The changes to this file are all gone now though, based on @rjzamora's suggestion below. Files should be sorted by callers pre-_analyze_paths.

Maybe a second sort in _analyze_paths + warning-log makes sense, to check that callers aren't making this mistake?

Comment on lines 484 to 487
return (
file_list,
"/".join(basepath),
out_list,
) # use '/'.join() instead of _join_path to be consistent with split('/')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that _analyze_paths is supposed to be "private", but there are other libraries (like nvtabular) that are actually using this utility (and expecting two elements). It is probably best for those libraries to move away from using any private APIs asap, but I'd like to avoid changing the API (at leasts for default usage) in the short term.

Perhaps you could either (1) Introduce an optional sort= kwarg that will result in the sorted paths being returned, or (2) Introduce a simple _sort_and_analyze_paths utility to wrap the existing _analyze_paths (sorting input paths before, and out_list after). Sorry - I know your current solution is a bit cleaner :/

I also agree with @jsignell that it may be best to sort the input paths at the begining of (or before) _analyze_paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified things by just sorting at every _analyze_paths call-site, prior to _analyze_paths.

Re-reading your comment now, I guess that wasn't actually one of your suggested fixes 😁

Making a _sort_and_analyze_paths and using it in this code-base (and encouraging any ecosystem users to switch to it) sounds good. I guess a sort=False param to _analyze_paths that changes the arity of the returned tuple works too.

Let me know what you prefer.

Copy link
Contributor Author

@ryan-williams ryan-williams Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright I went ahead and moved to _sort_and_analyze_paths; I'll push it here once CI passes on the current commit, so we have two ✅s to choose from

@ryan-williams
Copy link
Contributor Author

Looks like there was a failure in dask/tests/test_threaded.py::test_interrupt on macos-latest, 3.9 only; I'm guessing it is flaky?

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am good with the addition of _sort_and_analyze_paths here. I like that the solution is simple. With that said, it may still be useful to have a sort= argument to _analyze_paths instead, in case we decide later to allow path sorting to be dissabled. For example, I can imagine a use case where the disired order of partitions does not correspond to the sorted order (but to the explicit order of the input).

Comment on lines 750 to +751
paths = sorted(paths, key=natural_sort_key) # numeric rather than glob ordering
ap_kwargs = {"root": root_dir} if root_dir else {}
root_dir, fns = _analyze_paths(paths, fs, **ap_kwargs)
paths, root_dir, fns = _sort_and_analyze_paths(paths, fs, **ap_kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it we don't need the paths = sorted(paths, key=natural_sort_key) line here if the sorting will be done in _sort_and_analyze_paths anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, also an oversight, sorry about that, fixed

@@ -23,6 +23,7 @@
_analyze_paths,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this import? It seems that the remaining use of _analyze_paths could also use the new api.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good pt, fixed

Comment on lines +1640 to +1641
allpaths = sorted(allpaths, key=natural_sort_key)
_, base, fns = _sort_and_analyze_paths(allpaths, fs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from allpaths, base, fns = _sort_and_analyze_paths(allpaths, fs) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed

@ryan-williams
Copy link
Contributor Author

Sorry for those oversights @rjzamora, and thanks for the close review. I think I addressed everything.

I stopped short of adding the sort=False param to _analyze_paths, but am happy to add that if you think I should.

I hesitated bc it's hard for me to imagine ever switching the default to sort=True (since that would change the arity of the returned tuple, and downstream code-bases already assume that returned tuple shape). A new flag that will always default to the existing (less desirable in most cases) non-sorting behavior doesn't seem useful.

The new _sort_and_analyze_paths function feels easier to migrate callers to, and the "sort disabled" case is still supported by the existing _analyze_paths.

Open to your thoughts though.

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work here @ryan-williams !

@jsignell
Copy link
Member

Yeah thanks for sticking with this @ryan-williams !

@jsignell jsignell merged commit fa2f425 into dask:main Mar 25, 2021
@ryan-williams ryan-williams deleted the parquet branch March 25, 2021 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet statistics dropped due to non-numeric filename sort in dirs w/ ≥10 partitions

3 participants