-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12644: [C++][Python][R][Dataset] URL-decode path segments in partitioning #10264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7652fb5 to
6962fe6
Compare
|
Ah, and so it turns out there is indeed a very good reason to url-escape names even on local file systems: Windows doesn't like characters like |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the R changes here, and I only skimmed through the dataset tests. Perhaps @bkietz can validate the latter.
cpp/src/arrow/dataset/partition.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: is there any reason we don't utf8-validate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because we didn't before, but perhaps we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would seem more consistent. I don't see any reason to make a difference between the two cases.
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid at some point, we'll have to deal with other encodings (e.g. hex, or some system-specific oddity). Perhaps make this an enum instead, so that it's open-ended?
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, I think KeyValuePartitioningOptions options = {} can be used as a shorter spelling.
cpp/src/arrow/dataset/partition.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you don't use the same spelling as below, i.e. segment.substr(name_end + 1)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to avoid a copy, but this can be done as string_view(segment).substr(...) instead.
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a boolean option, I'd rather make it keyword-only, i.e. def __init__(..., *, bint url_decode_segments=True).
cpp/src/arrow/dataset/partition.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but we use Uri in most of the code base (including uri.h :-)).
|
I added UTF-8 validation and renamed URL to URI. |
|
It seems there are CI failures on Windows. |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for cleaning this up
Now by default, directory/hive partitioning will URL-decode potential partition values before trying to parse them, since systems like Spark apparently may URL-encode the values in some cases. Note for Hive partitioning, this applies only to the value, not to the key itself. This behavior can be toggled.