Skip to content

Conversation

@lidavidm
Copy link
Member

@lidavidm lidavidm commented May 7, 2021

Now by default, directory/hive partitioning will URL-decode potential partition values before trying to parse them, since systems like Spark apparently may URL-encode the values in some cases. Note for Hive partitioning, this applies only to the value, not to the key itself. This behavior can be toggled.

@github-actions
Copy link

github-actions bot commented May 7, 2021

@lidavidm
Copy link
Member Author

lidavidm commented May 7, 2021

Ah, and so it turns out there is indeed a very good reason to url-escape names even on local file systems: Windows doesn't like characters like :. Fun! These test cases need reworking, then…

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the R changes here, and I only skimmed through the dataset tests. Perhaps @bkietz can validate the latter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious: is there any reason we don't utf8-validate here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because we didn't before, but perhaps we should.

Copy link
Member

@pitrou pitrou Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would seem more consistent. I don't see any reason to make a difference between the two cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid at some point, we'll have to deal with other encodings (e.g. hex, or some system-specific oddity). Perhaps make this an enum instead, so that it's open-ended?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, I think KeyValuePartitioningOptions options = {} can be used as a shorter spelling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you don't use the same spelling as below, i.e. segment.substr(name_end + 1)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to avoid a copy, but this can be done as string_view(segment).substr(...) instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a boolean option, I'd rather make it keyword-only, i.e. def __init__(..., *, bint url_decode_segments=True).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but we use Uri in most of the code base (including uri.h :-)).

@lidavidm
Copy link
Member Author

lidavidm commented Jun 2, 2021

I added UTF-8 validation and renamed URL to URI.

@pitrou
Copy link
Member

pitrou commented Jun 2, 2021

It seems there are CI failures on Windows.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for cleaning this up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants