-
Notifications
You must be signed in to change notification settings - Fork 116
Base64 part2 : base64url, UTF-8 inputs and base64_to_binary_safe #382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@bakkot @Jarred-Sumner : comments and review invited. This will be part of a new major release. |
|
Excellent! A few minor things at a quick glance:
|
|
@bakkot Thank you for the comments. I will apply fixes soon. |
|
@bakkot Thanks. I have answered your concerns in a later commit. A BASE64_INPUT_REMAINDER is not considered an error in the sense that you can continue the processing. When an invalid character is encountered, the error is considered fatal (I made this clear now). This means that the 'state' is purposefully undocumented: you are not expected to rely on the state. In the example provide, it was indeed input bytes, but you are correct that in the general case, it is in terms of units (which can be 8-bit or 16-bit units). I have temporarily disabled the RVV tests since they fail to even start due to setup failures. |
| simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output) const noexcept { | ||
| return compress_decode_base64(output, input, length); | ||
| simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output, base64_options options) const noexcept { | ||
| return (options & base64_url) ? compress_decode_base64<true>(output, input, length, options) : compress_decode_base64<false>(output, input, length, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the conditional and directly pass it to the template argument of the function. Removal of a branch is a good :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is indeed a runtime branch here, and it is not free, but pushing it down might not make disappear. A different option would be to have distinct functions for base64url and regular base64, but I thought it was not very nice from an API point of view.
| size_t input_index = safe_input; | ||
| while(offset > 0 && input_index > 0) { | ||
| chartype c = input[--input_index]; | ||
| if(c == '=' || c == '\n' || c == '\r' || c == '\t' || c == ' ') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is common we can just create a table for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code could be faster and simpler, but I am adding a comment to explain why it appears unoptimized:
// offset is a value that is no larger than 3. We backtrack
// by up to offset characters + an undetermined number of
// white space characters. It is expected that the next loop
// runs at most 3 times + the number of white space characters
// in between them, so we are not worried about performance.
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>
WojciechMula
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Haven't spotted anything suspicious.
| // base64_options are used to specify the base64 encoding options. | ||
| using base64_options = uint64_t; | ||
| enum : base64_options { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum class maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I am open to using an enum class but I am a bit concerned but I really want this to be a bitset that we can extend later to support different options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I verified that enum classes do not work well as bitsets:
enum class Joe : uint64_t {
default_base64 = 0,
url_base64 = 1,
allow_spaces = 4,
allow_padding = 8,
};
int main() {
Joe t = Joe::default_base64 | Joe:: allow_padding; // Will not compile
}The idea here is to be able to extend the API without too much of a mess by allowing 'options' if people have a great need for them. Like we could disable white spaces in a later version. Enum classes makes this more difficult than need be.
Granted, they are safer but we could validate the values if we are concerned.
|
Thanks for the great reviews. Merging. |
References: