Skip to content

Conversation

@lemire
Copy link
Member

@lemire lemire commented Mar 30, 2024

  • Adds base64url support in addition to regular base64 support, in both instances, we support arbitrary ASCII spaces within the base64 inputs, with full validation (as per the WHATWG forgiving-base64 standard)
  • Adds support for decoding base64 from char16_t inputs (native order), with both base64url and regular base64 support: this might be convenient when the input comes in UTF-16 order. Currently we only support native order, but we have fast (accelerated) functions to change the endianness if needed.
  • Adds a new base64_to_binary_safe which allow you to specify the size of the output buffer when decoding base64, this was an idea put forth by Wojciech
  • We also have a new encoding benchmark, designed around what is used in the Bun runtime, thus validating that we have competitive encoding speed.

References:

@lemire lemire changed the title Base64 part2 Base64 part2 : base64url, UTF-8 inputs and base64_to_binary_safe Mar 30, 2024
@lemire lemire requested review from WojciechMula and anonrig March 30, 2024 04:01
@lemire
Copy link
Member Author

lemire commented Mar 30, 2024

@bakkot @Jarred-Sumner : comments and review invited. This will be part of a new major release.

@lemire lemire mentioned this pull request Mar 30, 2024
@bakkot
Copy link

bakkot commented Mar 30, 2024

Excellent! A few minor things at a quick glance:

  • The value of the count field in the OUTPUT_BUFFER_TOO_SMALL case is not documented. From the sample code it appears to be the number of characters of input actually read (or equivalently the index of the next character to read), which is what I'd expect, but it would be good to write it down. The sample code says "we decoded r.count base64 bytes", which is misleading in the utf16 case (or I have misunderstood it): hopefully this is the number of input characters, which might be bytes but also might be 16-bit code units.
  • It's not documented (that I can see) what the state of the buffer is in the error cases. In all cases I believe it's "as many bytes are written as possible prior to the chunk of 4 characters which caused the error", but it would be nice to write this down.
  • One place says that "When the error is BASE64_INPUT_REMAINDER, then r.count contains the number of bytes decoded." but another place says the return value holds "a result pair struct [...] with an error code and either position of the error (in the input in 16-bit units) if any [...]" These are contradictory; in the BASE64_INPUT_REMAINDER case the count holds not a position in the input but instead a count of bytes written to the output.
  • Typo: "a singler remainder character" -> "a single remainder character" (this was preexisting but is now in a few more places).

@lemire
Copy link
Member Author

lemire commented Mar 30, 2024

@bakkot Thank you for the comments. I will apply fixes soon.

@lemire
Copy link
Member Author

lemire commented Mar 30, 2024

@bakkot Thanks. I have answered your concerns in a later commit. A BASE64_INPUT_REMAINDER is not considered an error in the sense that you can continue the processing.

When an invalid character is encountered, the error is considered fatal (I made this clear now). This means that the 'state' is purposefully undocumented: you are not expected to rely on the state.

In the example provide, it was indeed input bytes, but you are correct that in the general case, it is in terms of units (which can be 8-bit or 16-bit units).

I have temporarily disabled the RVV tests since they fail to even start due to setup failures.

simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output) const noexcept {
return compress_decode_base64(output, input, length);
simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output, base64_options options) const noexcept {
return (options & base64_url) ? compress_decode_base64<true>(output, input, length, options) : compress_decode_base64<false>(output, input, length, options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the conditional and directly pass it to the template argument of the function. Removal of a branch is a good :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is indeed a runtime branch here, and it is not free, but pushing it down might not make disappear. A different option would be to have distinct functions for base64url and regular base64, but I thought it was not very nice from an API point of view.

size_t input_index = safe_input;
while(offset > 0 && input_index > 0) {
chartype c = input[--input_index];
if(c == '=' || c == '\n' || c == '\r' || c == '\t' || c == ' ') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is common we can just create a table for it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code could be faster and simpler, but I am adding a comment to explain why it appears unoptimized:

  // offset is a value that is no larger than 3. We backtrack
  // by up to offset characters + an undetermined number of
  // white space characters. It is expected that the next loop
  // runs at most 3 times + the number of white space characters
  // in between them, so we are not worried about performance.

Copy link
Collaborator

@WojciechMula WojciechMula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Haven't spotted anything suspicious.

// base64_options are used to specify the base64 encoding options.
using base64_options = uint64_t;
enum : base64_options {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum class maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I am open to using an enum class but I am a bit concerned but I really want this to be a bitset that we can extend later to support different options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I verified that enum classes do not work well as bitsets:

enum class Joe : uint64_t {
    default_base64 = 0,
    url_base64 = 1,
    allow_spaces = 4,
    allow_padding = 8,
};
int main() {
    Joe t = Joe::default_base64 | Joe:: allow_padding; // Will not compile
}

The idea here is to be able to extend the API without too much of a mess by allowing 'options' if people have a great need for them. Like we could disable white spaces in a later version. Enum classes makes this more difficult than need be.

Granted, they are safer but we could validate the values if we are concerned.

@lemire
Copy link
Member Author

lemire commented Apr 1, 2024

Thanks for the great reviews. Merging.

@lemire lemire merged commit 420b161 into master Apr 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants