Base64 part2 : base64url, UTF-8 inputs and base64_to_binary_safe #382

lemire · 2024-03-30T03:39:45Z

Adds base64url support in addition to regular base64 support. In both cases, we support arbitrary ASCII spaces within the base64 inputs, with full validation (as per the WHATWG forgiving-base64 standard).
Adds support for decoding base64 from char16_t inputs (native order), with both base64url and regular base64 support. This might be convenient when the input comes as UTF-16. Currently we only support native order, but we have fast (accelerated) functions to change the endianness if needed.
Adds a new base64_to_binary_safe function, which allows you to specify the size of the output buffer when decoding base64. This was an idea put forth by Wojciech.
We also have a new encoding benchmark, designed around what is used in the Bun runtime, which validates that we have competitive encoding speed.
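
For readers unfamiliar with the two alphabets, here is a minimal scalar sketch of forgiving decoding with an optional base64url alphabet. It is not the simdutf implementation: the names `sketch_base64_default`, `sketch_base64_url`, `sketch_sextet`, `sketch_is_ascii_space` and `sketch_forgiving_decode` are hypothetical, and applying the WHATWG forgiving-base64 padding rule to base64url as well is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical option flags for this sketch; not the simdutf definitions.
enum : uint64_t { sketch_base64_default = 0, sketch_base64_url = 1 };

// Map one character to its 6-bit value, or -1 if it is outside the alphabet.
// The two alphabets differ only in the characters used for 62 and 63.
inline int sketch_sextet(char c, bool url) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == (url ? '-' : '+')) return 62;
  if (c == (url ? '_' : '/')) return 63;
  return -1;
}

// ASCII whitespace as defined by the WHATWG Infra standard.
inline bool sketch_is_ascii_space(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

// Returns false on invalid input; on success, 'out' holds the decoded bytes.
inline bool sketch_forgiving_decode(const std::string &in, uint64_t options,
                                    std::string &out) {
  const bool url = (options & sketch_base64_url) != 0;
  // Step 1: drop all ASCII whitespace.
  std::string data;
  for (char c : in) {
    if (!sketch_is_ascii_space(c)) data.push_back(c);
  }
  // Step 2: if the length is a multiple of 4, strip up to two trailing '='.
  if (data.size() % 4 == 0) {
    for (int i = 0; i < 2 && !data.empty() && data.back() == '='; i++) {
      data.pop_back();
    }
  }
  // Step 3: a single leftover character cannot encode a full byte.
  if (data.size() % 4 == 1) return false;
  // Step 4: accumulate 6 bits per character and emit bytes as they complete;
  // leftover bits at the end are discarded.
  out.clear();
  uint32_t buffer = 0;
  int bits = 0;
  for (char c : data) {
    const int v = sketch_sextet(c, url);
    if (v < 0) return false;
    buffer = (buffer << 6) | static_cast<uint32_t>(v);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
    }
  }
  return true;
}
```

With this sketch, "SGVs bG8=" decodes to "Hello" under the default alphabet, while "-_8=" is accepted only when the base64url flag is set.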


lemire · 2024-03-30T04:04:55Z

@bakkot @Jarred-Sumner : comments and review invited. This will be part of a new major release.

bakkot · 2024-03-30T06:05:02Z

Excellent! A few minor things at a quick glance:

The value of the count field in the OUTPUT_BUFFER_TOO_SMALL case is not documented. From the sample code it appears to be the number of characters of input actually read (or equivalently the index of the next character to read), which is what I'd expect, but it would be good to write it down. The sample code says "we decoded r.count base64 bytes", which is misleading in the utf16 case (or I have misunderstood it): hopefully this is the number of input characters, which might be bytes but also might be 16-bit code units.
It's not documented (that I can see) what the state of the buffer is in the error cases. In all cases I believe it's "as many bytes are written as possible prior to the chunk of 4 characters which caused the error", but it would be nice to write this down.
One place says that "When the error is BASE64_INPUT_REMAINDER, then r.count contains the number of bytes decoded." but another place says the return value holds "a result pair struct [...] with an error code and either position of the error (in the input in 16-bit units) if any [...]" These are contradictory; in the BASE64_INPUT_REMAINDER case the count holds not a position in the input but instead a count of bytes written to the output.
Typo: "a singler remainder character" -> "a single remainder character" (this was preexisting but is now in a few more places).

lemire · 2024-03-30T15:19:04Z

@bakkot Thank you for the comments. I will apply fixes soon.

lemire · 2024-03-30T15:51:13Z

@bakkot Thanks. I have answered your concerns in a later commit. A BASE64_INPUT_REMAINDER is not considered an error in the sense that you can continue the processing.

When an invalid character is encountered, the error is considered fatal (I made this clear now). This means that the 'state' is purposefully undocumented: you are not expected to rely on the state.

In the example provided, it was indeed input bytes, but you are correct that in the general case, it is in terms of units (which can be 8-bit or 16-bit units).
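
To make the "units, not bytes" point concrete, here is a simplified sketch of a bounded decoder whose `count` is a position in the input measured in units of the input character type. Everything in it is hypothetical (`sketch_error`, `sketch_result`, `sketch_decode_bounded`), it is not the actual `base64_to_binary_safe` code, and it assumes unpadded input without whitespace whose length is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical error codes and result pair for this sketch.
enum class sketch_error { success, invalid_character, output_buffer_too_small };

struct sketch_result {
  sketch_error error;
  size_t count; // position in the input, in units of char_type
};

// Regular base64 alphabet; works for char and char16_t alike.
template <typename char_type> int sketch_sextet(char_type c) {
  if (c >= 'A' && c <= 'Z') return int(c - 'A');
  if (c >= 'a' && c <= 'z') return int(c - 'a') + 26;
  if (c >= '0' && c <= '9') return int(c - '0') + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

// On entry, outlen is the capacity of 'output'; on exit, the bytes written.
template <typename char_type>
sketch_result sketch_decode_bounded(const char_type *input, size_t length,
                                    char *output, size_t &outlen) {
  assert(length % 4 == 0); // simplification: whole groups only
  const size_t capacity = outlen;
  size_t written = 0;
  for (size_t pos = 0; pos < length; pos += 4) {
    uint32_t triple = 0;
    for (size_t i = 0; i < 4; i++) {
      const int v = sketch_sextet(input[pos + i]);
      if (v < 0) {
        outlen = written;
        return {sketch_error::invalid_character, pos + i};
      }
      triple = (triple << 6) | static_cast<uint32_t>(v);
    }
    if (capacity - written < 3) {
      // Report how many input units were consumed so the caller can resume.
      outlen = written;
      return {sketch_error::output_buffer_too_small, pos};
    }
    output[written++] = static_cast<char>((triple >> 16) & 0xFF);
    output[written++] = static_cast<char>((triple >> 8) & 0xFF);
    output[written++] = static_cast<char>(triple & 0xFF);
  }
  outlen = written;
  return {sketch_error::success, length};
}
```

Decoding "TWFuTWFu" into a 3-byte buffer stops with `count == 4` for both `char` and `char16_t` inputs, even though 4 units are 4 bytes in one case and 8 bytes in the other.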

I have temporarily disabled the RVV tests since they fail to even start due to setup failures.

benchmarks/base64/CMakeLists.txt

benchmarks/base64/benchmark_base64.cpp

benchmarks/base64/libbase64_spaces.h

include/simdutf/implementation.h

src/arm64/arm_base64.cpp

anonrig · 2024-03-30T18:36:00Z

src/arm64/implementation.cpp

-simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output) const noexcept {
-  return compress_decode_base64(output, input, length);
+simdutf_warn_unused result implementation::base64_to_binary(const char * input, size_t length, char* output, base64_options options) const noexcept {
+  return (options & base64_url) ? compress_decode_base64<true>(output, input, length, options) : compress_decode_base64<false>(output, input, length, options);


We can remove the conditional and directly pass it to the template argument of the function. Removal of a branch is good :-)

There is indeed a runtime branch here, and it is not free, but pushing it down might not make it disappear. A different option would be to have distinct functions for base64url and regular base64, but I thought it was not very nice from an API point of view.
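
As an illustration of why the branch cannot simply be folded into the template argument: a template argument must be a compile-time constant, so a runtime option has to be tested somewhere. The sketch below uses hypothetical names (`dispatch_options`, `dispatch_url`, `in_alphabet`, `count_alphabet_impl`, `count_alphabet`) and a toy kernel rather than the real decoder; it only shows the pattern of branching once at the entry point so that the inner loop is specialized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical option type and flags for this sketch.
using dispatch_options = uint64_t;
constexpr dispatch_options dispatch_default = 0;
constexpr dispatch_options dispatch_url = 1;

// 'url' is a compile-time constant here, so each instantiation
// tests against a fixed pair of characters.
template <bool url> inline bool in_alphabet(char c) {
  const char c62 = url ? '-' : '+';
  const char c63 = url ? '_' : '/';
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= '0' && c <= '9') || c == c62 || c == c63;
}

// Toy kernel standing in for the decoder: counts alphabet characters.
template <bool url>
size_t count_alphabet_impl(const char *input, size_t length) {
  size_t n = 0;
  for (size_t i = 0; i < length; i++) {
    n += in_alphabet<url>(input[i]) ? 1 : 0;
  }
  return n;
}

// The runtime option is tested exactly once, outside the loop.
inline size_t count_alphabet(const char *input, size_t length,
                             dispatch_options options) {
  assert(input != nullptr || length == 0);
  return (options & dispatch_url) ? count_alphabet_impl<true>(input, length)
                                  : count_alphabet_impl<false>(input, length);
}
```

The alternative mentioned above, two separately named public functions, would move this one branch into the caller's code rather than eliminate it whenever the caller's choice is itself a runtime value.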

src/haswell/avx2_base64.cpp

anonrig · 2024-03-30T18:39:15Z

src/implementation.cpp

+  size_t input_index = safe_input;
+  while(offset > 0 && input_index > 0) {
+    chartype c = input[--input_index];
+    if(c == '=' || c == '\n' || c == '\r' || c == '\t' || c == ' ') {


If this is common we can just create a table for it

This code could be faster and simpler, but I am adding a comment to explain why it appears unoptimized:

// offset is a value that is no larger than 3. We backtrack
// by up to offset characters + an undetermined number of
// white space characters. It is expected that the next loop
// runs at most 3 times + the number of white space characters
// in between them, so we are not worried about performance.
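
A self-contained sketch of the backtracking loop described by that comment follows. Only the loop header and the skip condition come from the diff above; the `else` branch that decrements `offset`, the function name `sketch_backtrack`, and returning the new index are assumptions made to complete the example.

```cpp
#include <cassert>
#include <cstddef>

// Walk backwards from safe_input until 'offset' significant characters
// have been passed, ignoring padding and white space along the way.
// Returns the input index at which decoding could resume.
template <typename chartype>
size_t sketch_backtrack(const chartype *input, size_t safe_input,
                        size_t offset) {
  assert(offset <= 3); // offset is a value that is no larger than 3
  size_t input_index = safe_input;
  while (offset > 0 && input_index > 0) {
    const chartype c = input[--input_index];
    if (c == '=' || c == '\n' || c == '\r' || c == '\t' || c == ' ') {
      // Skipped: padding and white space do not count towards offset.
    } else {
      offset--; // assumption: every other character counts as one step
    }
  }
  return input_index;
}
```

For the input "QUJD RA" with `safe_input == 7` and `offset == 3`, the loop passes 'A' and 'R', skips the space, passes 'D', and stops at index 3: four iterations in total, which matches the "3 times + the number of white space characters" bound.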

src/scalar/base64.h

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

WojciechMula

Looks good to me. Haven't spotted anything suspicious.

WojciechMula · 2024-03-31T15:55:23Z

README.md

+
+// base64_options are used to specify the base64 encoding options.
+using base64_options = uint64_t;
+enum : base64_options {


enum class maybe?

Thanks. I am open to using an enum class, but I am a bit concerned: I really want this to be a bitset that we can extend later to support different options.

Yeah, I verified that enum classes do not work well as bitsets:

enum class Joe : uint64_t {
  default_base64 = 0,
  url_base64 = 1,
  allow_spaces = 4,
  allow_padding = 8,
};
int main() {
  Joe t = Joe::default_base64 | Joe::allow_padding; // Will not compile
}

The idea here is to be able to extend the API without too much of a mess by allowing 'options' if people have a great need for them. For example, we could disable white spaces in a later version. Enum classes make this more difficult than it needs to be.

Granted, they are safer but we could validate the values if we are concerned.
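
The trade-off can be sketched side by side. In the hedged example below, all names (`bitset_options`, `bitset_url`, `bitset_allow_spaces`, `bitset_allow_padding`, `bitset_options_are_valid`, `scoped_options`) are hypothetical, and the extra flags are the illustrative ones from the snippet above, not options that exist in the library. The unscoped enum combines with `|` out of the box and can be validated against a mask of known bits; the scoped enum needs a hand-written `operator|` before the same expression compiles.

```cpp
#include <cassert>
#include <cstdint>

// Unscoped enum over an alias: values promote to uint64_t, so '|' just works.
using bitset_options = uint64_t;
enum : bitset_options {
  bitset_default = 0,
  bitset_url = 1,
  bitset_allow_spaces = 4,  // illustrative flag only
  bitset_allow_padding = 8, // illustrative flag only
};

// Validation compensates for the weaker typing: reject unknown bits.
constexpr bitset_options bitset_known_bits =
    bitset_url | bitset_allow_spaces | bitset_allow_padding;

inline bool bitset_options_are_valid(bitset_options o) {
  return (o & ~bitset_known_bits) == 0;
}

// Scoped enum: safer, but bitwise operators must be supplied by hand.
enum class scoped_options : uint64_t {
  default_base64 = 0,
  url_base64 = 1,
  allow_spaces = 4,
  allow_padding = 8,
};

constexpr scoped_options operator|(scoped_options a, scoped_options b) {
  return static_cast<scoped_options>(static_cast<uint64_t>(a) |
                                     static_cast<uint64_t>(b));
}
```

With the overload in place, `scoped_options::default_base64 | scoped_options::allow_padding` compiles, but every further operator (`&`, `~`, `|=`) would need the same treatment, which is the extra friction discussed above.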

lemire · 2024-04-01T15:16:37Z

Thanks for the great reviews. Merging.

lemire and others added 30 commits March 19, 2024 20:06

trimming some unnecessary code

3a64443

fixing missing rvv implementation

3f9cb0f

completing the base64 implementation.

a9ea1c6

adding ppc64

0f49240

saving

5daa520

saturated.

151aa09

finishing...

b917aa8

various fixes

ca17560

Implemented bun benchmark

94b7dac

Obvious fix.

c35d8df

documentation

1a90f2a

adding libbase64 competitor

bd454ea

more documentation.

bdab72f

base64url (first steps)

65f933b

working through

4aa837d

implemented base64url for ARM.

8dc79aa

documentation.

fe1138f

prototype base64url

5d1d0d5

solved based64url

21717c4

completing the base64 implementation.

c96ac90

adding ppc64

106e18c

saving

d1c9cbc

saturated.

8606798

finishing...

e7eae70

various fixes

9262b4b

Implemented bun benchmark

3444f4e

Obvious fix.

6949b2c

documentation

381945b

adding libbase64 competitor

7b304d3

more documentation.

f51ffdf

lemire and others added 5 commits March 29, 2024 23:38

documentation.

4971bc2

prototype base64url

c729247

solved based64url

e32acc9

Merge branch 'base64_part2' of github.com:simdutf/simdutf into base64…

038ce51

…_part2

fixing a missing func definition (bad signature)

9154818

lemire changed the title ~~Base64 part2~~ Base64 part2 : base64url, UTF-8 inputs and base64_to_binary_safe Mar 30, 2024

lemire requested review from WojciechMula and anonrig March 30, 2024 04:01

lemire mentioned this pull request Mar 30, 2024

create higher level base64 functions #377

Open

no such thing as version 4 of uraimo/run-on-arch-action

fd037f5

lemire mentioned this pull request Mar 30, 2024

Base64 #375

Merged

fixes

0de753a

anonrig reviewed Mar 30, 2024

lemire and others added 9 commits March 30, 2024 15:15

Update benchmarks/base64/benchmark_base64.cpp

ccdf51d

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

Update benchmarks/base64/benchmark_base64.cpp

7ec70f2

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

Update benchmarks/base64/libbase64_spaces.h

18dc616

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

Update include/simdutf/implementation.h

aeb2f5f

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

Update src/haswell/avx2_base64.cpp

e0ce663

Co-authored-by: Yagiz Nizipli <yagiz@nizipli.com>

various minor fixes (linting + comments)

bb9d1fc

adding another comment.

f511d9a

cleaning up the base64 benchmark flags

e2a224f

disabling Ubuntu rvv VLEN=1024 (clang 17) CI due to system failures

5e6a366

WojciechMula approved these changes Mar 31, 2024

adding the option

9a92c54

lemire merged commit 420b161 into master Apr 1, 2024
