Convert latin1 to utf8 safe #556

lemire · 2024-09-03T17:30:21Z

This is an alternative to @ronag's PR #554; @ronag should be fully credited for the work.

It is basically @ronag's code with:

- A proposed name change: the suffix _s is replaced by _safe.
- A simplification: we do not include the new function in the kernels; instead we use @ronag's algorithm on top of the existing fast functions.
- A sanity test.

Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length. Refs: nodejs/node#54526
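To make the idea concrete, here is a minimal sketch of a bounded conversion layered on an unbounded one. It is not the code from this PR: `fast_latin1_to_utf8`, `scalar_latin1_to_utf8_bounded` and `latin1_to_utf8_bounded` are hypothetical names, the "fast" function is a plain scalar stand-in for an existing optimized routine, and the behavior on truncation (stop before the first character that does not fit, return the bytes written so far) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Stand-in for an existing fast, unbounded converter (hypothetical name).
// The caller must provide room for up to 2 * length output bytes.
std::size_t fast_latin1_to_utf8(const char *input, std::size_t length,
                                char *utf8_output) {
  char *out = utf8_output;
  for (std::size_t i = 0; i < length; i++) {
    const unsigned char byte = static_cast<unsigned char>(input[i]);
    if (byte < 0x80) {
      *out++ = static_cast<char>(byte);  // ASCII maps to one byte
    } else {
      *out++ = static_cast<char>(0xC0 | (byte >> 6));    // leading byte
      *out++ = static_cast<char>(0x80 | (byte & 0x3F));  // continuation byte
    }
  }
  return static_cast<std::size_t>(out - utf8_output);
}

// Scalar bounded conversion: stops before any character that would not fit.
std::size_t scalar_latin1_to_utf8_bounded(const char *input, std::size_t length,
                                          char *utf8_output,
                                          std::size_t utf8_len) {
  std::size_t written = 0;
  for (std::size_t i = 0; i < length; i++) {
    const unsigned char byte = static_cast<unsigned char>(input[i]);
    const std::size_t needed = byte < 0x80 ? 1 : 2;
    if (utf8_len - written < needed) break;  // never split a character
    written += fast_latin1_to_utf8(input + i, 1, utf8_output + written);
  }
  return written;
}

// Bounded conversion layered on the fast path (assumed layering).
std::size_t latin1_to_utf8_bounded(const char *input, std::size_t length,
                                   char *utf8_output, std::size_t utf8_len) {
  assert(input != nullptr || length == 0);
  char *const start = utf8_output;
  while (true) {
    // A chunk of this size cannot overflow even if every byte needs two bytes.
    const std::size_t chunk = std::min(length, utf8_len / 2);
    if (chunk == 0) break;
    const std::size_t written = fast_latin1_to_utf8(input, chunk, utf8_output);
    input += chunk;
    length -= chunk;
    utf8_output += written;
    utf8_len -= written;
  }
  // At most one output byte is left, or the input is exhausted.
  utf8_output +=
      scalar_latin1_to_utf8_bounded(input, length, utf8_output, utf8_len);
  return static_cast<std::size_t>(utf8_output - start);
}
```

The layered version should write exactly the same bytes as the scalar bounded loop on its own, since both greedily convert the longest prefix that fits; the fast path only handles the part where no bounds check per character is needed.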

lemire · 2024-09-03T17:45:44Z

As soon as the tests go green, we will merge, and release soon after.

pauldreik · 2024-09-03T17:51:58Z

README.md

+ * @param utf8_len  the maximum output length
+ * @return the number of written char; 0 if conversion is not possible
+ */
+simdutf_warn_unused size_t convert_latin1_to_utf8_safe(const char * input, size_t length, char* utf8_output, size_t utf8_len) noexcept;


May I suggest keeping the name but changing the signature to take a std::span as output? That way the type system conveys the meaning. std::span is C++20, which may be a no go.

> std span is C++20 which may be a no go.

We support C++11. We could add std::span support, but it would have to be optional (so, on top of the existing API). Note that this applies to our whole API. I will add an issue.

Done : #557
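For illustration, an optional std::span overload on top of a C++11 pointer-based function could look roughly like the sketch below. Everything here is an assumption: `latin1_to_utf8_bounded` and `to_utf8_bounded` are hypothetical names, the pointer-based body is a simple scalar stand-in, and the `__cplusplus` check is just one possible way to gate the overload (some compilers need an extra flag to report the real value).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#if __cplusplus >= 202002L
#include <span>
#endif

// Stand-in for the C++11 pointer-based API (hypothetical name and body).
std::size_t latin1_to_utf8_bounded(const char *input, std::size_t length,
                                   char *utf8_output,
                                   std::size_t utf8_len) noexcept {
  std::size_t written = 0;
  for (std::size_t i = 0; i < length; i++) {
    const unsigned char byte = static_cast<unsigned char>(input[i]);
    const std::size_t needed = byte < 0x80 ? 1 : 2;
    if (utf8_len - written < needed) break;  // the next character does not fit
    if (needed == 1) {
      utf8_output[written++] = static_cast<char>(byte);
    } else {
      utf8_output[written++] = static_cast<char>(0xC0 | (byte >> 6));
      utf8_output[written++] = static_cast<char>(0x80 | (byte & 0x3F));
    }
  }
  return written;
}

#if __cplusplus >= 202002L
// Optional overload, compiled only when std::span is available.
// The output span carries both the pointer and the maximum length.
inline std::size_t latin1_to_utf8_bounded(std::span<const char> input,
                                          std::span<char> utf8_output) noexcept {
  return latin1_to_utf8_bounded(input.data(), input.size(), utf8_output.data(),
                                utf8_output.size());
}
#endif

// Usage: the same helper builds against either API.
std::string to_utf8_bounded(const std::string &latin1, std::size_t capacity) {
  std::string out(capacity, '\0');
#if __cplusplus >= 202002L
  const std::size_t n = latin1_to_utf8_bounded(std::span<const char>(latin1),
                                               std::span<char>(out));
#else
  const std::size_t n =
      latin1_to_utf8_bounded(latin1.data(), latin1.size(), &out[0], out.size());
#endif
  assert(n <= capacity);
  out.resize(n);
  return out;
}
```

Because the overload only forwards to the pointer version, C++11 users lose nothing and the existing API stays the single source of truth.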

pauldreik · 2024-09-03T19:04:55Z

I added a fuzzer for this; it found nothing. I will make a PR later. In the meantime, it can be reviewed here: pauldreik@7bf34b6
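For readers unfamiliar with this kind of fuzzing, a libFuzzer-style harness for a bounded conversion might look like the sketch below. This is not the fuzzer from the linked commit: `convert_bounded` is a hypothetical scalar stand-in for the function under test, and both the input layout (first byte selects the capacity) and the checked properties are assumptions about what is worth verifying.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Stand-in for the function under test (hypothetical); a real harness would
// call the library function instead.
static std::size_t convert_bounded(const char *in, std::size_t len, char *out,
                                   std::size_t cap) {
  assert(in != nullptr || len == 0);
  std::size_t written = 0;
  for (std::size_t i = 0; i < len; i++) {
    const unsigned char byte = static_cast<unsigned char>(in[i]);
    const std::size_t needed = byte < 0x80 ? 1 : 2;
    if (cap - written < needed) break;
    if (needed == 1) {
      out[written++] = static_cast<char>(byte);
    } else {
      out[written++] = static_cast<char>(0xC0 | (byte >> 6));
      out[written++] = static_cast<char>(0x80 | (byte & 0x3F));
    }
  }
  return written;
}

extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t *data,
                                      std::size_t size) {
  if (size == 0) return 0;
  // First byte picks the output capacity; the rest is the Latin-1 input.
  const std::size_t cap = data[0];
  const char *input = reinterpret_cast<const char *>(data + 1);
  const std::size_t length = size - 1;

  // Reference result with worst-case room, so nothing is truncated.
  std::vector<char> full(2 * length + 1);
  const std::size_t full_len =
      convert_bounded(input, length, full.data(), full.size());

  // Bounded result, with one guard byte placed just past the capacity.
  std::vector<char> out(cap + 1, '\x7F');
  const std::size_t written = convert_bounded(input, length, out.data(), cap);

  if (written > cap) std::abort();         // reported more than allowed
  if (out[cap] != '\x7F') std::abort();    // wrote past the limit
  if (written > full_len) std::abort();    // longer than the full conversion
  if (std::memcmp(out.data(), full.data(), written) != 0) std::abort();
  return 0;
}
```

With clang this would typically be built with `-fsanitize=fuzzer,address`, which supplies `main` and reports any out-of-bounds write directly.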

lemire · 2024-09-03T20:02:34Z

@pauldreik Fantastic!!!

lemire · 2024-09-03T20:03:02Z

I am merging this, but we will wait for the fuzzer before releasing.

lemire · 2024-09-03T20:03:40Z

@ronag Note that the PR is merged with your credit:

pauldreik · 2024-09-06T11:07:43Z

Based on the coverage from the fuzzer, there seems to be potential for improvement or optimization (or possibly the fuzzer is not able to generate all possible interesting input).
This comes from the coverage report at https://storage.googleapis.com/oss-fuzz-coverage/simdutf/reports/20240905/linux/src/simdutf/src/scalar/latin1_to_utf8/latin1_to_utf8.h.html

ping @lemire

ronag · 2024-09-06T11:15:52Z

Not sure what that means? Does it mean that it never gets there? It's a very edge casy condition that needs a very specific state to occur.

pauldreik · 2024-09-06T11:42:44Z

> Not sure what that means? Does it mean that it never gets there?

yes. the light blue (and red) numbers are the hit counts from the fuzz corpus.

> It's a very edge casy condition that needs a very specific state to occur.

Can you provide me with an example of input data that gets there?

lemire · 2024-09-06T11:53:43Z

@pauldreik Hmmmm.... What about the single byte 0x80 with utf8_len equal to 1?
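To see why that input is interesting, here is a small sketch that predicts whether a bounded conversion ends on a two-byte character with exactly one output byte left. It assumes that this is the uncovered branch, which the coverage report alone does not confirm here, and `stops_on_two_byte_character` is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when a bounded Latin-1 to UTF-8 conversion would stop at a
// non-ASCII byte while exactly one byte of output space remains.
bool stops_on_two_byte_character(const unsigned char *input, std::size_t length,
                                 std::size_t utf8_len) {
  assert(input != nullptr || length == 0);
  std::size_t remaining = utf8_len;
  for (std::size_t i = 0; i < length; i++) {
    // Bytes below 0x80 take one UTF-8 byte, all others take two.
    const std::size_t needed = input[i] < 0x80 ? 1 : 2;
    if (remaining < needed) {
      return needed == 2 && remaining == 1;
    }
    remaining -= needed;
  }
  return false;  // everything fit
}
```

For the single byte 0x80 with utf8_len equal to 1, the character needs two bytes and only one is available, so the helper returns true and nothing can be written.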

ronag and others added 3 commits August 31, 2024 11:11

feat: convert_latin1_to_utf8_s

eff6b01

Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length. Refs: nodejs/node#54526

Merge branch 'convert_latin1_to_utf8_s' of https://github.com/ronag/s…

2ecf3c4

…imdutf into convert_latin1_to_utf8_s

various fixes

e763173

lemire requested a review from ronag September 3, 2024 17:30

lemire mentioned this pull request Sep 3, 2024

feat: convert_latin1_to_utf8_s #554

Closed

ronag approved these changes Sep 3, 2024

pauldreik reviewed Sep 3, 2024

[no-ci] just some code reformatting

6ede96b

lemire merged commit 474bbea into master Sep 3, 2024
