@lemire
Member

@lemire lemire commented Sep 3, 2024

This is an alternative to @ronag's PR #554; @ronag should be fully credited for the work.

It is basically @ronag's code with

  1. A proposed name change... the suffix _s is replaced by _safe.
  2. A simplification: we do not include the new function in the kernels, instead we use @ronag's algorithm on top of existing fast functions.
  3. It adds a sanity test.
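
As a rough illustration of point 2, here is a minimal sketch of how a bounded conversion can be layered on top of an unbounded one. It is not the code from this PR: `convert_unbounded` is a hypothetical scalar stand-in for an existing fast routine, and `convert_bounded` is a hypothetical name. The only fact it relies on is that each Latin-1 byte expands to at most two UTF-8 bytes, so converting `capacity / 2` input bytes at a time can never overflow the output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an existing unbounded routine: the caller must
// guarantee that `out` has room for 2 * len bytes. Returns bytes written.
inline std::size_t convert_unbounded(const char* in, std::size_t len, char* out) {
  char* const start = out;
  for (std::size_t i = 0; i < len; ++i) {
    const unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80) {
      *out++ = static_cast<char>(c);                  // ASCII: one byte
    } else {
      *out++ = static_cast<char>(0xC0 | (c >> 6));    // leading byte
      *out++ = static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
    }
  }
  return static_cast<std::size_t>(out - start);
}

// Hypothetical bounded wrapper: never writes more than `cap` bytes and
// never emits half of a two-byte sequence. Returns bytes written.
inline std::size_t convert_bounded(const char* in, std::size_t len,
                                   char* out, std::size_t cap) {
  assert(out != nullptr || cap == 0);
  char* const start = out;
  // Bulk phase: cap / 2 input bytes produce at most cap output bytes.
  for (;;) {
    const std::size_t chunk = std::min(len, cap / 2);
    if (chunk == 0) break;
    const std::size_t written = convert_unbounded(in, chunk, out);
    in += chunk;
    len -= chunk;
    out += written;
    cap -= written;
  }
  // Tail phase: with one byte of room left, only an ASCII byte still fits.
  if (len > 0 && cap == 1 && static_cast<unsigned char>(*in) < 0x80) {
    *out++ = *in;
  }
  return static_cast<std::size_t>(out - start);
}
```

With the four-byte input `c a f 0xE9`, a capacity of 4 yields 3 bytes (the trailing character does not fit), while a capacity of 5 yields all 5 bytes.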

ronag and others added 3 commits August 31, 2024 11:11
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
@lemire lemire requested a review from ronag September 3, 2024 17:30
@lemire
Member Author

lemire commented Sep 3, 2024

As soon as the tests go green, we will merge and release soon after.

* @param utf8_len the maximum output length
* @return the number of written char; 0 if conversion is not possible
*/
simdutf_warn_unused size_t convert_latin1_to_utf8_safe(const char * input, size_t length, char* utf8_output, size_t utf8_len) noexcept;
Collaborator

May I suggest keeping the name but changing the signature to take a std::span as output? That way the type system conveys the meaning. std span is C++20 which may be a no go.

Member Author

> std span is C++20 which may be a no go.

We support C++11. We could add std::span support, but it would have to be optional (so, on top of the existing API). Note that this applies to our whole API. I will add an issue.
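
To make "optional, on top of the existing API" concrete, here is a hedged sketch of what such a layer could look like. None of these names come from the library: `latin1_to_utf8_bounded` is a hypothetical scalar stand-in for the pointer-based function, `SKETCH_HAS_SPAN` is a made-up feature macro, and `convert_with_best_api` is only a demo helper. The span overload is compiled only when the compiler reports C++20, so C++11 users keep the same pointer API.

```cpp
#include <cassert>
#include <cstddef>

#if __cplusplus >= 202002L
#include <span>
#define SKETCH_HAS_SPAN 1
#else
#define SKETCH_HAS_SPAN 0
#endif

// Hypothetical stand-in for the pointer-based API (scalar, for illustration).
// Writes only complete UTF-8 sequences that fit in `capacity` bytes.
inline std::size_t latin1_to_utf8_bounded(const char* input, std::size_t length,
                                          char* output, std::size_t capacity) noexcept {
  assert(output != nullptr || capacity == 0);
  std::size_t written = 0;
  for (std::size_t i = 0; i < length; ++i) {
    const unsigned char c = static_cast<unsigned char>(input[i]);
    const std::size_t needed = (c < 0x80) ? 1 : 2;
    if (capacity - written < needed) break;  // next character would not fit
    if (needed == 1) {
      output[written++] = static_cast<char>(c);
    } else {
      output[written++] = static_cast<char>(0xC0 | (c >> 6));
      output[written++] = static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return written;
}

#if SKETCH_HAS_SPAN
// Optional C++20 overload: the output span carries its own capacity, so the
// separate length parameter disappears. It only forwards to the pointer API.
inline std::size_t latin1_to_utf8_bounded(std::span<const char> input,
                                          std::span<char> output) noexcept {
  return latin1_to_utf8_bounded(input.data(), input.size(),
                                output.data(), output.size());
}
#endif

// Demo helper: uses the span overload when it exists, the pointer API otherwise.
inline std::size_t convert_with_best_api(const char* in, std::size_t n,
                                         char* out, std::size_t cap) noexcept {
#if SKETCH_HAS_SPAN
  return latin1_to_utf8_bounded(std::span<const char>(in, n),
                                std::span<char>(out, cap));
#else
  return latin1_to_utf8_bounded(in, n, out, cap);
#endif
}
```

Because the overload is a thin forwarding wrapper, both entry points share one implementation and the C++11 build is unaffected.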

Member Author

Done: #557

@pauldreik
Collaborator

I added a fuzzer for this; it found nothing. I will make a PR later. In the meantime, it can be reviewed here: pauldreik@7bf34b6

@lemire
Member Author

lemire commented Sep 3, 2024

@pauldreik Fantastic!!!

@lemire
Member Author

lemire commented Sep 3, 2024

I am merging this, but we will wait for the fuzzer before releasing.

@lemire lemire merged commit 474bbea into master Sep 3, 2024
@lemire
Member Author

lemire commented Sep 3, 2024

@ronag Note that the PR is merged with your credit:

[Image: Screenshot 2024-09-03 at 4 03 12 PM]

@pauldreik
Collaborator

There seems to be potential for improvement or optimization based on the coverage from the fuzzer (or possibly the fuzzer is not able to generate all possible interesting inputs).
This is from https://storage.googleapis.com/oss-fuzz-coverage/simdutf/reports/20240905/linux/src/simdutf/src/scalar/latin1_to_utf8/latin1_to_utf8.h.html

[Image: screenshot from the coverage report linked above]

ping @lemire

@ronag
Collaborator

ronag commented Sep 6, 2024

Not sure what that means? Does it mean that it never gets there? It's a very edge casy condition that needs a very specific state to occur.

@pauldreik
Collaborator

pauldreik commented Sep 6, 2024

> Not sure what that means? Does it mean that it never gets there?

Yes. The light blue (and red) numbers are the hit counts from the fuzz corpus.

> It's a very edge casy condition that needs a very specific state to occur.

Can you provide me with an example of input data that gets there?

@lemire
Member Author

lemire commented Sep 6, 2024

@pauldreik Hmmmm.... What about the single byte 0x80 with utf8_len equal to 1?
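
The suggested input can be written down as a small check. This is only a sketch under assumptions: `latin1_to_utf8_bounded` is a hypothetical scalar stand-in, not the library's code, and it assumes the bounded function stops before any character that does not fit completely. The Latin-1 byte 0x80 encodes to the two UTF-8 bytes 0xC2 0x80, so with one byte of output room nothing can be written.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in for a bounded Latin-1 to UTF-8 conversion.
// Returns the number of bytes written, never more than `capacity`.
inline std::size_t latin1_to_utf8_bounded(const char* input, std::size_t length,
                                          char* output, std::size_t capacity) {
  assert(output != nullptr || capacity == 0);
  std::size_t written = 0;
  for (std::size_t i = 0; i < length; ++i) {
    const unsigned char c = static_cast<unsigned char>(input[i]);
    const std::size_t needed = (c < 0x80) ? 1 : 2;
    if (capacity - written < needed) break;  // the edge case: no room left
    if (needed == 1) {
      output[written++] = static_cast<char>(c);
    } else {
      output[written++] = static_cast<char>(0xC0 | (c >> 6));
      output[written++] = static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return written;
}

// The input from the comment above: one byte, 0x80, with `capacity` bytes of room.
inline std::size_t convert_single_0x80(char* output, std::size_t capacity) {
  const char input[1] = {static_cast<char>(0x80)};
  return latin1_to_utf8_bounded(input, 1, output, capacity);
}
```

In this sketch, `convert_single_0x80(buffer, 1)` returns 0 and leaves the buffer untouched, while a capacity of 2 yields the bytes 0xC2 0x80. A fuzzer corpus would need an input of this shape, a non-ASCII byte arriving when exactly one output byte remains, to reach that branch.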
