Generalise Digest.{to,from}_hex to String.{to,from}_hex by dra27 · Pull Request #9446 · ocaml/ocaml

dra27 · 2020-04-14T16:48:22Z

There's a lovely specialised function in Digest for printing 16-byte strings as hexadecimal. This PR generalises that function in String (and updates Digest to use it).

I'm not sure what the policy (or if there is one) for whether equivalent functions should be added in Bytes.

gasche · 2020-04-14T17:04:23Z

Would it make sense to have it in Char instead?

val to_hex : t -> t * t
val from_hex : t * t -> t

dra27 · 2020-04-14T17:14:39Z

Instead, or do you mean as well?

gasche · 2020-04-14T17:26:15Z

I meant "instead": hopefully it is very easy to write the String versions from the Char versions, or the Bytes versions (or the Seq versions, etc.), and it solves the discussion of which one to include ("none of them" is now a reasonable option). But if you strongly believe that the convenience of having them in String (or in Bytes, etc.) is important, please feel free to keep proposing them.

dra27 · 2020-04-14T18:26:41Z

We don't have a version of map where the mapping function is char -> string so it's not convenient (and the implementation of that function is not trivial, which I think is why it's missing).

dra27 · 2020-04-15T09:33:22Z

I had a look through the PRs related to functions which are present in one module and not in the other (and, in the spirit of "fix something when you add something", I'll do a follow-up PR as in at least one PR the missing functions were mentioned as for a follow-up PR).

My conclusion was that functions should be in both for consistency, and that cross-domain functions (i.e. functions in Bytes taking string and functions in String taking bytes - cf. functions like Bytes.blit_string) should only be included if they would clearly be used a lot or are non-trivial to construct without copying (neither of which applies to the new functions in this PR)

bschommer · 2020-04-15T11:19:53Z

Another Function that might be worth copying to String is the check_suffix function of the Filename module. The function is actually only working with strings and I just recently found out that Filename.check_suffix can be used for this.

dbuenzli

Personally I think adding these functions should trigger a more forward looking design for String and Char. Namely I would prefer if these were at least put in to an Ascii submodule to indicate that they interpret bytes according to US-ASCII.

An example of such a design for String can be found here and for Char, here.

If people like that design I would be willing to take the time upstream it. Personally I still think US-ASCII is pervasives enough and takes you long way thanks to the UTF-8 compatibility property so that it warants the support provided by these modules in the stdlib.

Many of the functions provided in this Char.Ascii module I mention also map to C equivalents which are often find to be used by standards.

stdlib/bytesLabels.mli

dbuenzli · 2020-04-15T12:41:39Z

stdlib/bytesLabels.mli

+(** Convert a hexadecimal representation back into the corresponding byte
+    sequence.
+
+    @raise Invalid_argument if the argument is not an even number of hexadecimal


This is a parsing function. A parsing function is expected to fail, it should not raise Invalid_argument, at most Failure but I propose this signature instead, which is more useful.

I don't agree - parse_hex is a parsing function; from_hex is a conversion function. I agree that Failure is more appropriate than Invalid_argument, but I'm not convinced that it would be sensible to have Digest.from_hex raise Invalid_argument with these new ones raising Failure.

Quite happy to add parse_hex/from_hex_opt if wanted (my original purpose was this was to expose a function we already had)

I don't think the distinction you are trying to make is useful in practice. Basically if you read hex from a string to convert it into bytes you should expect it may fail.

Besides if new functions are added to the stdlib it would be good to at least make them useful. What you are proposing is terrible for error reporting and basically entails that if one wants to do good error reporting you will have to rewrite the functionality.

If I've just lexed a hexadecimal string, then there's no way that converting it is going to fail, and error handling is a bit-rotting waste of effort - the failure mode has already been dealt with by the lexer. If the only API is the option/result returning one, then there's at least a pointless match or call to Option.get, Result.get_ok with the same overall effect that if there's a bizarre programmer error (like I suddenly start parsing z as a valid hex character) then I still get Invalid_argument just from a different function. There's usually a case for both functions, but if you're only writing one then it should be the exception raising version.

If I've just lexed a hexadecimal string, then there's no way that converting it is going to fail, and error handling is a bit-rotting waste of effort

I'm not sure I understand you if you just parsed an hexadecimal representation to bytes (of_hex) then the conversion back to an hexadecimal representation (to_hex) cannot fail.

If you meant that once you converted to hexadecimal (to_hex) you want to immediately get back to bytes (of_hex) then I argue that's not what you are going to do in 99% of uses of the function.

For the remaining 1% composing with Result.get_ok is entirely sufficient and effortless. There's no need to clutter the API with multiple versions of the same function.

What @dra27 means is: if you use an ocamllex-generated lexer to input an even-length string of hexadecimal digits, then pass it to of_hex, you know for sure that it won't fail so the call to Result.get_ok is just noise.

alainfrisch · 2020-04-16T13:39:27Z

stdlib/bytes.ml

+  in
+  let byte i =
+    try digit (unsafe_get b i) lsl 4 + digit (unsafe_get b (i + 1))
+    with E -> raise (Invalid i)


You are returning the offset in the resulting string; isn't it more useful/natural to report the location error in the input? (That's how I read the docstring anyway.) That said, I'm not so sure that returning any location information is that useful in practice.

In terms of implementation, you could raise the external exception directly in the digit sub-function (passing it the offset).

Oops - the offset is wrong. However, I'd be happy to make it return 'a option too

alainfrisch · 2020-04-16T13:51:52Z

stdlib/bytes.mli

+    @since 4.11.0
+*)
+
+val from_hex : bytes -> bytes


Conversion functions are more often called of_XXX than from_XXX in the stdlib (I know, Digest.from_hex...).

That's a good point - I'll change that (it's OK that Digest.from_hex is older)

dra27 · 2020-04-16T15:58:25Z

Revised to use 'a option instead of ('a, int) result and so renamed the additional functions to _opt as in other parts of the stdlib. Assuming it still builds, this is probably me done, if anyone's willing to approve?

gasche · 2020-04-16T16:13:53Z

You didn't give a particular rationale for why we want these functions in (examples of situations that are not Digest and where it is important to have them, and where a specialized library isn't already doing the job fine).

Adding stdlib function has costs -- for example the cost of making think about whether we want to provide a compatibility wrapper for this function in our favorite compatibility project -- which should be offset by clear benefits.

(Thanks for adding tests! I will maliciously point out that the test is checking exactly the Char functions I had suggested)

After thinking more about the API, I think it is okay now. It's not perfect, but it fits the general style of the String/Bytes APIs (which are already not so nice). For example the Ascii submodule is a substantial improvement but it doesn't fit the current design. Regarding my Char suggestion, I think it's cute but I suspect that in practice you want to use that in systems-y settings where you either have a String or a Bytes available (and not a Char.t Seq.t, a rope, finger-tree or whatever).

dra27 · 2020-04-16T16:26:13Z

Ah, I didn't give that because I didn't particularly view this as adding a function - it was already there, partially exposed in another module! The rationale was printing binary strings - exactly what the original function is doing. Generalising this was considerably easier than adding a flag to Printf to support printing hex strings (if there's even a spare symbol available?).

I (badly) maintain a library which also tracks a part of the stdlib API, and even as one affected by these things, I don't think downstream stdlibs should impede change here.

gasche · 2020-04-16T16:27:57Z

(Personally I find slightly embarrassing, in retrospect, that Digest is in the standard library in the first place (in a way that specifies MD5), so I didn't consider it a rationale.)

dra27 · 2020-04-16T16:28:57Z

I don't disagree - but given that it's there, I find it similarly "embarrassing" to be able to say that we can print 16 byte hex-strings with the stdlib and nothing else!

stdlib/bytesLabels.mli

dra27 · 2020-04-20T15:18:31Z

Ping for whether there's approval for this one?

dbuenzli · 2020-04-20T16:10:04Z

I'm repeating myself but I strongly suggest not to return an option but a result value with the offending index in case of error (see here) and since @alainfrisch is fine with it, I would also like to suggest we drop the raising one.

@gasche mentions adding functions to the stdlib has a cost. Given that cost I think it is thus quite important to make sure what is being added is actually useful. As a stdlib user I won't be able to use the functions of_hex proposed in this PR because they prevent me from building user friendly error messages.

I'm also a bit surprised that design discussions are swept under the carpet by @gasche using this as an argument:

After thinking more about the API, I think it is okay now. It's not perfect, but it fits the general style of the String/Bytes APIs (which are already not so nice).

Let's continue to add things that are not so nice ? Let's not try to think about what would be a nice and useful stdlib ?

damiendoligez · 2020-09-28T09:20:48Z

stdlib/bytes.mli

+(** Return the given byte sequence as a byte sequence of US-ASCII lowercase
+    hexadecimal digits.


I find it strange to mention US-ASCII here. You could equally well write sequence of UTF-8 lowercase hexadecimal digits. You should mention both or maybe neither, since these two are all the reasonable encodings that exist.

You could equally well write sequence of UTF-8 lowercase hexadecimal digits

This would be confusing.

First it's always better not to confuse code points and encodings, so something like a sequence of UTF-8 encoded Unicode lowercase hexadecimal digits should rather be written. But that would be, depending on your perspective, ambiguous or wrong.

Unicode has two properties to distinguish hexdecimal digits: Hex_Digit and ASCII_Hex_Digit. They do not coincide and for example U+FF45 has both the Lowercase property and Hex_Digit property. This function however does not handle this character.

I think the current wording is a precise and accurate description of the function and is consistent with other doc strings that work on US-ASCII (e.g. ascii_lowercase). The docs should rather educate people on the fact that US-ASCII code points are compatible with Unicode code points and that the US-ASCII encoding and UTF-8 encoding of these code points are equal.

damiendoligez · 2020-09-28T09:29:20Z

stdlib/bytesLabels.mli

+(** Convert a hexadecimal representation back into the corresponding byte
+    sequence.
+
+    @raise Invalid_argument if the argument is not an even number of hexadecimal


What @dra27 means is: if you use an ocamllex-generated lexer to input an even-length string of hexadecimal digits, then pass it to of_hex, you know for sure that it won't fail so the call to Result.get_ok is just noise.

dra27 · 2020-11-25T10:11:50Z

Doesn't look like this is ever to get over the line, so closing!

dra27 added the stdlib label Apr 14, 2020

dra27 force-pushed the to_hex branch from fd00d8e to cad450c Compare April 15, 2020 09:22

dra27 mentioned this pull request Apr 15, 2020

Synchronise functions in Bytes and String #9448

Merged

dbuenzli reviewed Apr 15, 2020

View reviewed changes

alainfrisch reviewed Apr 16, 2020

View reviewed changes

Generalise Digest.{to,from}_hex to Bytes/String

588107f

dra27 force-pushed the to_hex branch 2 times, most recently from 91aa33e to 38926cd Compare April 16, 2020 15:56

gasche mentioned this pull request Apr 16, 2020

Meta-issue on the semantics of the Bytes/String divide (representation, onwership-safe functions, optimizers and all that) #9451

Closed

alainfrisch reviewed Apr 16, 2020

View reviewed changes

stdlib/bytesLabels.mli Outdated Show resolved Hide resolved

dra27 force-pushed the to_hex branch from 38926cd to f2dd44d Compare April 16, 2020 19:31

dra27 force-pushed the to_hex branch from f2dd44d to 588107f Compare May 1, 2020 10:09

bschommer mentioned this pull request May 4, 2020

String suffix and prefix functions. #9533

Merged

gasche mentioned this pull request May 13, 2020

A format specifier for bytes #6429

Open

damiendoligez reviewed Sep 28, 2020

View reviewed changes

dra27 closed this Nov 25, 2020

dra27 deleted the to_hex branch July 6, 2021 15:09

dra27 mentioned this pull request Jun 14, 2022

add hexadecimal conversions for strings #11320

Closed

		(** Return the given byte sequence as a byte sequence of US-ASCII lowercase
		hexadecimal digits.

Conversation

dra27 commented Apr 14, 2020

Uh oh!

gasche commented Apr 14, 2020

Uh oh!

dra27 commented Apr 14, 2020

Uh oh!

gasche commented Apr 14, 2020

Uh oh!

dra27 commented Apr 14, 2020

Uh oh!

dra27 commented Apr 15, 2020

Uh oh!

bschommer commented Apr 15, 2020

Uh oh!

dbuenzli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dra27 commented Apr 16, 2020

Uh oh!

gasche commented Apr 16, 2020

Uh oh!

dra27 commented Apr 16, 2020

Uh oh!

gasche commented Apr 16, 2020

Uh oh!

dra27 commented Apr 16, 2020

Uh oh!

Uh oh!

dra27 commented Apr 20, 2020

Uh oh!

dbuenzli commented Apr 20, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dra27 commented Nov 25, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants