Skip to content

Generalise Digest.{to,from}_hex to String.{to,from}_hex#9446

Closed
dra27 wants to merge 1 commit intoocaml:trunkfrom
dra27:to_hex
Closed

Generalise Digest.{to,from}_hex to String.{to,from}_hex#9446
dra27 wants to merge 1 commit intoocaml:trunkfrom
dra27:to_hex

Conversation

@dra27
Copy link
Copy Markdown
Member

@dra27 dra27 commented Apr 14, 2020

There's a lovely specialised function in Digest for printing 16-byte strings as hexadecimal. This PR generalises that function in String (and updates Digest to use it).

I'm not sure what the policy (or if there is one) for whether equivalent functions should be added in Bytes.

@dra27 dra27 added the stdlib label Apr 14, 2020
@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 14, 2020

Would it make sense to have it in Char instead?

val to_hex : t -> t * t
val from_hex : t * t -> t

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 14, 2020

Instead, or do you mean as well?

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 14, 2020

I meant "instead": hopefully it is very easy to write the String versions from the Char versions, or the Bytes versions (or the Seq versions, etc.), and it solves the discussion of which one to include ("none of them" is now a reasonable option). But if you strongly believe that the convenience of having them in String (or in Bytes, etc.) is important, please feel free to keep proposing them.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 14, 2020

We don't have a version of map where the mapping function is char -> string so it's not convenient (and the implementation of that function is not trivial, which I think is why it's missing).

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 15, 2020

I had a look through the PRs related to functions which are present in one module and not in the other (and, in the spirit of "fix something when you add something", I'll do a follow-up PR as in at least one PR the missing functions were mentioned as for a follow-up PR).

My conclusion was that functions should be in both for consistency, and that cross-domain functions (i.e. functions in Bytes taking string and functions in String taking bytes - cf. functions like Bytes.blit_string) should only be included if they would clearly be used a lot or are non-trivial to construct without copying (neither of which applies to the new functions in this PR)

@bschommer
Copy link
Copy Markdown
Contributor

Another Function that might be worth copying to String is the check_suffix function of the Filename module. The function is actually only working with strings and I just recently found out that Filename.check_suffix can be used for this.

Copy link
Copy Markdown
Contributor

@dbuenzli dbuenzli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think adding these functions should trigger a more forward looking design for String and Char. Namely I would prefer if these were at least put in to an Ascii submodule to indicate that they interpret bytes according to US-ASCII.

An example of such a design for String can be found here and for Char, here.

If people like that design I would be willing to take the time upstream it. Personally I still think US-ASCII is pervasives enough and takes you long way thanks to the UTF-8 compatibility property so that it warants the support provided by these modules in the stdlib.

Many of the functions provided in this Char.Ascii module I mention also map to C equivalents which are often find to be used by standards.

(** Convert a hexadecimal representation back into the corresponding byte
sequence.

@raise Invalid_argument if the argument is not an even number of hexadecimal
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a parsing function. A parsing function is expected to fail, it should not raise Invalid_argument, at most Failure but I propose this signature instead, which is more useful.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree - parse_hex is a parsing function; from_hex is a conversion function. I agree that Failure is more appropriate than Invalid_argument, but I'm not convinced that it would be sensible to have Digest.from_hex raise Invalid_argument with these new ones raising Failure.

Quite happy to add parse_hex/from_hex_opt if wanted (my original purpose was this was to expose a function we already had)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the distinction you are trying to make is useful in practice. Basically if you read hex from a string to convert it into bytes you should expect it may fail.

Besides if new functions are added to the stdlib it would be good to at least make them useful. What you are proposing is terrible for error reporting and basically entails that if one wants to do good error reporting you will have to rewrite the functionality.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I've just lexed a hexadecimal string, then there's no way that converting it is going to fail, and error handling is a bit-rotting waste of effort - the failure mode has already been dealt with by the lexer. If the only API is the option/result returning one, then there's at least a pointless match or call to Option.get, Result.get_ok with the same overall effect that if there's a bizarre programmer error (like I suddenly start parsing z as a valid hex character) then I still get Invalid_argument just from a different function. There's usually a case for both functions, but if you're only writing one then it should be the exception raising version.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I've just lexed a hexadecimal string, then there's no way that converting it is going to fail, and error handling is a bit-rotting waste of effort

I'm not sure I understand you if you just parsed an hexadecimal representation to bytes (of_hex) then the conversion back to an hexadecimal representation (to_hex) cannot fail.

If you meant that once you converted to hexadecimal (to_hex) you want to immediately get back to bytes (of_hex) then I argue that's not what you are going to do in 99% of uses of the function.

For the remaining 1% composing with Result.get_ok is entirely sufficient and effortless. There's no need to clutter the API with multiple versions of the same function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @dra27 means is: if you use an ocamllex-generated lexer to input an even-length string of hexadecimal digits, then pass it to of_hex, you know for sure that it won't fail so the call to Result.get_ok is just noise.

stdlib/bytes.ml Outdated
in
let byte i =
try digit (unsafe_get b i) lsl 4 + digit (unsafe_get b (i + 1))
with E -> raise (Invalid i)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • You are returning the offset in the resulting string; isn't it more useful/natural to report the location error in the input? (That's how I read the docstring anyway.) That said, I'm not so sure that returning any location information is that useful in practice.

  • In terms of implementation, you could raise the external exception directly in the digit sub-function (passing it the offset).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops - the offset is wrong. However, I'd be happy to make it return 'a option too

stdlib/bytes.mli Outdated
@since 4.11.0
*)

val from_hex : bytes -> bytes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversion functions are more often called of_XXX than from_XXX in the stdlib (I know, Digest.from_hex...).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - I'll change that (it's OK that Digest.from_hex is older)

@dra27 dra27 force-pushed the to_hex branch 2 times, most recently from 91aa33e to 38926cd Compare April 16, 2020 15:56
@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 16, 2020

Revised to use 'a option instead of ('a, int) result and so renamed the additional functions to _opt as in other parts of the stdlib. Assuming it still builds, this is probably me done, if anyone's willing to approve?

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 16, 2020

You didn't give a particular rationale for why we want these functions in (examples of situations that are not Digest and where it is important to have them, and where a specialized library isn't already doing the job fine).

Adding stdlib function has costs -- for example the cost of making think about whether we want to provide a compatibility wrapper for this function in our favorite compatibility project -- which should be offset by clear benefits.

(Thanks for adding tests! I will maliciously point out that the test is checking exactly the Char functions I had suggested)

After thinking more about the API, I think it is okay now. It's not perfect, but it fits the general style of the String/Bytes APIs (which are already not so nice). For example the Ascii submodule is a substantial improvement but it doesn't fit the current design. Regarding my Char suggestion, I think it's cute but I suspect that in practice you want to use that in systems-y settings where you either have a String or a Bytes available (and not a Char.t Seq.t, a rope, finger-tree or whatever).

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 16, 2020

Ah, I didn't give that because I didn't particularly view this as adding a function - it was already there, partially exposed in another module! The rationale was printing binary strings - exactly what the original function is doing. Generalising this was considerably easier than adding a flag to Printf to support printing hex strings (if there's even a spare symbol available?).

I (badly) maintain a library which also tracks a part of the stdlib API, and even as one affected by these things, I don't think downstream stdlibs should impede change here.

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 16, 2020

(Personally I find slightly embarrassing, in retrospect, that Digest is in the standard library in the first place (in a way that specifies MD5), so I didn't consider it a rationale.)

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 16, 2020

I don't disagree - but given that it's there, I find it similarly "embarrassing" to be able to say that we can print 16 byte hex-strings with the stdlib and nothing else!

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Apr 20, 2020

Ping for whether there's approval for this one?

@dbuenzli
Copy link
Copy Markdown
Contributor

I'm repeating myself but I strongly suggest not to return an option but a result value with the offending index in case of error (see here) and since @alainfrisch is fine with it, I would also like to suggest we drop the raising one.

@gasche mentions adding functions to the stdlib has a cost. Given that cost I think it is thus quite important to make sure what is being added is actually useful. As a stdlib user I won't be able to use the functions of_hex proposed in this PR because they prevent me from building user friendly error messages.

I'm also a bit surprised that design discussions are swept under the carpet by @gasche using this as an argument:

After thinking more about the API, I think it is okay now. It's not perfect, but it fits the general style of the String/Bytes APIs (which are already not so nice).

Let's continue to add things that are not so nice ? Let's not try to think about what would be a nice and useful stdlib ?

Comment on lines +333 to +334
(** Return the given byte sequence as a byte sequence of US-ASCII lowercase
hexadecimal digits.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange to mention US-ASCII here. You could equally well write sequence of UTF-8 lowercase hexadecimal digits. You should mention both or maybe neither, since these two are all the reasonable encodings that exist.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could equally well write sequence of UTF-8 lowercase hexadecimal digits

This would be confusing.

First it's always better not to confuse code points and encodings, so something like a sequence of UTF-8 encoded Unicode lowercase hexadecimal digits should rather be written. But that would be, depending on your perspective, ambiguous or wrong.

Unicode has two properties to distinguish hexdecimal digits: Hex_Digit and ASCII_Hex_Digit. They do not coincide and for example U+FF45 has both the Lowercase property and Hex_Digit property. This function however does not handle this character.

I think the current wording is a precise and accurate description of the function and is consistent with other doc strings that work on US-ASCII (e.g. ascii_lowercase). The docs should rather educate people on the fact that US-ASCII code points are compatible with Unicode code points and that the US-ASCII encoding and UTF-8 encoding of these code points are equal.

(** Convert a hexadecimal representation back into the corresponding byte
sequence.

@raise Invalid_argument if the argument is not an even number of hexadecimal
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @dra27 means is: if you use an ocamllex-generated lexer to input an even-length string of hexadecimal digits, then pass it to of_hex, you know for sure that it won't fail so the call to Result.get_ok is just noise.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Nov 25, 2020

Doesn't look like this is ever to get over the line, so closing!

@dra27 dra27 closed this Nov 25, 2020
@dra27 dra27 deleted the to_hex branch July 6, 2021 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants