Generalise Digest.{to,from}_hex to String.{to,from}_hex#9446
Generalise Digest.{to,from}_hex to String.{to,from}_hex#9446dra27 wants to merge 1 commit intoocaml:trunkfrom
Conversation
|
Would it make sense to have it in |
|
Instead, or do you mean as well? |
|
I meant "instead": hopefully it is very easy to write the String versions from the Char versions, or the Bytes versions (or the Seq versions, etc.), and it solves the discussion of which one to include ("none of them" is now a reasonable option). But if you strongly believe that the convenience of having them in String (or in Bytes, etc.) is important, please feel free to keep proposing them. |
|
We don't have a version of |
|
I had a look through the PRs related to functions which are present in one module and not in the other (and, in the spirit of "fix something when you add something", I'll do a follow-up PR as in at least one PR the missing functions were mentioned as for a follow-up PR). My conclusion was that functions should be in both for consistency, and that cross-domain functions (i.e. functions in |
|
Another Function that might be worth copying to |
dbuenzli
left a comment
There was a problem hiding this comment.
Personally I think adding these functions should trigger a more forward looking design for String and Char. Namely I would prefer if these were at least put in to an Ascii submodule to indicate that they interpret bytes according to US-ASCII.
An example of such a design for String can be found here and for Char, here.
If people like that design I would be willing to take the time upstream it. Personally I still think US-ASCII is pervasives enough and takes you long way thanks to the UTF-8 compatibility property so that it warants the support provided by these modules in the stdlib.
Many of the functions provided in this Char.Ascii module I mention also map to C equivalents which are often find to be used by standards.
stdlib/bytesLabels.mli
Outdated
| (** Convert a hexadecimal representation back into the corresponding byte | ||
| sequence. | ||
|
|
||
| @raise Invalid_argument if the argument is not an even number of hexadecimal |
There was a problem hiding this comment.
This is a parsing function. A parsing function is expected to fail, it should not raise Invalid_argument, at most Failure but I propose this signature instead, which is more useful.
There was a problem hiding this comment.
I don't agree - parse_hex is a parsing function; from_hex is a conversion function. I agree that Failure is more appropriate than Invalid_argument, but I'm not convinced that it would be sensible to have Digest.from_hex raise Invalid_argument with these new ones raising Failure.
Quite happy to add parse_hex/from_hex_opt if wanted (my original purpose was this was to expose a function we already had)
There was a problem hiding this comment.
I don't think the distinction you are trying to make is useful in practice. Basically if you read hex from a string to convert it into bytes you should expect it may fail.
Besides if new functions are added to the stdlib it would be good to at least make them useful. What you are proposing is terrible for error reporting and basically entails that if one wants to do good error reporting you will have to rewrite the functionality.
There was a problem hiding this comment.
If I've just lexed a hexadecimal string, then there's no way that converting it is going to fail, and error handling is a bit-rotting waste of effort - the failure mode has already been dealt with by the lexer. If the only API is the option/result returning one, then there's at least a pointless match or call to Option.get, Result.get_ok with the same overall effect that if there's a bizarre programmer error (like I suddenly start parsing z as a valid hex character) then I still get Invalid_argument just from a different function. There's usually a case for both functions, but if you're only writing one then it should be the exception raising version.
There was a problem hiding this comment.
If I've just lexed a hexadecimal string, then there's no way that converting it is going to fail, and error handling is a bit-rotting waste of effort
I'm not sure I understand you if you just parsed an hexadecimal representation to bytes (of_hex) then the conversion back to an hexadecimal representation (to_hex) cannot fail.
If you meant that once you converted to hexadecimal (to_hex) you want to immediately get back to bytes (of_hex) then I argue that's not what you are going to do in 99% of uses of the function.
For the remaining 1% composing with Result.get_ok is entirely sufficient and effortless. There's no need to clutter the API with multiple versions of the same function.
There was a problem hiding this comment.
What @dra27 means is: if you use an ocamllex-generated lexer to input an even-length string of hexadecimal digits, then pass it to of_hex, you know for sure that it won't fail so the call to Result.get_ok is just noise.
stdlib/bytes.ml
Outdated
| in | ||
| let byte i = | ||
| try digit (unsafe_get b i) lsl 4 + digit (unsafe_get b (i + 1)) | ||
| with E -> raise (Invalid i) |
There was a problem hiding this comment.
-
You are returning the offset in the resulting string; isn't it more useful/natural to report the location error in the input? (That's how I read the docstring anyway.) That said, I'm not so sure that returning any location information is that useful in practice.
-
In terms of implementation, you could raise the external exception directly in the digit sub-function (passing it the offset).
There was a problem hiding this comment.
Oops - the offset is wrong. However, I'd be happy to make it return 'a option too
stdlib/bytes.mli
Outdated
| @since 4.11.0 | ||
| *) | ||
|
|
||
| val from_hex : bytes -> bytes |
There was a problem hiding this comment.
Conversion functions are more often called of_XXX than from_XXX in the stdlib (I know, Digest.from_hex...).
There was a problem hiding this comment.
That's a good point - I'll change that (it's OK that Digest.from_hex is older)
91aa33e to
38926cd
Compare
|
Revised to use |
|
You didn't give a particular rationale for why we want these functions in (examples of situations that are not Digest and where it is important to have them, and where a specialized library isn't already doing the job fine). Adding stdlib function has costs -- for example the cost of making think about whether we want to provide a compatibility wrapper for this function in our favorite compatibility project -- which should be offset by clear benefits. (Thanks for adding tests! I will maliciously point out that the test is checking exactly the Char functions I had suggested) After thinking more about the API, I think it is okay now. It's not perfect, but it fits the general style of the String/Bytes APIs (which are already not so nice). For example the Ascii submodule is a substantial improvement but it doesn't fit the current design. Regarding my Char suggestion, I think it's cute but I suspect that in practice you want to use that in systems-y settings where you either have a String or a Bytes available (and not a Char.t Seq.t, a rope, finger-tree or whatever). |
|
Ah, I didn't give that because I didn't particularly view this as adding a function - it was already there, partially exposed in another module! The rationale was printing binary strings - exactly what the original function is doing. Generalising this was considerably easier than adding a flag to Printf to support printing hex strings (if there's even a spare symbol available?). I (badly) maintain a library which also tracks a part of the stdlib API, and even as one affected by these things, I don't think downstream stdlibs should impede change here. |
|
(Personally I find slightly embarrassing, in retrospect, that Digest is in the standard library in the first place (in a way that specifies MD5), so I didn't consider it a rationale.) |
|
I don't disagree - but given that it's there, I find it similarly "embarrassing" to be able to say that we can print 16 byte hex-strings with the stdlib and nothing else! |
|
Ping for whether there's approval for this one? |
|
I'm repeating myself but I strongly suggest not to return an option but a result value with the offending index in case of error (see here) and since @alainfrisch is fine with it, I would also like to suggest we drop the raising one. @gasche mentions adding functions to the stdlib has a cost. Given that cost I think it is thus quite important to make sure what is being added is actually useful. As a stdlib user I won't be able to use the functions I'm also a bit surprised that design discussions are swept under the carpet by @gasche using this as an argument:
Let's continue to add things that are not so nice ? Let's not try to think about what would be a nice and useful stdlib ? |
| (** Return the given byte sequence as a byte sequence of US-ASCII lowercase | ||
| hexadecimal digits. |
There was a problem hiding this comment.
I find it strange to mention US-ASCII here. You could equally well write sequence of UTF-8 lowercase hexadecimal digits. You should mention both or maybe neither, since these two are all the reasonable encodings that exist.
There was a problem hiding this comment.
You could equally well write
sequence of UTF-8 lowercase hexadecimal digits
This would be confusing.
First it's always better not to confuse code points and encodings, so something like a sequence of UTF-8 encoded Unicode lowercase hexadecimal digits should rather be written. But that would be, depending on your perspective, ambiguous or wrong.
Unicode has two properties to distinguish hexdecimal digits: Hex_Digit and ASCII_Hex_Digit. They do not coincide and for example U+FF45 has both the Lowercase property and Hex_Digit property. This function however does not handle this character.
I think the current wording is a precise and accurate description of the function and is consistent with other doc strings that work on US-ASCII (e.g. ascii_lowercase). The docs should rather educate people on the fact that US-ASCII code points are compatible with Unicode code points and that the US-ASCII encoding and UTF-8 encoding of these code points are equal.
stdlib/bytesLabels.mli
Outdated
| (** Convert a hexadecimal representation back into the corresponding byte | ||
| sequence. | ||
|
|
||
| @raise Invalid_argument if the argument is not an even number of hexadecimal |
There was a problem hiding this comment.
What @dra27 means is: if you use an ocamllex-generated lexer to input an even-length string of hexadecimal digits, then pass it to of_hex, you know for sure that it won't fail so the call to Result.get_ok is just noise.
|
Doesn't look like this is ever to get over the line, so closing! |
There's a lovely specialised function in
Digestfor printing 16-byte strings as hexadecimal. This PR generalises that function inString(and updatesDigestto use it).I'm not sure what the policy (or if there is one) for whether equivalent functions should be added in
Bytes.