-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Functions to read/write binary representations of numbers #1864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I strongly support adding these functions to the stdlib.
and define modules |
type _ number_ty = Int16: int number_ty | Int32: int32 number_ty | ...
val Buffer.add_number: ?endian:[`LE|`BE](*default:Native*) -> Buffer.t -> 'a number_ty -> 'a -> unit
|
|
Yes another approach would be to expose only functions using the native endianness, plus: val Int32.Endian.little: int32 -> int32
(** Modify the value, if needed, so that functions such as Bytes.set_int32 produce a little-endian
representation of the integer, independently of the native encoding of the host machine. Similarly,
this function can be applied to the result of functions such as Bytes.get_int32 when reading
a 32-bit integer in little-endian. *)
val Int32.Endian.big: int32 -> int32
(** .... *)(and same for 16-bit and 64-bit) |
|
This looks like a very useful addition. Even if unsafe versions are not exposed, I think it would make (I would also consider adding |
|
In my experience what you need most of the time is the explicit endianness (network, codecs, etc.) and it's less error prone if the native endianness has as an explicit suffix; otherwise you tend to use it without thinking about the issue (or it will be the first one to auto-complete...) and you likely end up with a bug. @alainfrisch I think your last suggestion would make things more confusing than enlighting. Personally I don't mind the number of functions. I think this should be simply: I don't think it's worth adding unsafe versions, people willing to go into unsafe territory can play the primitive game. I don't think it's worth adding the float versions this is an |
I was concerned that this would cause extra boxing of the intermediate int64 if the function (e.g. Buffer.add_int64) is not inlined. To be checked. |
|
Ok, it's seem fine to let the user do that float->int64 mapping (checked only with the "Closure" inliner, but I'm sure flambda will be ok), with default inlining settings. |
|
I also check that the check for Sys.big_endian is eliminated in e.g. For the naming: are _be, _le, _ne good enough? Concerning the scope of this PR:
|
|
One more point: it could be useful to provide both get_int16 (sign extension) and get_uint16 (always returning positive integers). |
|
Agreed on all points. I think it would be good to add the functions in
Yes ! FWIW I have this in an unpublished code of similar functionality. It's the kind of bit fiddling things you prefer to be handled by the library. |
stdlib/bytes.mli
Outdated
|
|
||
| (** {6 Binary storage of numbers} *) | ||
|
|
||
| external get_int16_ne : bytes -> int -> int = "%caml_bytes_get16" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC there's no advantage to expose this as an external at the .mli level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exposing the external in the .mli (and hence in the .cmi) allows every use of this function to compile a call to the primitive directly (no passing through a wrapper function). Same as inlining, but this works even if inlining does not take place (e.g. in bytecode).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an important property, though? Because the cost is that if we ever decide to change the implementation (for example wrapping it into a bound-checking or exception-changing routine) or primitive name, we break the interface, and that seems quite high to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand: is there a way to write code that would break if one changes the signature from external to val ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some people like to copy the interfaces from the stdlib for their own implementations, and they would get bitten by this.
Other people avoid this problem by includeing the interfaces, and they get bitten every time we add a function or even an optional parameter to an existing function.
So I would like to see the externals removed from the .mlis. Bytecode speed is not an issue nowadays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess @gasche was referring to something like:
module F (X : sig
external f : unit -> unit = "%ignore"
end) = struct
end
module OK = F (struct
external f : unit -> unit = "%ignore"
end)
module KO = F (struct
let f () = ()
end)
|
Two comments:
|
I don't understand what this is supposed to bring to the discussion. Would you care to expand maybe ? |
|
|
For comparison purposes, I should point out that I have implemented logic for this purpose in my Orsetto project. Start here for details if you're interested. Accordingly, I have no strong opinion on how this pull request might be evaluated. My unsolicited opinions on the various questions discussed above summarized:
|
val Buffer.add_int64: [`LE|`BE|`NE] -> t -> int64 -> unit
let[@ocaml.inline] add_int64 e b x =
let new_position = b.position + 8 in
if new_position > b.length then resize b 8;
unsafe_set_int64 b.buffer b.position
(match e, Sys.big_endian with
| `LE, true | `BE, false -> swap64 x
| _ -> x);
b.position <- new_positionand thanks to the inline annotation, this gives the same performance than ad hoc functions when endianness is known statically.
|
I think having all the function is in general more predictable. Except for guaranteed statically known checks done at the toplevel (like Personally I don't have a problem with all the additions ( |
|
@alainfrisch no, I mean that you should provide three variations and require the choice made explicitly in the programming interface:
That's what I did, and I think it was the right choice. Another thing I did was [basically] this: Usually, the byte order is known statically. I think if dynamic byte order is really on the menu, then it's best to use a function that selects the right module. |
|
@alainfrisch any progress on that one ? I would like to rely on this for further stdlib improvements. |
|
Remaining points:
|
Reviewed, see my comments. I think we should keep the doc strings simple using the patterns I mentioned. |
658f94e to
ba03e5a
Compare
|
Just one minor pending point about the use of Char.unsafe_chr. Waiting for @dbuenzli feedback on that one. |
dbuenzli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say we are good to go. Thanks for putting the work in this.
|
Thanks for your help with the design and documentation! |
MPR#5771 introduced primitives to read/write binary representations of 16, 32 and 64-bit integers directly to/from (big)strings (and now bytes), which is very useful for any program that needs to serialize numbers efficiently. Those primitives are exposed only in the ocplib-endian library, but I believe most people would expect to find the functionality in the stdlib, and this is easy since the hard work of supporting the primitives in the compiler and runtime system is behind us. Moreover, extending Buffer directly allows for more efficient functions than one could get with an external library.
This PR is an attempt to launch the discussion around the addition of such features. It is far from ready to be merged! Currently, it adds:
bytes(only versions with bound checks are exposed).Buffer.t; also single-precision and double-precision floating point numbers.These function all use the native endianness of the current machine. This might be problematic for some use cases, but perfectly fine for others (and better, performance-wise), and one could consider adding versions with explicit endianness (as ocp-endian does).
Missing:
bytesvalues.stringvalues.The performance benefits of the new functions can be illustrated with the following micro-benchmark:
and its variants using:
The version using the new function runs about 6.5 times faster on my machine.