Functions to read/write binary representations of numbers #1864

alainfrisch · 2018-06-26T21:12:03Z

MPR#5771 introduced primitives to read/write binary representations of 16, 32 and 64-bit integers directly to/from (big)strings (and now bytes), which is very useful for any program that needs to serialize numbers efficiently. Those primitives are exposed only in the ocplib-endian library, but I believe most people would expect to find the functionality in the stdlib, and this is easy since the hard work of supporting the primitives in the compiler and runtime system is behind us. Moreover, extending Buffer directly allows for more efficient functions than one could get with an external library.

This PR is an attempt to launch the discussion around the addition of such features. It is far from ready to be merged! Currently, it adds:

Functions to read/write 16-, 32- and 64-bit "binary" integer to bytes (only versions with bound checks are exposed).
Functions to append 16-, 32- and 64-bit "binary" integer to Buffer.t; also single-precision and double-precision floating point numbers.

These function all use the native endianness of the current machine. This might be problematic for some use cases, but perfectly fine for others (and better, performance-wise), and one could consider adding versions with explicit endianness (as ocp-endian does).

Missing:

Functions to read/write floats to/from bytes values.
Functions to read numbers from string values.
Functions to read/write floats to/from bigarrays of chars.
Variants without bound checks.
Perhaps also functions to read/write from/to channels.

The performance benefits of the new functions can be illustrated with the following micro-benchmark:

let () =
  let a = Float.Array.create 1000 in
  for j = 0 to Float.Array.length a - 1 do
    Float.Array.set a j (float j)
  done;
  let b = Buffer.create 100 in
  for i = 1 to 100000 do
    Buffer.clear b;
    for j = 0 to Float.Array.length a - 1 do
      Buffer.add_float64_bin b (Float.Array.get a j)
    done;
  done

and its variants using:

let add_float64_bin b x =
  let i = Int64.bits_of_float x in
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand i 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*1)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*2)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*3)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*4)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*5)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*6)) 0xffL)));
  Buffer.add_char b (Char.unsafe_chr (Int64.to_int (Int64.logand (Int64.shift_right_logical i (8*7)) 0xffL)))

The version using the new function runs about 6.5 times faster on my machine.

nojb · 2018-06-26T21:33:25Z

I strongly support adding these functions to the stdlib.

I think adding little endian and big endian variants would be quite useful : a very common use-case for these functions is interfacing with binary protocols and/or formats in which case one always needs to use a specific endianness. A suffix _le and _be could be used. Adding the three versions (little endian, big endian, and host endian) may be optimal but a bit excessive.
For the case of bigstrings (1-dimensional bigarrays), I think that adding a dedicated Bigstring module makes sense, containing roughly the union of operations on bytes and Bigarray. See also MPR#7772.
For naming, is the suffix _bin really necessary? What about introducing a signature

module type S = sig
  type t
  val get_int16: t -> int
  val get_int32: t -> int32
  val get_int64: t -> int64
  val set_int16: t -> int -> unit
  val set_int32: t -> int32 -> unit
  val set_int64: t -> int64 -> unit
  (* etc *)
end

and define modules Bytes.BE : S with t := bytes, Bytes.LE : S with t := bytes, same for Buffer (with t := Buffer.t, etc), and so on. Unfortunately, this signature does not work for the case of channels (where the type t is different for get and set functions (maybe splitting the signature in an "input" part and an "output" part would be better to also handle the case of string).

alainfrisch · 2018-06-26T21:57:07Z

I tend to agree that exposing versions with explicit endianness would be good, but I think it's good to keep the version with the native endianness of the machine (which is useful e.g. for inter-process communication on the same machine, or to interact with buffers containing native "C" data). The number of functions is a bit worrying, though, between int16/int32/int64/float32/float64, le/be/native, string/bytes/bigstring/buffer(/channel), possibly safe/unsafe.
An alternative could be to expose fewer functions, taking arguments, and hope/check that the inliner will be good enough to eliminate the overhead. E.g.:

type _ number_ty = Int16: int number_ty | Int32: int32 number_ty | ...
val Buffer.add_number: ?endian:[`LE|`BE](*default:Native*) -> Buffer.t -> 'a number_ty -> 'a -> unit

About the _bin suffix: I had in mind Stdlib.print_int and co, working with the ASCII decimal representation. But perhaps we don't care.
I forgot about input_binary_int/output_binary_int, which we already have: big-endian, 4 bytes, but working with int, not int32.

alainfrisch · 2018-06-27T06:55:18Z

Yes another approach would be to expose only functions using the native endianness, plus:

 val Int32.Endian.little: int32 -> int32
   (** Modify the value, if needed, so that functions such as Bytes.set_int32 produce a little-endian
        representation of the integer, independently of the native encoding of the host machine. Similarly,
        this function can be applied to the result of functions such as Bytes.get_int32 when reading
        a 32-bit integer in little-endian. *)

 val Int32.Endian.big: int32 -> int32
    (** .... *)

(and same for 16-bit and 64-bit)

xclerc · 2018-06-27T09:04:48Z

This looks like a very useful addition.

Even if unsafe versions are not exposed, I think it would make
sense to avoid some bound checks in Buffer.add_xyz (they
are avoided in Buffer.add_char by using Bytes.unsafe_set
rather than Bytes.set).

(I would also consider adding add_int8/get_int8/set_int8.
They would of course be mere aliases to existing functions,
but would allow one to write more uniform code when working
on file formats or network protocols.)

dbuenzli · 2018-06-27T09:10:22Z

In my experience what you need most of the time is the explicit endianness (network, codecs, etc.) and it's less error prone if the native endianness has as an explicit suffix; otherwise you tend to use it without thinking about the issue (or it will be the first one to auto-complete...) and you likely end up with a bug.

@alainfrisch I think your last suggestion would make things more confusing than enlighting. Personally I don't mind the number of functions. I think this should be simply:

{add,get,set}_int16_{ne,be,le} 
{add,get,set}_int32_{ne,be,le}
{add,get,set}_int64_{ne,be,le}

I don't think it's worth adding unsafe versions, people willing to go into unsafe territory can play the primitive game. I don't think it's worth adding the float versions this is an {Int64,Int32}.{bits_of_float,float_of_bits} away.

alainfrisch · 2018-06-27T09:25:52Z

I don't think it's worth adding the float versions this is an {Int64,Int32}.{bits_of_float,float_of_bits} away.

I was concerned that this would cause extra boxing of the intermediate int64 if the function (e.g. Buffer.add_int64) is not inlined. To be checked.

alainfrisch · 2018-06-27T10:06:52Z

Ok, it's seem fine to let the user do that float->int64 mapping (checked only with the "Closure" inliner, but I'm sure flambda will be ok), with default inlining settings.

alainfrisch · 2018-06-27T10:58:45Z

I also check that the check for Sys.big_endian is eliminated in e.g. Buffer.add_int64_le, and that the performance is the same as Buffer.add_int64_ne (on a little-endian system).

For the naming: are _be, _le, _ne good enough?

Concerning the scope of this PR:

I suggest to keep the idea of providing channel operations for later.
Bigstring, which would be a whole new module, would deserve its own PR.
Unsafe versions are indeed not needed (the impact on performance is rather small anyway).
Perhaps adding get_int* functions to String makes sense, to keep it in sync with Bytes, even though most people would probably use bytes anyway for parsing binary data, I guess.

alainfrisch · 2018-06-27T11:00:48Z

One more point: it could be useful to provide both get_int16 (sign extension) and get_uint16 (always returning positive integers).

dbuenzli · 2018-06-27T11:15:46Z

Agreed on all points. I think it would be good to add the functions in String though.

One more point: it could be useful to provide both get_int16 (sign extension) and get_uint16 (always returning positive integers).

Yes ! FWIW I have this in an unpublished code of similar functionality. It's the kind of bit fiddling things you prefer to be handled by the library.

dbuenzli · 2018-06-27T11:20:09Z

stdlib/bytes.mli


+(** {6 Binary storage of numbers} *)
+
+external get_int16_ne : bytes -> int -> int = "%caml_bytes_get16"


IIRC there's no advantage to expose this as an external at the .mli level.

Exposing the external in the .mli (and hence in the .cmi) allows every use of this function to compile a call to the primitive directly (no passing through a wrapper function). Same as inlining, but this works even if inlining does not take place (e.g. in bytecode).

Is this an important property, though? Because the cost is that if we ever decide to change the implementation (for example wrapping it into a bound-checking or exception-changing routine) or primitive name, we break the interface, and that seems quite high to me.

I am not sure I understand: is there a way to write code that would break if one changes the signature from external to val ?

Some people like to copy the interfaces from the stdlib for their own implementations, and they would get bitten by this.

Other people avoid this problem by includeing the interfaces, and they get bitten every time we add a function or even an optional parameter to an existing function.

So I would like to see the externals removed from the .mlis. Bytecode speed is not an issue nowadays.

I guess @gasche was referring to something like:

module F (X : sig external f : unit -> unit = "%ignore" end) = struct end module OK = F (struct external f : unit -> unit = "%ignore" end) module KO = F (struct let f () = () end)

pmetzger · 2018-06-27T14:09:11Z

Two comments:

Why not include 8 bit quantities as well? Yes, the function to do so is trivial, but it will make some protocol handling code look more uniform.
About endianness, I'm always reminded of Rob Pike's insight that code that knows about endianness (other than compilers) is generally badly written. Here's his blog post on the topic. The code is all written in C but it's simple enough not to matter. https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html

dbuenzli · 2018-06-27T14:23:57Z

About endianness, I'm always reminded of Rob Pike's insight that code that knows about endianness (other than compilers) is generally badly written.

I don't understand what this is supposed to bring to the discussion. Would you care to expand maybe ?

alainfrisch · 2018-06-27T21:39:21Z

Are we ok with 21 new functions just in Bytes (12 get, 9 set)? Should we organize them in submodules (per endianness) as suggested by @nojb? (I generally prefer having explicit signatures.)
Assuming there is no runtime overhead, would it be good to pass the endianness choice (ne/le/be) as an argument, in order to divide the number of functions by 3?
What about adding int8/uint8 variants?
I believe that using Sys.int_size instead of Sys.word_size (as done in ocplib-endian) is better for JS backends, but that would need to be checked.

jhwoodyatt · 2018-06-27T21:55:41Z

For comparison purposes, I should point out that I have implemented logic for this purpose in my Orsetto project. Start here for details if you're interested.

Accordingly, I have no strong opinion on how this pull request might be evaluated. My unsolicited opinions on the various questions discussed above summarized:

unsafe interfaces are useful in the case where a higher-level interface is ensuring that data format invariants are respected.
byte-order should not by inferred from the system architecture when formatting data for interchange, i.e. it should always be exposed in the programming interface without a default.
formatting IEEE-754 floating point for interchange should provide support for binary16, binary32 and binary64 formats.

alainfrisch · 2018-06-27T22:01:06Z

I've tried with

val Buffer.add_int64: [`LE|`BE|`NE] -> t -> int64 -> unit

let[@ocaml.inline] add_int64 e b x =
  let new_position = b.position + 8 in
  if new_position > b.length then resize b 8;
  unsafe_set_int64 b.buffer b.position
    (match e, Sys.big_endian with
     | `LE, true | `BE, false -> swap64 x
     | _ -> x);
  b.position <- new_position

and thanks to the inline annotation, this gives the same performance than ad hoc functions when endianness is known statically.

@jhwoodyatt : do you mean that we should not expose "native endianness" functions, and let the user rely on Sys.big_endian to choose between little/big endian variants if he really wants to use the native endianness (e.g. to interact with C datastructures)? This indeed makes it more explicit which parts of the code depends on the local native endianness.

dbuenzli · 2018-06-27T22:25:02Z

and thanks to the inline annotation, this gives the same performance than ad hoc functions when endianness is known statically.

I think having all the function is in general more predictable. Except for guaranteed statically known checks done at the toplevel (like Sys.big_endian) I'm a little bit wary on relying too much on the compiler optimizations to get the baseline performance. Suppose I want to make something generic on endianness, with the current proposal I will abstract over the functions. With your last proposal I might abstract over [LE|NE|BE] which (I guess) might give poor performance if I'm not careful.

Personally I don't have a problem with all the additions (uint8 and int8 is something is also something I have in that unpublished code if you want to add them) and I personally don't see a huge benefit of gathering those in submodules. Having everything flat, including native endianness, is nice to write and obvious to read in my opinion.

jhwoodyatt · 2018-06-27T22:58:26Z

@alainfrisch no, I mean that you should provide three variations and require the choice made explicitly in the programming interface:

Little-endian byte order
Big-endian byte order (or so-called "network byte order")
System architecture byte order

That's what I did, and I think it was the right choice. Another thing I did was [basically] this:

type t = [ `BE | `LE | `SE ]
val create: [< t ] -> (module Endian)

Usually, the byte order is known statically. I think if dynamic byte order is really on the menu, then it's best to use a function that selects the right module.

dbuenzli · 2018-07-30T07:50:35Z

@alainfrisch any progress on that one ? I would like to rely on this for further stdlib improvements.

alainfrisch · 2018-07-30T09:23:15Z

Remaining points:

Write the doc for all functions. I'd like to get feedback on the existing text before copying them furiously to all other functions
Port functions from Bytes to String (once the exact design/doc for Bytes is stable)
Add tests

stdlib/buffer.mli

stdlib/bytes.mli

dbuenzli · 2018-07-30T10:05:59Z

Write the doc for all functions. I'd like to get feedback on the existing text before copying them furiously to all other functions

Reviewed, see my comments. I think we should keep the doc strings simple using the patterns I mentioned.

alainfrisch · 2018-11-08T12:24:32Z

Just one minor pending point about the use of Char.unsafe_chr. Waiting for @dbuenzli feedback on that one.

dbuenzli

I'd say we are good to go. Thanks for putting the work in this.

alainfrisch · 2018-11-08T14:50:04Z

Thanks for your help with the design and documentation!

alainfrisch added the stdlib label Jun 26, 2018

dbuenzli reviewed Jun 27, 2018

View reviewed changes