[RFC] Add Bigstring module by nojb · Pull Request #1961 · ocaml/ocaml

nojb · 2018-08-02T13:20:55Z

This PR proposes the addition of a Bigstring module to the standard library (see MPR#7772). "Big strings" refer to one-dimensional big arrays of characters, and they are widely used for interfacing with C code, and in other contexts.

The new module Bigstring exposes an abstract type, and provides roughly the same functions as Bytes, a non-copying sub operation, plus binary numeric functions (as in #1864), and conversions to- and from- bytes, string, and bigarray types.

In addition, the new type is integrated with existing stdlib modules: Bytes.sub_bigstring is defined as well as I/O functions in both Stdlib and Unix. In order to define these functions in their natural modules and to bring the bigstring type to the same level as bytes and string, a new built-in, predefined type bigstring is defined in the compiler.

Missing:

Unix functions under Windows.
Tests

Feedback welcome! (in particular, whether the more invasive changes to the existing standard library and compiler are worthwhile).

/cc @hhugo @diml

hhugo · 2018-08-02T13:35:25Z

/cc @seliopou

hhugo · 2018-08-02T13:39:17Z

We'll also need an opam package base-bigstring for older version of the compiler.

alainfrisch · 2018-08-02T13:49:08Z

We'll also need an opam package base-bigstring for older version of the compiler.

Since when do we expect all changes to the stdlib to be made available to older versions of OCaml through OPAM? Why is adding a module different from adding a function to an existing one?

yallop · 2018-08-02T13:49:09Z

It would useful to have some indexing operators. One way of accomplishing that is to directly expose the equality between bigstring and Bigarray.Array1.t rather than hiding the type and exposing conversion functions.

It would also be useful to have the Bigstring module interface compatible with Bytes, as far as possible, so that people can switch between the two without much friction.

Finally, the naming is perhaps confusing, since strings are immutable, but bigstrings are mutable.

pmetzger · 2018-08-02T13:50:12Z

Finally, the naming is perhaps confusing, since strings are immutable, but bigstrings are mutable.

Bigbyte?

nojb · 2018-08-02T13:51:47Z

One way of accomplishing that is to directly expose the equality between bigstring and Bigarray.Array1.t

That was my initial intention, but it complicates the definition of functions on this type in Stdlib/Pervasives (e.g. output_bigstring, etc).

nojb · 2018-08-02T13:56:25Z

Finally, the naming is perhaps confusing, since strings are immutable, but bigstrings are mutable.

True. I chose this name since it is the one historically given to this concept, but I am open to suggestions! Some existing libraries that use this name:

hhugo · 2018-08-02T14:46:46Z

Since when do we expect all changes to the stdlib to be made available to older versions of OCaml
through OPAM? Why is adding a module different from adding a function to an existing one?

Let me rephrase with It would be nice. It would make the migration story much easier.

alainfrisch · 2018-08-02T15:25:15Z

I see, but doing the same for Option, Result, and all new functions would be nice as well. Of course, I've no problem (and I'm actually grateful) if people want to maintain backported forks of the stdlib for older versions of OCaml, as long as this does not slow down evolutions of the stdlib itself.

nojb · 2018-08-02T15:33:50Z

Re: backporting stdlib improvements, see also https://github.com/thierry-martinez/stdcompat

dbuenzli · 2018-08-02T15:37:17Z

I see, but doing the same for Option, Result, and all new functions would be nice as well.

This will be done in stdlib-shims, a PR for bigstring will be welcome aswell, if the primitives allow.

copy · 2018-08-06T23:42:29Z

stdlib/stdlib.mli

+val output_bigstring : out_channel -> bigstring -> int -> int -> unit
+(** Same as {!output}, but takes the data from a big string.
+
+    @since 4.08.0 *)


Give that it's easier to add new modules to the stdlib now, should we consider not polluting the global namespace with "IO" functions today, and instead add them to a separate module? (possibly later and outside of this MR)

Yes, we should consider it. However:

I don't see anyone working on this hypothetical IO module.

If and when such a module is introduced we will need to deprecate existing I/O functions in Stdlib anyway; adding one more function will not change anything much.

Personally, I am not super confident that such a module will land soon because, contrary to all the recently considered new modules (Float, Seq, Option, Result), the design space seems much larger and closely related to things such as effects and multicore, which will complicate matters.

Having said that, of course I am open to discussion, and if anyone is sitting on a IO proposal, I am happy to delay the inclusion of this function.

the design space seems much larger and closely related to things such as effects and multicore

We are indeed working on an effects-based IO library that supports direct-style function calls using effects. It's probably not worth doing a large stdlib rearrangement before that prototype is complete.

Well, conveniently this just came up: #640 (comment). And I think this is another point in favour of not adding the IO functions in Stdlib:

One way of accomplishing that is to directly expose the equality between bigstring and Bigarray.Array1.t

That was my initial intention, but it complicates the definition of functions on this type in Stdlib/Pervasives (e.g. output_bigstring, etc).

avsm · 2018-08-07T15:07:15Z

I've looked over the interface, and it looks suitable to port Mirage's cstruct over to use this module instead. In particular, we need the efficient conversion functions to/from int16/32 that were formerly the domain of Bigarray only.

pmetzger · 2018-08-07T17:42:02Z

I'll again suggest "Bigbytes", as it's much more bytes than string...

copy · 2018-08-08T18:03:28Z

stdlib/bigstring.ml

+
+let set_int64_be b i x =
+  if not Sys.big_endian then set_int64_ne b i (swap64 x)
+  else set_int64_ne b i x


Could these be written as:

let set_int64_be = if not Sys.big_endian then set_int64_swap else set_int64_ne

I.e., moving the endianness check to the top-level?

Yes, they could, but I don't think it amounts to anything: under ocamlopt both variants generate the same code thanks to inlining. And even under ocamlc you are trading a branch on Sys.big_endian for an extra function call to set_int64_swap.

In this particular case, the two ways to write the code give exactly the same result: the test is reduced at compile-time, and the branches in your version are external primitives, so they are eta-expanded when they are not applied.

dinosaure · 2018-08-09T14:36:29Z

stdlib/bigstring.mli

+    [dst]. *)
+
+val blit_bytes : bytes -> int -> t -> int -> int -> unit
+(** Same as {!blit}, but read data from a byte sequence. *)


You should precise than this function (instead blit) use internally memcpy.

Good point, thanks.

xavierleroy · 2018-08-16T08:33:17Z

The name Bigstring is cognitive dissonance. As @nojb and @pmetzger noticed it's not about strings. It's not really about Big either: yes, this library can be used to circumvent the limitations on string and bytes lengths on 32-bit platforms, which are on their way out, but on 64-bit platforms the sky is the limit for the length of strings and bytes.

I agree the main motivation for these bigarrays-of-bytes is interface with systems code. But we have 3rd-party llibraries that do this very well already, like cstruct (https://github.com/mirage/ocaml-cstruct). Is the need so strong that it must be put in the standard library? How will nonexpert users know how to choose between bytes and externally-allocated-bytes (what you call "bigstrings")?

Finally, the last time I mentioned using bigarrays-of-bytes internally in the compiler, @yminsky lectured me on how bad memory management for bigarrays is. I know we're making progress on this front, and the Mirage people don't seem to care, as they allocate small bigarrays like crazy. Still, before blessing the use of bigarrays-of-bytes as an alternative to bytes for interfacing with C code, we need to be very sure they no longer cause memory management issues.

dinosaure · 2018-08-16T14:22:48Z

Mirage people don't seem to care, as they allocate small bigarrays like crazy.

As a Mirage guy, this is my biggest problem about some implementations like ocaml-git. Some PR were already done to try to fix this problem however nobody found a good solution about that. I will happy if we can re-open this topic.

gasche · 2018-08-16T19:24:26Z

@dinosaure: the bigarray-allocation issues should have been largely fixed by GPR #1476, which was part of the 4.07 release. Have you tried again with 4.07 to see if you still had memory-management problems with bigarrays? No matter what the answer to this question is, the discussion a bit off-topic here, could we continue discussion bigarray (and bigstring) memory-allocation policies in the proper Mantis ticket, MPR#7100?

pmetzger · 2018-08-17T17:59:55Z

BTW, I get that I'm in the minority, but I still think having inconsistent naming (i.e. naming a thing with mutability "string") is unfortunate.

nojb · 2018-08-23T17:35:23Z

Thanks for the discussion everybody!

While personally I think that having bigarrays-of-bytes in the stdlib as the standard way to interface with C code would be useful, its addition looks premature at this time. As a bare minimum, a confirmation that memory management issues have been solved is necessary to move forward. Since I don't have the time to confirm this right now, I am closing this PR.

Discussion can be re-opened in the future as needed.

nojb added 3 commits August 2, 2018 15:13

Add bigstring C primitives

37345fb

Add predefined type bigstring

c1b2875

Add Bigstring module

4356f71

nojb added the stdlib label Aug 2, 2018

nojb added 6 commits August 2, 2018 21:52

Add Bytes.blit_bigstring

c2c2750

Add Stdlib.{output_bigstring,input_bigstring,really_input_bigstring}

90b9a85

Add Unix.{read_bigstring,write_bigstring,single_write_bigstring}

9902464

make alldepend

0f2c8e8

Bootstrap

eb839fd

Update Changes

1737b0b

nojb force-pushed the bigstring branch from 7f4de5d to 1737b0b Compare August 2, 2018 19:58

copy reviewed Aug 6, 2018

View reviewed changes

copy reviewed Aug 8, 2018

View reviewed changes

dinosaure reviewed Aug 9, 2018

View reviewed changes

pmetzger mentioned this pull request Aug 16, 2018

bigstring ocaml-community/meta#8

Closed

nojb closed this Aug 23, 2018

damiendoligez mentioned this pull request Nov 5, 2018

Add caml_alloc_custom_mem #1738

Merged

vicuna mentioned this pull request Mar 14, 2019

Add a bigstring module to the stdlib #7772

Closed

hannesm mentioned this pull request Mar 20, 2019

First draft to add capabilities on cstruct mirage/ocaml-cstruct#237

Merged

nojb mentioned this pull request Jul 7, 2023

Get rid of the LongString module #12360

Merged

dbuenzli mentioned this pull request Jun 4, 2025

Add {get,set}_int{16,32,64}_{le,be} to Bigarray.Array1? #14069

Closed

johnyob mentioned this pull request Dec 23, 2025

feat: [@@unstable] features #14430

Closed

Conversation

nojb commented Aug 2, 2018

Uh oh!

hhugo commented Aug 2, 2018

Uh oh!

hhugo commented Aug 2, 2018

Uh oh!

alainfrisch commented Aug 2, 2018

Uh oh!

yallop commented Aug 2, 2018

Uh oh!

pmetzger commented Aug 2, 2018

Uh oh!

nojb commented Aug 2, 2018

Uh oh!

nojb commented Aug 2, 2018

Uh oh!

hhugo commented Aug 2, 2018

Uh oh!

alainfrisch commented Aug 2, 2018

Uh oh!

nojb commented Aug 2, 2018

Uh oh!

dbuenzli commented Aug 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

avsm commented Aug 7, 2018

Uh oh!

pmetzger commented Aug 7, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xavierleroy commented Aug 16, 2018

Uh oh!

dinosaure commented Aug 16, 2018

Uh oh!

gasche commented Aug 16, 2018

Uh oh!

pmetzger commented Aug 17, 2018

Uh oh!

nojb commented Aug 23, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants

dbuenzli commented Aug 2, 2018 •

edited

Loading