Skip to content

[RFC] Add Bigstring module#1961

Closed
nojb wants to merge 9 commits intoocaml:trunkfrom
nojb:bigstring
Closed

[RFC] Add Bigstring module#1961
nojb wants to merge 9 commits intoocaml:trunkfrom
nojb:bigstring

Conversation

@nojb
Copy link
Copy Markdown
Contributor

@nojb nojb commented Aug 2, 2018

This PR proposes the addition of a Bigstring module to the standard library (see MPR#7772). "Big strings" refer to one-dimensional big arrays of characters, and they are widely used for interfacing with C code, and in other contexts.

The new module Bigstring exposes an abstract type, and provides roughly the same functions as Bytes, a non-copying sub operation, plus binary numeric functions (as in #1864), and conversions to- and from- bytes, string, and bigarray types.

In addition, the new type is integrated with existing stdlib modules: Bytes.sub_bigstring is defined as well as I/O functions in both Stdlib and Unix. In order to define these functions in their natural modules and to bring the bigstring type to the same level as bytes and string, a new built-in, predefined type bigstring is defined in the compiler.

Missing:

  • Unix functions under Windows.
  • Tests

Feedback welcome! (in particular, whether the more invasive changes to the existing standard library and compiler are worthwhile).

/cc @hhugo @diml

@hhugo
Copy link
Copy Markdown
Contributor

hhugo commented Aug 2, 2018

/cc @seliopou

@hhugo
Copy link
Copy Markdown
Contributor

hhugo commented Aug 2, 2018

We'll also need an opam package base-bigstring for older version of the compiler.

@alainfrisch
Copy link
Copy Markdown
Contributor

We'll also need an opam package base-bigstring for older version of the compiler.

Since when do we expect all changes to the stdlib to be made available to older versions of OCaml through OPAM? Why is adding a module different from adding a function to an existing one?

@yallop
Copy link
Copy Markdown
Member

yallop commented Aug 2, 2018

It would useful to have some indexing operators. One way of accomplishing that is to directly expose the equality between bigstring and Bigarray.Array1.t rather than hiding the type and exposing conversion functions.

It would also be useful to have the Bigstring module interface compatible with Bytes, as far as possible, so that people can switch between the two without much friction.

Finally, the naming is perhaps confusing, since strings are immutable, but bigstrings are mutable.

@pmetzger
Copy link
Copy Markdown
Member

pmetzger commented Aug 2, 2018

Finally, the naming is perhaps confusing, since strings are immutable, but bigstrings are mutable.

Bigbyte?

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Aug 2, 2018

One way of accomplishing that is to directly expose the equality between bigstring and Bigarray.Array1.t

That was my initial intention, but it complicates the definition of functions on this type in Stdlib/Pervasives (e.g. output_bigstring, etc).

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Aug 2, 2018

Finally, the naming is perhaps confusing, since strings are immutable, but bigstrings are mutable.

True. I chose this name since it is the one historically given to this concept, but I am open to suggestions! Some existing libraries that use this name:

@nojb nojb added the stdlib label Aug 2, 2018
@hhugo
Copy link
Copy Markdown
Contributor

hhugo commented Aug 2, 2018

Since when do we expect all changes to the stdlib to be made available to older versions of OCaml
through OPAM? Why is adding a module different from adding a function to an existing one?

Let me rephrase with It would be nice. It would make the migration story much easier.

@alainfrisch
Copy link
Copy Markdown
Contributor

I see, but doing the same for Option, Result, and all new functions would be nice as well. Of course, I've no problem (and I'm actually grateful) if people want to maintain backported forks of the stdlib for older versions of OCaml, as long as this does not slow down evolutions of the stdlib itself.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Aug 2, 2018

Re: backporting stdlib improvements, see also https://github.com/thierry-martinez/stdcompat

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Aug 2, 2018

I see, but doing the same for Option, Result, and all new functions would be nice as well.

This will be done in stdlib-shims, a PR for bigstring will be welcome aswell, if the primitives allow.

val output_bigstring : out_channel -> bigstring -> int -> int -> unit
(** Same as {!output}, but takes the data from a big string.

@since 4.08.0 *)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give that it's easier to add new modules to the stdlib now, should we consider not polluting the global namespace with "IO" functions today, and instead add them to a separate module? (possibly later and outside of this MR)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should consider it. However:

  • I don't see anyone working on this hypothetical IO module.
  • If and when such a module is introduced we will need to deprecate existing I/O functions in Stdlib anyway; adding one more function will not change anything much.
  • Personally, I am not super confident that such a module will land soon because, contrary to all the recently considered new modules (Float, Seq, Option, Result), the design space seems much larger and closely related to things such as effects and multicore, which will complicate matters.

Having said that, of course I am open to discussion, and if anyone is sitting on a IO proposal, I am happy to delay the inclusion of this function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the design space seems much larger and closely related to things such as effects and multicore

We are indeed working on an effects-based IO library that supports direct-style function calls using effects. It's probably not worth doing a large stdlib rearrangement before that prototype is complete.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, conveniently this just came up: #640 (comment). And I think this is another point in favour of not adding the IO functions in Stdlib:

One way of accomplishing that is to directly expose the equality between bigstring and Bigarray.Array1.t

That was my initial intention, but it complicates the definition of functions on this type in Stdlib/Pervasives (e.g. output_bigstring, etc).

@avsm
Copy link
Copy Markdown
Member

avsm commented Aug 7, 2018

I've looked over the interface, and it looks suitable to port Mirage's cstruct over to use this module instead. In particular, we need the efficient conversion functions to/from int16/32 that were formerly the domain of Bigarray only.

@pmetzger
Copy link
Copy Markdown
Member

pmetzger commented Aug 7, 2018

I'll again suggest "Bigbytes", as it's much more bytes than string...


let set_int64_be b i x =
if not Sys.big_endian then set_int64_ne b i (swap64 x)
else set_int64_ne b i x
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these be written as:

let set_int64_be =
   if not Sys.big_endian then set_int64_swap
   else set_int64_ne

I.e., moving the endianness check to the top-level?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they could, but I don't think it amounts to anything: under ocamlopt both variants generate the same code thanks to inlining. And even under ocamlc you are trading a branch on Sys.big_endian for an extra function call to set_int64_swap.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, the two ways to write the code give exactly the same result: the test is reduced at compile-time, and the branches in your version are external primitives, so they are eta-expanded when they are not applied.

[dst]. *)

val blit_bytes : bytes -> int -> t -> int -> int -> unit
(** Same as {!blit}, but read data from a byte sequence. *)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should precise than this function (instead blit) use internally memcpy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks.

@xavierleroy
Copy link
Copy Markdown
Contributor

The name Bigstring is cognitive dissonance. As @nojb and @pmetzger noticed it's not about strings. It's not really about Big either: yes, this library can be used to circumvent the limitations on string and bytes lengths on 32-bit platforms, which are on their way out, but on 64-bit platforms the sky is the limit for the length of strings and bytes.

I agree the main motivation for these bigarrays-of-bytes is interface with systems code. But we have 3rd-party llibraries that do this very well already, like cstruct (https://github.com/mirage/ocaml-cstruct). Is the need so strong that it must be put in the standard library? How will nonexpert users know how to choose between bytes and externally-allocated-bytes (what you call "bigstrings")?

Finally, the last time I mentioned using bigarrays-of-bytes internally in the compiler, @yminsky lectured me on how bad memory management for bigarrays is. I know we're making progress on this front, and the Mirage people don't seem to care, as they allocate small bigarrays like crazy. Still, before blessing the use of bigarrays-of-bytes as an alternative to bytes for interfacing with C code, we need to be very sure they no longer cause memory management issues.

@dinosaure
Copy link
Copy Markdown

Mirage people don't seem to care, as they allocate small bigarrays like crazy.

As a Mirage guy, this is my biggest problem about some implementations like ocaml-git. Some PR were already done to try to fix this problem however nobody found a good solution about that. I will happy if we can re-open this topic.

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 16, 2018

@dinosaure: the bigarray-allocation issues should have been largely fixed by GPR #1476, which was part of the 4.07 release. Have you tried again with 4.07 to see if you still had memory-management problems with bigarrays? No matter what the answer to this question is, the discussion a bit off-topic here, could we continue discussion bigarray (and bigstring) memory-allocation policies in the proper Mantis ticket, MPR#7100?

@pmetzger
Copy link
Copy Markdown
Member

BTW, I get that I'm in the minority, but I still think having inconsistent naming (i.e. naming a thing with mutability "string") is unfortunate.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Aug 23, 2018

Thanks for the discussion everybody!

While personally I think that having bigarrays-of-bytes in the stdlib as the standard way to interface with C code would be useful, its addition looks premature at this time. As a bare minimum, a confirmation that memory management issues have been solved is necessary to move forward. Since I don't have the time to confirm this right now, I am closing this PR.

Discussion can be re-opened in the future as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.