Skip to content

Stdlib functional iterators#1002

Merged
alainfrisch merged 25 commits intoocaml:trunkfrom
c-cube:stdlib-iterators
Mar 16, 2018
Merged

Stdlib functional iterators#1002
alainfrisch merged 25 commits intoocaml:trunkfrom
c-cube:stdlib-iterators

Conversation

@c-cube
Copy link
Copy Markdown
Contributor

@c-cube c-cube commented Jan 10, 2017

See #635 for previous discussion and design.

This new proposal uses a purely functional type of delayed lists, as suggested by @fpottier
and already implemented in Batteries. The module is named Seq.

type 'a node =
  | Nil
  | Cons of 'a * 'a t

and 'a t = unit -> 'a node

The type is not structural, sadly (using type 'a t = unit -> ('a * 'a t) option does not pass the acyclicity check), but is very simple to use, and has decent performances (see previous PR). Ping @Drup @alainfrisch @gasche @jakemcarthur @paurkedal for more discussions.

About the choice of unit -> 'a node rather than 'a node lazy_t: it is much faster if the iterator is traversed once, and it entails fewer memory leaks (e.g. for range 0 1_000_000, there will be no memoization). One can still write a memoization combinator (I wrote one for gen which is very efficient, using an underlying buffer).

Copy link
Copy Markdown

@paurkedal paurkedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally think this approach looks good; since this is meant as a generic way to get out data from containers, I prefer it simple and performant rather than abstract (which I agree would be needed) and featureful. Though, I don't know how the performance compares to the previous PR.

hashtbl \
int32 \
int64 \
iter \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be seq, right?


let of_seq i =
let n = ref 0 in
let buf = ref (make 256 '\000') in
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be simpler to use Buffer, since geometric resizing is used anyway. (But good you avoid double iteration, since we don't know the complexity of traversing the input.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, relying on Buffer would be more convenient, but it would create circular dependencies since Buffer already depends on Bytes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. Your partial reimplementation is quite lean anyway.

max_bucket_length = mbl;
bucket_histogram = histo }

(** {6 Seq.tors} *)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search & replace error in heading?

stdlib/map.mli Outdated
in ascending order, from key [k] or above.
@since NEXT_RELEASE *)

val add_seq : 'a t -> (key * 'a) Seq.t -> 'a t
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val add_seq : (key * 'a) Seq.t -> 'a t -> 'a t for consistency with add and to make the partial application composable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, will do.

stdlib/set.mli Outdated
(** Iterate on the whole set, in ascending order
@since NEXT_RELEASE *)

val add_seq : t -> elt Seq.t -> t
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_seq : elt Seq.t -> t -> t, cf comment on Map.add_seq.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in last commit

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Jan 11, 2017

There were some basic performance comparisons in #635. As far as map, flat_map and the likes are concerned, @Drup made some benchmarks a while ago that put the various types discussed previously roughly in the same order, except for ('a -> unit) -> unit that is much faster on some of the operations it supports.

in
fill (len-1) tl

let of_seq i =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be improved later, but this function might deserve a more efficient implementation which does not build the intermediate list at least for small enough arrays (e.g. keeping the first N values on the stack using a non-tail rec loop and creating the array while unwinding the recursion). Even for larger arrays, it might be more efficient to keep the intermediate values directly in arrays (either a resizing buffer, or a list of chunks). To be checked.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not, will consider it indeed.

stdlib/buffer.ml Outdated

let to_seq b =
let rec aux i () =
if i = b.position then Seq.Nil
Copy link
Copy Markdown
Contributor

@alainfrisch alainfrisch Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of Buffer.{clear, reset, truncate}, you should probably test for i >= b.position instead. (Even if mutation during iteration is specified as being undefined...)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

stdlib/buffer.ml Outdated

let to_seqi b =
let rec aux i () =
if i = b.position then Seq.Nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

stdlib/array.mli Outdated
(** {6 Iterators} *)

val to_seq : 'a array -> 'a Seq.t
(** Iterate on the array, in increasing order
Copy link
Copy Markdown
Contributor

@alainfrisch alainfrisch Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should specify that mutation during iteration is not specified (as you did for Buffer). Or, alternatively specify the behavior as you did for Bytes.

val add: 'a t -> key -> 'a -> unit
val remove: 'a t -> key -> unit
val find: 'a t -> key -> 'a
val copy : 'a t -> 'a t
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason for these whitespace changes (and the recommended style would rather be the one without the whitespace).

val stats: 'a t -> statistics
end

module type FULL =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this distinction between S and FULL?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was because of the sharing of S with some ephemeron table, but I removed the sharing, so this has no reason to exist anymore indeed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c-cube Can you address this point?

buckets by size.
@since 4.00.0 *)

(** {6 Seq.iterators} *)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? (also in the doc strings below)

stdlib/map.mli Outdated
(** Iterate on the whole map, in ascending order
@since NEXT_RELEASE *)

val to_seq_at : key -> 'a t -> (key * 'a) Seq.t
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very fond of the name. to_seq_starting_from or to_seq_from, perhaps?

Copy link
Copy Markdown

@paurkedal paurkedal Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to_seq_from is more descriptive than to_seq_at. But we could also incorporate more general slicing in the main sequence extractor,

val to_seq : ?first: (key -> bool) -> ?last: (key -> bool) -> 'a t -> (key * 'a) Seq.t

borrowing the semantics of find_first and find_last. While the end-of-sequence can be adjusted with a sequence combinator, direct support for it has the advantage of only holding on to the part of the container it needs plus O(log(n). Corr.: There is no such guarantee on space complexity here, since the sequence may contain a single element greater than the top-level node. The should be a gain in expected complexity for small sequence lengths.


val fold_left : ('a -> 'b -> 'a) -> 'a -> 'b t -> 'a

val iter : ('a -> unit) -> 'a t -> unit
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about providing a function that memoize the iteration?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is something non-trivial to do efficiently, in my experience. Using lazy is relatively slow, so I have something better (a kind of append-only buffer) in the gen library that could be used as a building block for memoize. However, this is imho the kind of function that might be more suited to a 3rd party library on opam (which could provide many more combinators and evolve faster than the stdlib).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. It might be worth adding a note to seq.mli stating that the generator must be prepared to be used in a non-linear way, so that people don't expose an overly naive generator iterating over lines of a text file.

@alainfrisch
Copy link
Copy Markdown
Contributor

I'm a bit concerned by the addition of a module to the stdlib with such a short name. One way to avoid conflicts would be to add it as a submodule of Pervasives. Another would be to use a longer name (CamlSeq) and expose an alias module Seq = CamlSeq in Pervasives. (There probably requires some tweaks to the build system.)

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Jan 12, 2017 via email

@alainfrisch
Copy link
Copy Markdown
Contributor

I wouldn't bet that sequence.ml is not used by any OCaml project (and actually, there is at least https://github.com/c-cube/sequence/blob/4e2595df8983f64cb094334f8b4e2851e3f837fa/src/sequence.ml).

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Jan 12, 2017

Indeed, I have been using Sequence as a toplevel module, it's in several projects. Maybe the Caml_seq + alias would be better (although I'm afraid when a maintainer mentions "tweaks to the build system").

@ghost
Copy link
Copy Markdown

ghost commented Jan 12, 2017

I started working on a patch to prefix all the stdlib modules and use module aliases to get the short names back. It's all working, except for a few tests where types are not printed properly and some ocamldoc errors. I should be able to finish the patch tomorrow or next week, then we'll be free to add new modules to the stdlib without guilt. The branch is here: https://github.com/diml/ocaml/tree/prefix-the-stdlib

@ghost ghost mentioned this pull request Jan 13, 2017
@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Jan 24, 2017

is it worth the trouble updating this PR now, or should I wait for #1010 to be merged?

val filter_map_inplace: (key -> 'a -> 'a option) -> 'a t -> unit
val fold : (key -> 'a -> 'b -> 'b) -> 'a t -> 'b -> 'b
val length : 'a t -> int
val stats: 'a t -> Hashtbl.statistics
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inclusion of Hashtbl.S was not used just for avoiding a copy, it was also a way to force people to keep the interface of ephemerons synchronized with the one of hashtables. Could you try to add the functions to the ephemerons?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even sure what it means to iterate on a weak hashtable… I will certainly not do that before #1010 is merged, my quota of motivation is running dry right now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bobot Do you agree with delaying the addition of Seq-related functions to ephemerons?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the work just amount to copy the version of hashtbl and apply the same modification than for iter. Lets see, I take the code of this PR, I compare Hashtbl.iter and Ephemerons.MakeSeeded.iter, and ... code written but not typed or tested:

let to_seq tbl =
  (* capture current array, so that even if the table is resized we
     keep iterating on the same array *)
  let tbl_data = tbl.data in
  (* state: index * next bucket to traverse *)
  let rec aux i buck () = match buck with
    | Empty ->
        if i = Array.length tbl_data
        then Seq.Nil
        else aux(i+1) tbl_data.(i) ()
    | Cons (_, c, next) ->
            begin match H.get_key c, H.get_data c with
            | None, _ | _, None -> ()
            | Some key, Some data ->
                 Seq.Cons ((key, data), aux i next)
            end; do_bucket rest in
        
  in
  aux 0 Empty

So I prefer if @c-cube could take this code, in his MR. I would prefer that the signature of ephemerons stay the same than the one of hashtbl.


val return : 'a -> 'a t

val map : ('a -> 'b) -> 'a t -> 'b t
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also have a >>=?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that be the same as flat_map found below?

stdlib/seq.mli Outdated

val iter : ('a -> unit) -> 'a t -> unit

val memo : 'a t -> 'a t
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have an imperative yield function?

let yield (x:' a Seq.t) : ('a option * 'a Sequence.t) = 
  let open Seq in match x () with 
  | Nil -> (None, empty)
  | Cons (a, r) -> (Some a, r)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean a memoizing peek or something like that? I don't really get what yield would be used for.

besides, until this has chances of being merged (i.e. #1010 merged?) I won't spend any more time on that PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, sorry for the late reply, it is more like next in Core.Sequence. But we can wait until this PR is merged and then consider this function as a useful extension. I personally used it a lot. Not sure how others feel.

@damiendoligez damiendoligez added this to the 4.07-or-later milestone Sep 29, 2017
@infinity0
Copy link
Copy Markdown
Contributor

infinity0 commented Jan 21, 2018 via email

@gasche
Copy link
Copy Markdown
Member

gasche commented Mar 14, 2018

Congratulations on getting this far!

What is (roughly) the amount of test coverage for the new functions? I only re-skimmed the patch, but I see tests for Seq.filter. Is the idea to eventually get more? (I think that basic tests for the various of_seq/to_seq functions for example could be interesting.)

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Mar 14, 2018

I'll do the ephemeron changes and that should be good, then. Thank you @alainfrisch!

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Mar 14, 2018

I added the basics for Ephemeron, following @bobot's advice. @gasche I added a few tests for Set and Map, and adding basic {to_seq,of_seq} hashtbl tests right now.

else aux(i+1) tbl_data.(i) ()
| Cons (_, c, next) ->
begin match H.get_key c, H.get_data c with
| None, _ | _, None -> Seq.Nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c-cube Ok, my bad I put () in my comment (here Seq.Nil) but it should be aux i next, even if the binding is not there anymore we still need to look at the other bindings.

@bobot
Copy link
Copy Markdown
Contributor

bobot commented Mar 15, 2018

I added the basics for Ephemeron, following @bobot's advice.

Thank you!

@alainfrisch
Copy link
Copy Markdown
Contributor

I fixed a conflict in Makefile, now waiting for CI green light before merging. I second @gasche's suggestion to add more tests at some point, though.

@alainfrisch alainfrisch merged commit df80f34 into ocaml:trunk Mar 16, 2018
@alainfrisch
Copy link
Copy Markdown
Contributor

Merged, just in time for the week-end! (With a squash merge, since the history did not seem so useful.)

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Mar 16, 2018

Thank you @alainfrisch !! 😁
Now I think I should write a compatibility package for the 'a Seq.t type, and add more tests.

@alainfrisch
Copy link
Copy Markdown
Contributor

(@c-cube It occurred to me that such contribution would probably require a CLA. Have you already signed one?)

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Mar 23, 2018

I don't think I have, but this was written on my free time, so there should be not problem. Where should I sign the CLA?

@alainfrisch
Copy link
Copy Markdown
Contributor

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Mar 23, 2018

There isn't an online form for the CLA? 😕
This is going to take time, I don't have a printer…

@nojb
Copy link
Copy Markdown
Contributor

nojb commented Apr 11, 2019

Question: why use Hashtbl.replace in the implementation of Hashtbl.of_seq, instead of Hashtbl.add ?

Context: In #2236 we define Hashtbl.of_list and it seems natural to use Hashtbl.add to implement it (instead of Hashtbl.replace) but it also needs to be compatible with Hashtbl.of_seq.

@damiendoligez
Copy link
Copy Markdown
Member

@c-cube : Do you have a comment on this latest question?

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Jul 31, 2019

That was a choice I made to prevent duplicates from making tables too big, but really there could have been more discussion on that particular point, I think. It would be useful to provide both, as there are good use cases for both add and replace semantics imho.

@c-cube c-cube deleted the stdlib-iterators branch October 25, 2022 16:19
stedolan pushed a commit to stedolan/ocaml that referenced this pull request Mar 21, 2023
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.