Stdlib functional iterators#1002
Conversation
There was a problem hiding this comment.
I generally think this approach looks good; since this is meant as a generic way to get out data from containers, I prefer it simple and performant rather than abstract (which I agree would be needed) and featureful. Though, I don't know how the performance compares to the previous PR.
stdlib/StdlibModules
Outdated
| hashtbl \ | ||
| int32 \ | ||
| int64 \ | ||
| iter \ |
|
|
||
| let of_seq i = | ||
| let n = ref 0 in | ||
| let buf = ref (make 256 '\000') in |
There was a problem hiding this comment.
Would be simpler to use Buffer, since geometric resizing is used anyway. (But good you avoid double iteration, since we don't know the complexity of traversing the input.)
There was a problem hiding this comment.
Yes, relying on Buffer would be more convenient, but it would create circular dependencies since Buffer already depends on Bytes.
There was a problem hiding this comment.
Of course. Your partial reimplementation is quite lean anyway.
stdlib/hashtbl.ml
Outdated
| max_bucket_length = mbl; | ||
| bucket_histogram = histo } | ||
|
|
||
| (** {6 Seq.tors} *) |
stdlib/map.mli
Outdated
| in ascending order, from key [k] or above. | ||
| @since NEXT_RELEASE *) | ||
|
|
||
| val add_seq : 'a t -> (key * 'a) Seq.t -> 'a t |
There was a problem hiding this comment.
val add_seq : (key * 'a) Seq.t -> 'a t -> 'a t for consistency with add and to make the partial application composable.
stdlib/set.mli
Outdated
| (** Iterate on the whole set, in ascending order | ||
| @since NEXT_RELEASE *) | ||
|
|
||
| val add_seq : t -> elt Seq.t -> t |
There was a problem hiding this comment.
add_seq : elt Seq.t -> t -> t, cf comment on Map.add_seq.
There was a problem hiding this comment.
fixed in last commit
|
There were some basic performance comparisons in #635. As far as |
| in | ||
| fill (len-1) tl | ||
|
|
||
| let of_seq i = |
There was a problem hiding this comment.
This can be improved later, but this function might deserve a more efficient implementation which does not build the intermediate list at least for small enough arrays (e.g. keeping the first N values on the stack using a non-tail rec loop and creating the array while unwinding the recursion). Even for larger arrays, it might be more efficient to keep the intermediate values directly in arrays (either a resizing buffer, or a list of chunks). To be checked.
There was a problem hiding this comment.
Why not, will consider it indeed.
stdlib/buffer.ml
Outdated
|
|
||
| let to_seq b = | ||
| let rec aux i () = | ||
| if i = b.position then Seq.Nil |
There was a problem hiding this comment.
Because of Buffer.{clear, reset, truncate}, you should probably test for i >= b.position instead. (Even if mutation during iteration is specified as being undefined...)
stdlib/buffer.ml
Outdated
|
|
||
| let to_seqi b = | ||
| let rec aux i () = | ||
| if i = b.position then Seq.Nil |
stdlib/array.mli
Outdated
| (** {6 Iterators} *) | ||
|
|
||
| val to_seq : 'a array -> 'a Seq.t | ||
| (** Iterate on the array, in increasing order |
There was a problem hiding this comment.
You should specify that mutation during iteration is not specified (as you did for Buffer). Or, alternatively specify the behavior as you did for Bytes.
stdlib/hashtbl.ml
Outdated
| val add: 'a t -> key -> 'a -> unit | ||
| val remove: 'a t -> key -> unit | ||
| val find: 'a t -> key -> 'a | ||
| val copy : 'a t -> 'a t |
There was a problem hiding this comment.
No reason for these whitespace changes (and the recommended style would rather be the one without the whitespace).
stdlib/hashtbl.ml
Outdated
| val stats: 'a t -> statistics | ||
| end | ||
|
|
||
| module type FULL = |
There was a problem hiding this comment.
Why do we need this distinction between S and FULL?
There was a problem hiding this comment.
that was because of the sharing of S with some ephemeron table, but I removed the sharing, so this has no reason to exist anymore indeed.
stdlib/hashtbl.mli
Outdated
| buckets by size. | ||
| @since 4.00.0 *) | ||
|
|
||
| (** {6 Seq.iterators} *) |
There was a problem hiding this comment.
Typo? (also in the doc strings below)
stdlib/map.mli
Outdated
| (** Iterate on the whole map, in ascending order | ||
| @since NEXT_RELEASE *) | ||
|
|
||
| val to_seq_at : key -> 'a t -> (key * 'a) Seq.t |
There was a problem hiding this comment.
I'm not very fond of the name. to_seq_starting_from or to_seq_from, perhaps?
There was a problem hiding this comment.
I agree to_seq_from is more descriptive than to_seq_at. But we could also incorporate more general slicing in the main sequence extractor,
val to_seq : ?first: (key -> bool) -> ?last: (key -> bool) -> 'a t -> (key * 'a) Seq.tborrowing the semantics of find_first and find_last. While the end-of-sequence can be adjusted with a sequence combinator, direct support for it has the advantage of only holding on to the part of the container it needs plus O(log(n). Corr.: There is no such guarantee on space complexity here, since the sequence may contain a single element greater than the top-level node. The should be a gain in expected complexity for small sequence lengths.
|
|
||
| val fold_left : ('a -> 'b -> 'a) -> 'a -> 'b t -> 'a | ||
|
|
||
| val iter : ('a -> unit) -> 'a t -> unit |
There was a problem hiding this comment.
What about providing a function that memoize the iteration?
There was a problem hiding this comment.
That is something non-trivial to do efficiently, in my experience. Using lazy is relatively slow, so I have something better (a kind of append-only buffer) in the gen library that could be used as a building block for memoize. However, this is imho the kind of function that might be more suited to a 3rd party library on opam (which could provide many more combinators and evolve faster than the stdlib).
There was a problem hiding this comment.
Fair enough. It might be worth adding a note to seq.mli stating that the generator must be prepared to be used in a non-linear way, so that people don't expose an overly naive generator iterating over lines of a text file.
|
I'm a bit concerned by the addition of a module to the stdlib with such a short name. One way to avoid conflicts would be to add it as a submodule of Pervasives. Another would be to use a longer name (CamlSeq) and expose an alias |
|
How about Sequence?
|
|
I wouldn't bet that sequence.ml is not used by any OCaml project (and actually, there is at least https://github.com/c-cube/sequence/blob/4e2595df8983f64cb094334f8b4e2851e3f837fa/src/sequence.ml). |
|
Indeed, I have been using |
|
I started working on a patch to prefix all the stdlib modules and use module aliases to get the short names back. It's all working, except for a few tests where types are not printed properly and some ocamldoc errors. I should be able to finish the patch tomorrow or next week, then we'll be free to add new modules to the stdlib without guilt. The branch is here: https://github.com/diml/ocaml/tree/prefix-the-stdlib |
|
is it worth the trouble updating this PR now, or should I wait for #1010 to be merged? |
18970e4 to
af64940
Compare
stdlib/ephemeron.ml
Outdated
| val filter_map_inplace: (key -> 'a -> 'a option) -> 'a t -> unit | ||
| val fold : (key -> 'a -> 'b -> 'b) -> 'a t -> 'b -> 'b | ||
| val length : 'a t -> int | ||
| val stats: 'a t -> Hashtbl.statistics |
There was a problem hiding this comment.
The inclusion of Hashtbl.S was not used just for avoiding a copy, it was also a way to force people to keep the interface of ephemerons synchronized with the one of hashtables. Could you try to add the functions to the ephemerons?
There was a problem hiding this comment.
I'm not even sure what it means to iterate on a weak hashtable… I will certainly not do that before #1010 is merged, my quota of motivation is running dry right now.
There was a problem hiding this comment.
@bobot Do you agree with delaying the addition of Seq-related functions to ephemerons?
There was a problem hiding this comment.
I think the work just amount to copy the version of hashtbl and apply the same modification than for iter. Lets see, I take the code of this PR, I compare Hashtbl.iter and Ephemerons.MakeSeeded.iter, and ... code written but not typed or tested:
let to_seq tbl =
(* capture current array, so that even if the table is resized we
keep iterating on the same array *)
let tbl_data = tbl.data in
(* state: index * next bucket to traverse *)
let rec aux i buck () = match buck with
| Empty ->
if i = Array.length tbl_data
then Seq.Nil
else aux(i+1) tbl_data.(i) ()
| Cons (_, c, next) ->
begin match H.get_key c, H.get_data c with
| None, _ | _, None -> ()
| Some key, Some data ->
Seq.Cons ((key, data), aux i next)
end; do_bucket rest in
in
aux 0 Empty
So I prefer if @c-cube could take this code, in his MR. I would prefer that the signature of ephemerons stay the same than the one of hashtbl.
|
|
||
| val return : 'a -> 'a t | ||
|
|
||
| val map : ('a -> 'b) -> 'a t -> 'b t |
There was a problem hiding this comment.
Wouldn't that be the same as flat_map found below?
stdlib/seq.mli
Outdated
|
|
||
| val iter : ('a -> unit) -> 'a t -> unit | ||
|
|
||
| val memo : 'a t -> 'a t |
There was a problem hiding this comment.
Should we have an imperative yield function?
let yield (x:' a Seq.t) : ('a option * 'a Sequence.t) =
let open Seq in match x () with
| Nil -> (None, empty)
| Cons (a, r) -> (Some a, r)There was a problem hiding this comment.
you mean a memoizing peek or something like that? I don't really get what yield would be used for.
besides, until this has chances of being merged (i.e. #1010 merged?) I won't spend any more time on that PR.
There was a problem hiding this comment.
Hi, sorry for the late reply, it is more like next in Core.Sequence. But we can wait until this PR is merged and then consider this function as a useful extension. I personally used it a lot. Not sure how others feel.
|
nicole mazzuca:
I really dislike the impure nature of iterators implemented with this approach - I'd much prefer a "state and function" approach to iteration. Has something like that been considered? Why has it been rejected?
(see https://github.com/ubsan/pred/blob/master/pred/iter.ml for an example)
In case no-one answered this - I believe the point is that the state is implicitly part of the closures returned by subsequent calls to the first closure. When iterating it's generally a bit pointless to have the consumer deal with this state themselves, they don't care about it and only care about the output values. The producer has to track it anyway, so they might as well hide it from the consumer.
In summary, this is more like Haskell's lazy lists rather than a state monad. Of course nothing prevents you from implementing an iterator as a state monad, but using them would be more awkward.
|
|
Congratulations on getting this far! What is (roughly) the amount of test coverage for the new functions? I only re-skimmed the patch, but I see tests for |
|
I'll do the ephemeron changes and that should be good, then. Thank you @alainfrisch! |
stdlib/ephemeron.ml
Outdated
| else aux(i+1) tbl_data.(i) () | ||
| | Cons (_, c, next) -> | ||
| begin match H.get_key c, H.get_data c with | ||
| | None, _ | _, None -> Seq.Nil |
There was a problem hiding this comment.
@c-cube Ok, my bad I put () in my comment (here Seq.Nil) but it should be aux i next, even if the binding is not there anymore we still need to look at the other bindings.
Thank you! |
|
I fixed a conflict in Makefile, now waiting for CI green light before merging. I second @gasche's suggestion to add more tests at some point, though. |
|
Merged, just in time for the week-end! (With a squash merge, since the history did not seem so useful.) |
|
Thank you @alainfrisch !! 😁 |
|
(@c-cube It occurred to me that such contribution would probably require a CLA. Have you already signed one?) |
|
I don't think I have, but this was written on my free time, so there should be not problem. Where should I sign the CLA? |
|
There isn't an online form for the CLA? 😕 |
|
Question: why use Context: In #2236 we define |
|
@c-cube : Do you have a comment on this latest question? |
|
That was a choice I made to prevent duplicates from making tables too big, but really there could have been more discussion on that particular point, I think. It would be useful to provide both, as there are good use cases for both add and replace semantics imho. |
See #635 for previous discussion and design.
This new proposal uses a purely functional type of delayed lists, as suggested by @fpottier
and already implemented in Batteries. The module is named
Seq.The type is not structural, sadly (using
type 'a t = unit -> ('a * 'a t) optiondoes not pass the acyclicity check), but is very simple to use, and has decent performances (see previous PR). Ping @Drup @alainfrisch @gasche @jakemcarthur @paurkedal for more discussions.About the choice of
unit -> 'a noderather than'a node lazy_t: it is much faster if the iterator is traversed once, and it entails fewer memory leaks (e.g. forrange 0 1_000_000, there will be no memoization). One can still write a memoization combinator (I wrote one forgenwhich is very efficient, using an underlying buffer).