Add Hashtbl.{to,of}_list by nojb · Pull Request #2236 · ocaml/ocaml

nojb · 2019-02-06T09:20:02Z

No description provided.

dra27

These functions are a great idea - they should also go in the functorial interface, I think?

stdlib/hashtbl.mli

dra27 · 2019-02-06T09:27:32Z

Note the gotchas when using add in #2202

nojb · 2019-02-06T09:40:11Z

Note the gotchas when using add in #2202

Thanks for the pointer. I added suitable (re-)definitions for Make and MakeSeeded.

alainfrisch · 2019-02-06T09:47:57Z

LGTM. Could you add unit tests?

yakobowski · 2019-02-06T09:52:45Z

I believe MoreLabels also needs to be updated. (Channeling my inner @garrigue) here.

nojb · 2019-02-06T09:55:16Z

I believe MoreLabels also needs to be updated. (Channeling my inner @garrigue) here.

Thanks, done.

nojb · 2019-02-06T10:03:07Z

LGTM. Could you add unit tests?

Done.

nojb · 2019-02-06T13:16:48Z

CI green! Seems like an easy one to review. Anyone willing to approve?

dra27 · 2019-02-06T19:17:48Z

I have one further comment, which is to do with the initial size of the Hashtbl... cf. the ?size argument I propose in #2205 for of_seq. I'm wondering if it would be worth having that optional parameter here as it would then allow the particularly fastidious user to avoid the double-scan of the list.

Regardless of that, isn't initialising the hash table to twice the length of the list likely to result in a better load factor?

nojb · 2019-02-06T20:26:03Z

I have one further comment, which is to do with the initial size of the Hashtbl... cf. the ?size argument I propose in #2205 for of_seq. I'm wondering if it would be worth having that optional parameter here as it would then allow the particularly fastidious user to avoid the double-scan of the list.

The only problem is that it breaks the idiom ... |> H.of_list which is quite common in my experience. But I won't object too strongly to it. Anyone else would like to chime in on this?

Regardless of that, isn't initialising the hash table to twice the length of the list likely to result in a better load factor?

I'm not an expert; is 2 the best factor?

nojb · 2019-04-10T10:40:05Z

This PR seems to be consensual; anyone up to approve it formally?

alainfrisch · 2019-04-10T10:55:06Z

stdlib/hashtbl.mli


+val of_list : ('a * 'b) list -> ('a, 'b) t
+(** [Hashtbl.of_list l] is a table built from the given bindings. The bindings
+    are added in the same order they appear in [l], using {!add}.


of_list uses add, of_seq uses replace. I don't know which one is better, but the inconsistency is unfortunate.

Indeed. add seems more natural to me, but I agree we should use the same function for both.

alainfrisch · 2019-04-10T10:56:06Z

stdlib/hashtbl.mli

+(** [Hashtbl.to_list h] is the list of bindings in [h]. The order in which the
+    bindings appear in the list is unspecified. However, if the table contains
+    several bindings for the same key, they appear in order of introduction,
+    that is, the most recent binding appears last.


Might be worth adding a test to check this property.

nojb · 2019-04-15T09:07:57Z

@dra27 Thanks! However, the comment above still needs resolving. Namely, Hashtbl.of_seq uses Hashtbl.replace to add elements, while Hashtbl.of_list is using Hashtbl.add.

Should we use add for both? or rather replace? Or leave it as it is (which doesn't seem a good idea).

dra27 · 2019-04-15T09:23:12Z

Oops - thanks for catching that 😊.

I agree that add is more natural, although while thinking about it further, there is an inconsistency between List.assoc - which returns the leftmost binding for any given key and Hashtbl.of_list which returns a table which returns the rightmost binding.

dra27 · 2019-04-15T09:25:49Z

Given that 4.08 includes some critical fixes to Hashtbl.of_seq, I'm tempted to cope with the breaking change of switching it to use add?

nojb · 2019-05-06T07:19:35Z

Ping. Can we reach an agreement on whether Hashtbl.add or Hashtbl.replace should be used in Hashtbl.of_list and Hashtbl.of_seq ?

dra27 · 2019-06-10T08:15:20Z

I do think there should be a consistency between List.assoc and Hashtbl.of_list - i.e. compute the list length while also reversing the list and add (using Hashtbl.add) the items so that the first binding in the list is the one you get.

If it's documented, I can live with the difference between of_seq and of_list, especially if of_list has the semantics I suggest, as converting the sequence to a list in order to reverse it first feels very weird. Maybe they're just different structures?

nojb · 2019-09-09T06:44:16Z

Let's try to converge on this issue. To summarize, we have the following options :

Switch Hashtbl.of_seq to use Hashtbl.add (BREAKING CHANGE) and use Hashtbl.add for Hashtbl.of_list as well.
Leave Hashtbl.of_seq as it is (using Hashtbl.replace), and use Hashtbl.add for Hashtbl.of_list.
Leave Hashtbl.of_seq as it is, and use Hashtbl.replace for Hashtbl.of_list so that both functions have a similar behaviour.
Leave Hashtbl.of_seq as it is, but add Hashtbl.of_seq_add and Hashtbl.of_seq_replace and similarly Hashtbl.of_list_add, Hashtbl.of_list_replace (no Hashtbl.of_list would be defined)

What do you think we should do ? @dra27 @alainfrisch @c-cube @gasche

To me it feels either 1 or 4 are probably best.

gasche · 2019-09-09T09:15:21Z

I think that @c-cube's initial design of having both add_seq and replace_seq was the right design, in fact I would say that adding of_seq there was arguably a mistake: it's easy to write add_seq (Hashtbl.make ...) which forces users to think about how to create the hashtable (seeded or not, with an accumulator size, etc.), which is the right thing to do when using something as treacherously convenient as a hash table.

I think you should add add_list and replace_list to this PR. If it was just me, I would also get rid of of_list. If you want to have it, I agree that it must make the same choice as of_seq. (Personally I would have chosen the add version for both, but I don't care strongly. I also think that breaking compatibility with e122acc (part of 4.08) is irritating but doable).

alainfrisch · 2019-09-09T11:43:17Z

I think you should add add_list and replace_list to this PR.

Would that be:

val add_list: ('a, 'b) Hashtbl.t -> ('a * 'b) list -> unit
or
val add_list: ('a, 'b) Hashtbl.t -> ('a * 'b) list -> ('a, 'b) Hashtbl.t

(1) significantly increases the verbosity of code such as let lookup l = Hashtbl.find_opt (Hashtbl.of_list l) (==> let lookup l = let tbl = Hashtbl.create 16 in Hashtbl.add_list tbl l; Hashtbl.find_opt tbl). To the point where it could almost be better to add instead a tupled version of add val Hashtbl.add_pair: ('a, 'b) Hashtbl.t -> ('a * 'b) -> unit, which could be used together with List.iter, Seq.iter, Set(X).iter, etc... [I can see coming the proposal for adding val Fun.pairify: ('a -> 'b -> unit) -> (('a * 'b) -> unit) so that one can write List.iter (Hashtbl.add tbl |> Fun.pairify)]

(2) might work better in practice, but it feels a bit ad hoc or at least foreign to current stdlib idioms (jQuery users would feel at home, though).

gasche · 2019-09-09T12:09:05Z

It would be -> unit, for consistency with {add,replace}_seq.

I think that your concern could be partially alleviated with a ('a -> unit) -> 'a -> 'a combinator in the Fun module (Fun.do_?), which would also be useful in some cases of writing composition pipelines (see for example Asmgen.pass_dump_if, which is an instance of this combinator).

I guess you write Hashtbl.find_opt (Hashtbl.of_list li) as the first example that came to mind, but wouldn't you use List.assoc in this case? (Are you subtly relying on the replace behavior to implement a sort of List.assoc_last operation?). I was going to comment on the hidden quadratic-blowup risk of hashtables and the benefit of using worst-case-guaranteed balanced maps instead; I think it's not necessarily bad to nudge people away from using hashtables in practice when they don't want to think about how they are created.

alainfrisch · 2019-09-09T12:25:28Z

think that your concern could be partially alleviated with a ('a -> unit) -> 'a -> 'a combinator

Rather ('a -> 'b -> unit) -> 'a -> 'b -> 'a, I suppose. But I'm afraid we are not encouraging people to write reader-friendly code, and far from the terseness as the original proposal.

but wouldn't you use List.assoc in this case?

The point was to preprocess the association list into a hashtable and return a fast lookup function which can be used many times. (Yes, Map might avoid a worst-case scenario, but it is more heavy for casual use, and perhaps sometimes less efficient.)

nojb · 2019-09-10T10:57:40Z

What do you think of having of_list_add and of_list_replace (and also of_seq_add and of_seq_replace) ?

gasche · 2019-09-10T12:07:05Z

I don't have strong opinions against it, but to me the problem is not only to choose add/replace, but also to pass the table-creation parameters (this was also a problem with #2202 which started by addin ?random and ?size parameters to of_seq, and then removed them again). Personally I think that duplicating the table-creation API in each of_* function is not a very nice API, and that having an API which separates creation from population is the better choice -- for data-structures that have a delicate creation ceremony.

c-cube · 2019-09-10T13:37:58Z

I stand by my original design. I very rarely use Hashtbl.add (and only when doing something like lexical scoping with shadowing), and it's very useful to be able to produce a (k,v) seq with a few lines of combinators piped into one another to finally obtain a v K_tbl.t using something like of_seq. Compare to rust with collect, and the JVM languages with builders, etc.

alainfrisch · 2019-09-10T14:15:59Z

What do you think of having of_list_add and of_list_replace (and also of_seq_add and of_seq_replace) ?

Independently of whether it is a good idea to have both, I don't like the names, which mix a functional view ("translate" from one data representation to another) and a procedural one (a verb describing how insertion is done, imperatively).

As for the idea of having both functions, I share @gasche views (also not having a strong opinion).

I stand by my original design. I very rarely use Hashtbl.add (and only when doing something like lexical scoping with shadowing)

I rarely use Hashtbl.add in context likes lexical scoping, together with Hashtbl.remove. But I defintely use Hashtbl.add quite a bit together with Hashtbl.find_all.

Again, I don't have a strong opinion, but if we had to choose only one sematics, the one based on Hashtbl.add seems a bit more natural for Hashtbl.of_list: those X.of_foo functions usually try to preserve as much of the information from the input value as possibly allowed by the output type. If an association list maps a key to multiple values, there is no reason to not represent that in the target Hashtbl. Filtering out duplicates looks more like something that should be an explicit action than something to be embedded within a "type mapping" function.

I don't think this will bring anything to the discussion, but just to mention another possible direction: allow specifying at Hashtbl creation-time a flag that ensures that calls to Hashtbl.add on that table are interpreted as Hashtbl.replace (e.g: Hashtbl.create ~without_duplicates:() 32). If we had that, and since we rarely mix Hashtbl.add/replace on the same table, one could just have one version of Hashtbl.of_list taking the table explicitly as input (and returning it). This would be slightly more verbose, but gives the full flexibility on how the table is created (seed, initial size, "without duplicates" semantics). That would be add_list: ('a, 'b) Hashtbl.t -> ('a * 'b) list -> ('a, 'b) Hashtbl.t from a previous note; I still believe this feel a bit ad hoc, but perhaps not so much.

bluddy · 2019-09-10T14:44:56Z

I agree with @c-cube, and as a side point, I think add was a mistake in the original API. It's a cause of confusion (at best). Our hashtable behaves like no other language's hashtable, and that's a problem, because when a user reaches for a hashtable, they almost certainly want replace, but unfortunately the function that will jump out at them is add, causing bugs and user confusion.

I also think the random parameter should just be true by default (both in of_list, and in general hashtable creation). This is an instance of giving they user too many options IMO. Most users won't look into the details, and use unseeded hashes when they should use seeded ones. If one really wants deterministic behavior, one can set the random seed.

The same thing applies to an initial size in the particular of_list case. If one really cares that much about the initial size for performance reasons, it is advisable to do things manually. A reasonable initial size should be fine for 95% of cases.

dra27 · 2019-09-27T09:33:21Z

I also think the random parameter should just be true by default (both in of_list, and in general hashtable creation). This is an instance of giving they user too many options IMO. Most users won't look into the details, and use unseeded hashes when they should use seeded ones. If one really wants deterministic behavior, one can set the random seed.

It's tangential to this PR, but I looked into this a while ago. The compiler doesn't build in this mode, and it was utterly non-trivial to figure out why. At some point, I'll break that branch open again, if no one else does, but there be dragons...

dra27 · 2019-09-27T09:35:48Z

@alainfrisch's idea may be a reasonable compromise?!

damiendoligez · 2020-02-25T15:34:00Z

I agree with @gasche that the API makes much more sense with add_seq / replace_seq / add_list / replace_list, notwithstanding the terseness of @alainfrisch's code (hidden currying, yuck; I'd rather have verbose and readable code).

nojb · 2020-02-25T21:49:21Z

I agree with @gasche that the API makes much more sense with add_seq / replace_seq / add_list / replace_list, notwithstanding the terseness of @alainfrisch's code (hidden currying, yuck; I'd rather have verbose and readable code).

OK, I pushed a new version adding the following functions (one "list" function for each existing "seq" function) :

val add_list : ('a, 'b) t -> ('a * 'b) list -> unit
val replace_list : ('a, 'b) t -> ('a * 'b) list -> unit
val of_list : ('a * 'b) list -> ('a, 'b) t
val to_list : ('a, 'b) t -> ('a * 'b) list
val to_list_keys : ('a, 'b) t -> 'a list
val to_list_values : ('a, 'b) t -> 'b list

where of_list has the same definition as of_seq, using replace.

What do you think? Is it worth adding these specialised functions, or do we decide that we should just use the "seq" functions + List.{to,of}_seq ?

dbuenzli · 2020-02-25T22:02:58Z

val to_list_keys : ('a, 'b) t -> 'a list
val to_list_values : ('a, 'b) t -> 'b list

Hashtbl.to_list_{keys,values} reads very weirdly. What about Hashtbl.{keys,values}_to_list or Hashtbl.{keys,values} ? Or Hashtbl.fold.

nojb · 2020-02-26T07:05:13Z

Hashtbl.to_list_{keys,values} reads very weirdly. What about Hashtbl.{keys,values}_to_list or Hashtbl.{keys,values} ? Or Hashtbl.fold.

I like Hashtbl.{keys,values} myself; I used to_list_{keys,values} to maintain uniformity with to_seq_{keys,values}.

nojb · 2020-06-12T05:15:43Z

PR seems to have stalled, even though I was under the impression that there was support, modulo naming, for its latest iteration (but I may be misremembering though). It consists in adding the following functions to Hashtbl:

val add_list : ('a, 'b) t -> ('a * 'b) list -> unit
val replace_list : ('a, 'b) t -> ('a * 'b) list -> unit
val of_list : ('a * 'b) list -> ('a, 'b) t
val to_list : ('a, 'b) t -> ('a * 'b) list
val keys : ('a, 'b) t -> 'a list
val values : ('a, 'b) t -> 'b list

Can some of the other developers that participated in the discussion chime in? Otherwise I would like to close this PR.

@dra27 @gasche @alainfrisch @dbuenzli

UPDATE: updated to_list_{keys,values} => {keys,values} which seems nicer to read.

gasche · 2020-06-12T07:46:48Z

I'm not excited by the functions proposed in this PR.

As mentioned previously, I believe that we should not introduce new functions that create a new hashtable, instead having functions that populate an existing hashtable. For this reason, I think of_list is not a good idea.
For the other functions, what they are doing is duplicating an API we have with Seq to also make it available for List. Personally I see interest in a form of minimalism where we focus on one sequential data structure for data exchange, and Seq fits the bill from my perspective. (Maybe there are performance reasons to cut the intermediate sequence, but then those should be discussed carefully, with interesting benchmarks, and maybe there are other solutions than duplicating every function of our API.)

I don't think that the functions are bad (I have a slight preference for to_list_keys rather than just keys for consistency), but we may want to make a decision on whether, in general, we want to expose both List and Seq sequence APIs, (I'm not saying that if we decide this, we have to upgrade all modules at once), and possibly other types (option, result ?).

Note: another approach would be to ensure that we can use fold functions to lift individual functions into sequence functions. For add_list that would be one of

List.fold_left Hashtbl.add_pair t li
List.fold_left_pair Hashtbl.add t li
List.fold_left (Fun.on_pair Hashtbl.add) t li

for to_list that would be one of

Hashtbl.fold_pair List.cons []
Hashtbl.fold List.cons_pair []
Hashtbl.fold (Fun.curry List.cons) []

I'm actually somewhat interested in Hashtbl.add_pair and Hashtbl.fold_pair (the general idea may be, for some functions that take several arguments, to propose both a currified and a tupled version). They have the potential to be used for many different things -- removing the temptation to go wild with point-free programming.

nojb · 2020-06-12T07:57:24Z

I'm not excited by the functions proposed in this PR.

Thanks for chiming in :)

I tend to agree with your arguments. My own experience points to of_list being particularly common and useful (much more than all the other functions) which is why I insisted on having it. Re: Seq.t while in principle it should "solve" the API issue of transferring data between different kinds of containers, in practice I don't use it much (if at all) because creating the intermediate data structure just feels wasteful. I confess I don't have a clear picture of its cost; maybe if I did I would happily make use of it :)

In any case, I think you made important points (which extend beyond this particular PR) to keep in mind for future evolutions of the stdlib.

gasche · 2020-06-12T09:23:46Z

I might propose Hashtbl.add_pair and Hashtbl.fold_pair as an independent PR.

nojb force-pushed the hashtbl_of_list branch from 4ffaa2b to 2af1d8b Compare February 6, 2019 09:21

dra27 reviewed Feb 6, 2019

View reviewed changes

stdlib/hashtbl.mli Outdated Show resolved Hide resolved

nojb force-pushed the hashtbl_of_list branch from 2af1d8b to d7afd00 Compare February 6, 2019 09:27

alainfrisch added the stdlib label Feb 6, 2019

nojb force-pushed the hashtbl_of_list branch from 2007088 to 1b34fed Compare February 6, 2019 10:04

nojb mentioned this pull request Feb 7, 2019

Add Array.map_{of,to}_list #2238

Closed

dra27 mentioned this pull request Mar 19, 2019

Hashtbl.MakeSeeded.{add,replace,of}_seq use wrong hash function #2202

Merged

nojb force-pushed the hashtbl_of_list branch from 3640ad2 to ee65f0b Compare April 10, 2019 10:39

alainfrisch reviewed Apr 10, 2019

View reviewed changes

alainfrisch mentioned this pull request Apr 10, 2019

Improve doc for Hashtbl.to_seq* #8605

Merged

nojb mentioned this pull request Apr 11, 2019

Stdlib functional iterators #1002

Merged

dra27 approved these changes Apr 15, 2019

View reviewed changes

dbuenzli mentioned this pull request Sep 15, 2019

[RFC] Add Stdlib.{In,Out}_channel #8937

Closed

alainfrisch requested a review from dra27 November 7, 2019 13:16

nojb added 3 commits February 25, 2020 22:39

Add Hashtbl.{add,replace,of}_list,to_list{,_keys,_values}

9cb89a4

Add test

1d6ceb0

Changes

c671d24

nojb force-pushed the hashtbl_of_list branch from 6091fb0 to c671d24 Compare February 25, 2020 21:40

gasche mentioned this pull request May 4, 2020

Add ?size and ?random to Hashtbl.of_seq and related functions #2205

Closed

nojb closed this Jun 12, 2020

gasche mentioned this pull request Jun 12, 2020

List.fold_assoc_{left,right} + Hashtbl.fold_pair #9671

Open

Conversation

nojb commented Feb 6, 2019

Uh oh!

dra27 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dra27 commented Feb 6, 2019

Uh oh!

nojb commented Feb 6, 2019

Uh oh!

alainfrisch commented Feb 6, 2019

Uh oh!

yakobowski commented Feb 6, 2019

Uh oh!

nojb commented Feb 6, 2019

Uh oh!

nojb commented Feb 6, 2019

Uh oh!

nojb commented Feb 6, 2019

Uh oh!

dra27 commented Feb 6, 2019

Uh oh!

nojb commented Feb 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nojb commented Apr 10, 2019

Uh oh!

alainfrisch Apr 10, 2019

Choose a reason for hiding this comment

Uh oh!

nojb Apr 10, 2019

Choose a reason for hiding this comment

Uh oh!

alainfrisch Apr 10, 2019

Choose a reason for hiding this comment

Uh oh!

nojb Apr 10, 2019

Choose a reason for hiding this comment

Uh oh!

nojb commented Apr 15, 2019

Uh oh!

dra27 commented Apr 15, 2019

Uh oh!

dra27 commented Apr 15, 2019

Uh oh!

nojb commented May 6, 2019

Uh oh!

dra27 commented Jun 10, 2019

Uh oh!

nojb commented Sep 9, 2019

Uh oh!

gasche commented Sep 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alainfrisch commented Sep 9, 2019

Uh oh!

gasche commented Sep 9, 2019

Uh oh!

alainfrisch commented Sep 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nojb commented Sep 10, 2019

Uh oh!

gasche commented Sep 10, 2019

Uh oh!

c-cube commented Sep 10, 2019

Uh oh!

alainfrisch commented Sep 10, 2019

Uh oh!

bluddy commented Sep 10, 2019

Uh oh!

dra27 commented Sep 27, 2019

Uh oh!

dra27 commented Sep 27, 2019

Uh oh!

damiendoligez commented Feb 25, 2020

Uh oh!

nojb commented Feb 25, 2020

Uh oh!

nojb commented Feb 6, 2019 •

edited

Loading

gasche commented Sep 9, 2019 •

edited

Loading

alainfrisch commented Sep 9, 2019 •

edited

Loading

nojb commented Jun 12, 2020 •

edited

Loading

gasche commented Jun 12, 2020 •

edited

Loading