Skip to content

Add Hashtbl.{to,of}_list#2236

Closed
nojb wants to merge 3 commits intoocaml:trunkfrom
nojb:hashtbl_of_list
Closed

Add Hashtbl.{to,of}_list#2236
nojb wants to merge 3 commits intoocaml:trunkfrom
nojb:hashtbl_of_list

Conversation

@nojb
Copy link
Copy Markdown
Contributor

@nojb nojb commented Feb 6, 2019

No description provided.

Copy link
Copy Markdown
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are a great idea - they should also go in the functorial interface, I think?

@dra27
Copy link
Copy Markdown
Member

dra27 commented Feb 6, 2019

Note the gotchas when using add in #2202

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 6, 2019

Note the gotchas when using add in #2202

Thanks for the pointer. I added suitable (re-)definitions for Make and MakeSeeded.

@alainfrisch
Copy link
Copy Markdown
Contributor

LGTM. Could you add unit tests?

@yakobowski
Copy link
Copy Markdown
Contributor

I believe MoreLabels also needs to be updated. (Channeling my inner @garrigue) here.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 6, 2019

I believe MoreLabels also needs to be updated. (Channeling my inner @garrigue) here.

Thanks, done.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 6, 2019

LGTM. Could you add unit tests?

Done.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 6, 2019

CI green! Seems like an easy one to review. Anyone willing to approve?

@dra27
Copy link
Copy Markdown
Member

dra27 commented Feb 6, 2019

I have one further comment, which is to do with the initial size of the Hashtbl... cf. the ?size argument I propose in #2205 for of_seq. I'm wondering if it would be worth having that optional parameter here as it would then allow the particularly fastidious user to avoid the double-scan of the list.

Regardless of that, isn't initialising the hash table to twice the length of the list likely to result in a better load factor?

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 6, 2019

I have one further comment, which is to do with the initial size of the Hashtbl... cf. the ?size argument I propose in #2205 for of_seq. I'm wondering if it would be worth having that optional parameter here as it would then allow the particularly fastidious user to avoid the double-scan of the list.

The only problem is that it breaks the idiom ... |> H.of_list which is quite common in my experience. But I won't object too strongly to it. Anyone else would like to chime in on this?

Regardless of that, isn't initialising the hash table to twice the length of the list likely to result in a better load factor?

I'm not an expert; is 2 the best factor?

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Apr 10, 2019

This PR seems to be consensual; anyone up to approve it formally?


val of_list : ('a * 'b) list -> ('a, 'b) t
(** [Hashtbl.of_list l] is a table built from the given bindings. The bindings
are added in the same order they appear in [l], using {!add}.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of_list uses add, of_seq uses replace. I don't know which one is better, but the inconsistency is unfortunate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. add seems more natural to me, but I agree we should use the same function for both.

(** [Hashtbl.to_list h] is the list of bindings in [h]. The order in which the
bindings appear in the list is unspecified. However, if the table contains
several bindings for the same key, they appear in order of introduction,
that is, the most recent binding appears last.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a test to check this property.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Apr 15, 2019

@dra27 Thanks! However, the comment above still needs resolving. Namely, Hashtbl.of_seq uses Hashtbl.replace to add elements, while Hashtbl.of_list is using Hashtbl.add.

Should we use add for both? or rather replace? Or leave it as it is (which doesn't seem a good idea).

@dra27
Copy link
Copy Markdown
Member

dra27 commented Apr 15, 2019

Oops - thanks for catching that 😊.

I agree that add is more natural, although while thinking about it further, there is an inconsistency between List.assoc - which returns the leftmost binding for any given key and Hashtbl.of_list which returns a table which returns the rightmost binding.

@dra27
Copy link
Copy Markdown
Member

dra27 commented Apr 15, 2019

Given that 4.08 includes some critical fixes to Hashtbl.of_seq, I'm tempted to cope with the breaking change of switching it to use add?

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented May 6, 2019

Ping. Can we reach an agreement on whether Hashtbl.add or Hashtbl.replace should be used in Hashtbl.of_list and Hashtbl.of_seq ?

@dra27
Copy link
Copy Markdown
Member

dra27 commented Jun 10, 2019

I do think there should be a consistency between List.assoc and Hashtbl.of_list - i.e. compute the list length while also reversing the list and add (using Hashtbl.add) the items so that the first binding in the list is the one you get.

If it's documented, I can live with the difference between of_seq and of_list, especially if of_list has the semantics I suggest, as converting the sequence to a list in order to reverse it first feels very weird. Maybe they're just different structures?

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Sep 9, 2019

Let's try to converge on this issue. To summarize, we have the following options :

  1. Switch Hashtbl.of_seq to use Hashtbl.add (BREAKING CHANGE) and use Hashtbl.add for Hashtbl.of_list as well.
  2. Leave Hashtbl.of_seq as it is (using Hashtbl.replace), and use Hashtbl.add for Hashtbl.of_list.
  3. Leave Hashtbl.of_seq as it is, and use Hashtbl.replace for Hashtbl.of_list so that both functions have a similar behaviour.
  4. Leave Hashtbl.of_seq as it is, but add Hashtbl.of_seq_add and Hashtbl.of_seq_replace and similarly Hashtbl.of_list_add, Hashtbl.of_list_replace (no Hashtbl.of_list would be defined)

What do you think we should do ? @dra27 @alainfrisch @c-cube @gasche

To me it feels either 1 or 4 are probably best.

@gasche
Copy link
Copy Markdown
Member

gasche commented Sep 9, 2019

I think that @c-cube's initial design of having both add_seq and replace_seq was the right design, in fact I would say that adding of_seq there was arguably a mistake: it's easy to write add_seq (Hashtbl.make ...) which forces users to think about how to create the hashtable (seeded or not, with an accumulator size, etc.), which is the right thing to do when using something as treacherously convenient as a hash table.

I think you should add add_list and replace_list to this PR. If it was just me, I would also get rid of of_list. If you want to have it, I agree that it must make the same choice as of_seq. (Personally I would have chosen the add version for both, but I don't care strongly. I also think that breaking compatibility with e122acc (part of 4.08) is irritating but doable).

@alainfrisch
Copy link
Copy Markdown
Contributor

I think you should add add_list and replace_list to this PR.

Would that be:

  1. val add_list: ('a, 'b) Hashtbl.t -> ('a * 'b) list -> unit
    or
  2. val add_list: ('a, 'b) Hashtbl.t -> ('a * 'b) list -> ('a, 'b) Hashtbl.t

(1) significantly increases the verbosity of code such as let lookup l = Hashtbl.find_opt (Hashtbl.of_list l) (==> let lookup l = let tbl = Hashtbl.create 16 in Hashtbl.add_list tbl l; Hashtbl.find_opt tbl). To the point where it could almost be better to add instead a tupled version of add val Hashtbl.add_pair: ('a, 'b) Hashtbl.t -> ('a * 'b) -> unit, which could be used together with List.iter, Seq.iter, Set(X).iter, etc... [I can see coming the proposal for adding val Fun.pairify: ('a -> 'b -> unit) -> (('a * 'b) -> unit) so that one can write List.iter (Hashtbl.add tbl |> Fun.pairify)]

(2) might work better in practice, but it feels a bit ad hoc or at least foreign to current stdlib idioms (jQuery users would feel at home, though).

@gasche
Copy link
Copy Markdown
Member

gasche commented Sep 9, 2019

It would be -> unit, for consistency with {add,replace}_seq.

I think that your concern could be partially alleviated with a ('a -> unit) -> 'a -> 'a combinator in the Fun module (Fun.do_?), which would also be useful in some cases of writing composition pipelines (see for example Asmgen.pass_dump_if, which is an instance of this combinator).

I guess you write Hashtbl.find_opt (Hashtbl.of_list li) as the first example that came to mind, but wouldn't you use List.assoc in this case? (Are you subtly relying on the replace behavior to implement a sort of List.assoc_last operation?). I was going to comment on the hidden quadratic-blowup risk of hashtables and the benefit of using worst-case-guaranteed balanced maps instead; I think it's not necessarily bad to nudge people away from using hashtables in practice when they don't want to think about how they are created.

@alainfrisch
Copy link
Copy Markdown
Contributor

alainfrisch commented Sep 9, 2019

think that your concern could be partially alleviated with a ('a -> unit) -> 'a -> 'a combinator

Rather ('a -> 'b -> unit) -> 'a -> 'b -> 'a, I suppose. But I'm afraid we are not encouraging people to write reader-friendly code, and far from the terseness as the original proposal.

but wouldn't you use List.assoc in this case?

The point was to preprocess the association list into a hashtable and return a fast lookup function which can be used many times. (Yes, Map might avoid a worst-case scenario, but it is more heavy for casual use, and perhaps sometimes less efficient.)

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Sep 10, 2019

What do you think of having of_list_add and of_list_replace (and also of_seq_add and of_seq_replace) ?

@gasche
Copy link
Copy Markdown
Member

gasche commented Sep 10, 2019

I don't have strong opinions against it, but to me the problem is not only to choose add/replace, but also to pass the table-creation parameters (this was also a problem with #2202 which started by addin ?random and ?size parameters to of_seq, and then removed them again). Personally I think that duplicating the table-creation API in each of_* function is not a very nice API, and that having an API which separates creation from population is the better choice -- for data-structures that have a delicate creation ceremony.

@c-cube
Copy link
Copy Markdown
Contributor

c-cube commented Sep 10, 2019

I stand by my original design. I very rarely use Hashtbl.add (and only when doing something like lexical scoping with shadowing), and it's very useful to be able to produce a (k,v) seq with a few lines of combinators piped into one another to finally obtain a v K_tbl.t using something like of_seq. Compare to rust with collect, and the JVM languages with builders, etc.

@alainfrisch
Copy link
Copy Markdown
Contributor

What do you think of having of_list_add and of_list_replace (and also of_seq_add and of_seq_replace) ?

Independently of whether it is a good idea to have both, I don't like the names, which mix a functional view ("translate" from one data representation to another) and a procedural one (a verb describing how insertion is done, imperatively).

As for the idea of having both functions, I share @gasche views (also not having a strong opinion).

I stand by my original design. I very rarely use Hashtbl.add (and only when doing something like lexical scoping with shadowing)

I rarely use Hashtbl.add in context likes lexical scoping, together with Hashtbl.remove. But I defintely use Hashtbl.add quite a bit together with Hashtbl.find_all.

Again, I don't have a strong opinion, but if we had to choose only one sematics, the one based on Hashtbl.add seems a bit more natural for Hashtbl.of_list: those X.of_foo functions usually try to preserve as much of the information from the input value as possibly allowed by the output type. If an association list maps a key to multiple values, there is no reason to not represent that in the target Hashtbl. Filtering out duplicates looks more like something that should be an explicit action than something to be embedded within a "type mapping" function.

I don't think this will bring anything to the discussion, but just to mention another possible direction: allow specifying at Hashtbl creation-time a flag that ensures that calls to Hashtbl.add on that table are interpreted as Hashtbl.replace (e.g: Hashtbl.create ~without_duplicates:() 32). If we had that, and since we rarely mix Hashtbl.add/replace on the same table, one could just have one version of Hashtbl.of_list taking the table explicitly as input (and returning it). This would be slightly more verbose, but gives the full flexibility on how the table is created (seed, initial size, "without duplicates" semantics). That would be add_list: ('a, 'b) Hashtbl.t -> ('a * 'b) list -> ('a, 'b) Hashtbl.t from a previous note; I still believe this feel a bit ad hoc, but perhaps not so much.

@bluddy
Copy link
Copy Markdown
Contributor

bluddy commented Sep 10, 2019

I agree with @c-cube, and as a side point, I think add was a mistake in the original API. It's a cause of confusion (at best). Our hashtable behaves like no other language's hashtable, and that's a problem, because when a user reaches for a hashtable, they almost certainly want replace, but unfortunately the function that will jump out at them is add, causing bugs and user confusion.

I also think the random parameter should just be true by default (both in of_list, and in general hashtable creation). This is an instance of giving they user too many options IMO. Most users won't look into the details, and use unseeded hashes when they should use seeded ones. If one really wants deterministic behavior, one can set the random seed.

The same thing applies to an initial size in the particular of_list case. If one really cares that much about the initial size for performance reasons, it is advisable to do things manually. A reasonable initial size should be fine for 95% of cases.

@dra27
Copy link
Copy Markdown
Member

dra27 commented Sep 27, 2019

I also think the random parameter should just be true by default (both in of_list, and in general hashtable creation). This is an instance of giving they user too many options IMO. Most users won't look into the details, and use unseeded hashes when they should use seeded ones. If one really wants deterministic behavior, one can set the random seed.

It's tangential to this PR, but I looked into this a while ago. The compiler doesn't build in this mode, and it was utterly non-trivial to figure out why. At some point, I'll break that branch open again, if no one else does, but there be dragons...

@dra27
Copy link
Copy Markdown
Member

dra27 commented Sep 27, 2019

@alainfrisch's idea may be a reasonable compromise?!

@alainfrisch alainfrisch requested a review from dra27 November 7, 2019 13:16
@damiendoligez
Copy link
Copy Markdown
Member

I agree with @gasche that the API makes much more sense with add_seq / replace_seq / add_list / replace_list, notwithstanding the terseness of @alainfrisch's code (hidden currying, yuck; I'd rather have verbose and readable code).

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 25, 2020

I agree with @gasche that the API makes much more sense with add_seq / replace_seq / add_list / replace_list, notwithstanding the terseness of @alainfrisch's code (hidden currying, yuck; I'd rather have verbose and readable code).

OK, I pushed a new version adding the following functions (one "list" function for each existing "seq" function) :

val add_list : ('a, 'b) t -> ('a * 'b) list -> unit
val replace_list : ('a, 'b) t -> ('a * 'b) list -> unit
val of_list : ('a * 'b) list -> ('a, 'b) t
val to_list : ('a, 'b) t -> ('a * 'b) list
val to_list_keys : ('a, 'b) t -> 'a list
val to_list_values : ('a, 'b) t -> 'b list

where of_list has the same definition as of_seq, using replace.

What do you think? Is it worth adding these specialised functions, or do we decide that we should just use the "seq" functions + List.{to,of}_seq ?

@dbuenzli
Copy link
Copy Markdown
Contributor

val to_list_keys : ('a, 'b) t -> 'a list
val to_list_values : ('a, 'b) t -> 'b list

Hashtbl.to_list_{keys,values} reads very weirdly. What about Hashtbl.{keys,values}_to_list or Hashtbl.{keys,values} ? Or Hashtbl.fold.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Feb 26, 2020

Hashtbl.to_list_{keys,values} reads very weirdly. What about Hashtbl.{keys,values}_to_list or Hashtbl.{keys,values} ? Or Hashtbl.fold.

I like Hashtbl.{keys,values} myself; I used to_list_{keys,values} to maintain uniformity with to_seq_{keys,values}.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Jun 12, 2020

PR seems to have stalled, even though I was under the impression that there was support, modulo naming, for its latest iteration (but I may be misremembering though). It consists in adding the following functions to Hashtbl:

val add_list : ('a, 'b) t -> ('a * 'b) list -> unit
val replace_list : ('a, 'b) t -> ('a * 'b) list -> unit
val of_list : ('a * 'b) list -> ('a, 'b) t
val to_list : ('a, 'b) t -> ('a * 'b) list
val keys : ('a, 'b) t -> 'a list
val values : ('a, 'b) t -> 'b list

Can some of the other developers that participated in the discussion chime in? Otherwise I would like to close this PR.

@dra27 @gasche @alainfrisch @dbuenzli

UPDATE: updated to_list_{keys,values} => {keys,values} which seems nicer to read.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jun 12, 2020

I'm not excited by the functions proposed in this PR.

  • As mentioned previously, I believe that we should not introduce new functions that create a new hashtable, instead having functions that populate an existing hashtable. For this reason, I think of_list is not a good idea.

  • For the other functions, what they are doing is duplicating an API we have with Seq to also make it available for List. Personally I see interest in a form of minimalism where we focus on one sequential data structure for data exchange, and Seq fits the bill from my perspective. (Maybe there are performance reasons to cut the intermediate sequence, but then those should be discussed carefully, with interesting benchmarks, and maybe there are other solutions than duplicating every function of our API.)

    I don't think that the functions are bad (I have a slight preference for to_list_keys rather than just keys for consistency), but we may want to make a decision on whether, in general, we want to expose both List and Seq sequence APIs, (I'm not saying that if we decide this, we have to upgrade all modules at once), and possibly other types (option, result ?).

Note: another approach would be to ensure that we can use fold functions to lift individual functions into sequence functions. For add_list that would be one of

List.fold_left Hashtbl.add_pair t li
List.fold_left_pair Hashtbl.add t li
List.fold_left (Fun.on_pair Hashtbl.add) t li

for to_list that would be one of

Hashtbl.fold_pair List.cons []
Hashtbl.fold List.cons_pair []
Hashtbl.fold (Fun.curry List.cons) []

I'm actually somewhat interested in Hashtbl.add_pair and Hashtbl.fold_pair (the general idea may be, for some functions that take several arguments, to propose both a currified and a tupled version). They have the potential to be used for many different things -- removing the temptation to go wild with point-free programming.

@nojb
Copy link
Copy Markdown
Contributor Author

nojb commented Jun 12, 2020

I'm not excited by the functions proposed in this PR.

Thanks for chiming in :)

I tend to agree with your arguments. My own experience points to of_list being particularly common and useful (much more than all the other functions) which is why I insisted on having it. Re: Seq.t while in principle it should "solve" the API issue of transferring data between different kinds of containers, in practice I don't use it much (if at all) because creating the intermediate data structure just feels wasteful. I confess I don't have a clear picture of its cost; maybe if I did I would happily make use of it :)

In any case, I think you made important points (which extend beyond this particular PR) to keep in mind for future evolutions of the stdlib.

@nojb nojb closed this Jun 12, 2020
@gasche
Copy link
Copy Markdown
Member

gasche commented Jun 12, 2020

I might propose Hashtbl.add_pair and Hashtbl.fold_pair as an independent PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants