Skip to content

Hashtbl.MakeSeeded.{add,replace,of}_seq use wrong hash function#2202

Merged
xavierleroy merged 2 commits intoocaml:trunkfrom
dra27:hashtbl-seq
Mar 29, 2019
Merged

Hashtbl.MakeSeeded.{add,replace,of}_seq use wrong hash function#2202
xavierleroy merged 2 commits intoocaml:trunkfrom
dra27:hashtbl-seq

Conversation

@dra27
Copy link
Copy Markdown
Member

@dra27 dra27 commented Dec 17, 2018

This GPR initially addresses a bug, which is that Hashtbl.Make.of_seq creates hash tables in randomized mode, if it has been turned on (e.g. with OCAMLRUNPARAM=R). It then goes one step further and, given that it wraps Hashtbl.create, adds the ?random parameter to Hashtbl.of_seq. of_seq also creates hash tables with a hard-coded initial size of 16, so while I was there, I added ?size. This gets propagated to Ephemeron.

The first commit should really go in 4.08 - the interface change in the second only adds optional parameters, so should also be safe to go in, even at this stage? Are we planning 4.07.2 (@damiendoligez)?

cc @c-cube, @bobot

@dra27 dra27 added the bug label Dec 17, 2018
@dra27 dra27 added this to the 4.08 milestone Dec 17, 2018
@alainfrisch
Copy link
Copy Markdown
Contributor

This GPR initially addresses a bug, which is that Hashtbl.Make.of_seq creates hash tables in randomized mode, if it has been turned on (e.g. with OCAMLRUNPARAM=R).

The current implementation is not satisfactory, but is it really buggy? In this case you mention, the seed field will be populated, but the hash function will ignore it, so this should be mostly harmless (except perhaps a small slowdown to produce the useless seed). Did you find an actual problem? (I'm not claiming the problem should not be fixed, but I'm trying to understand the extent of the issue.)

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Dec 17, 2018

It appears to be a necessary fix for the compiler, but I agree that it doesn't look like it should be fixing it. I'll dig further.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Dec 17, 2018

It's definitely a bug, but not exactly the one I'd stated. The problem is actually add_seq and replace_seq both of which should be redefined in MakeSeeded to ensure that they use the hash function from the functor, rather than default the hash function.

Further commit following - thanks @alainfrisch!

Hashtbl.MakeSeeded.{add,replace}_seq were not using the hash function
provided by the functor (Hashtbl.MakeSeeded.of_seq uses replace_seq and
so also has to be redefined locally).
Book-keeping error only - although it does potentially initialise the PRNG
unnecessarily.
@dra27 dra27 force-pushed the hashtbl-seq branch 2 times, most recently from 8166903 to b8e4c61 Compare December 17, 2018 11:51
@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Dec 17, 2018

So, this time it should pass CI. cd5f93c is very important (the functor is broken, so that's a bug), 14dee98 fixes the crime of initialising the PRNG when it's not actually needed (i.e. it's not that serious, but it's better to be correct) and then b8e4c61 is what IMO the API should look like (i.e. it's much more optional a change)

@alainfrisch
Copy link
Copy Markdown
Contributor

Ah, this was a serious bug indeed.

Two comments:

  • What about moving the toplevel definition for key_index in Hashtbl and all its dependencies after the implementation of the functor? That would avoid such bugs.

  • I'm concerned that adding optional arguments could break some code such as foo |> X.of_seq, or at least raise warning 48 ("implicit elimination of optional argument ?..."). Is it why you marked the change as breaking?

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Dec 17, 2018

For the refactor, yes agreed - although presumably better done after merging (the bug fix) since it's quite disruptive to the diff. Equally, there's a lot of duplication - is that just because the functorial interface is younger than the module itself, or is it that we need flambda for include MakeSeeded(struct ... end) to be as efficient as the duplicated definitions?

For the optional arguments, I'd marked as breaking simply because, as an interface change, it is breaking! My opinion, surprisingly at the moment for someone British, is that perhaps you cannot have your Warning 48 cake and then grumble when your use |> eats it...! There are obviously other scenarios where it breaks code - the testsuite being another example (and, indeed, Ephemeron), but I considered that this would primarily affect Hashtbl extensions (Core, Batteries, ExtLib, etc.), which would expect to have to be updated with a new version of OCaml if another function got added anyway.

I could cope with the ?size parameter being dropped, but I think the lack of ?random would not be great? Also happy to split this into two GPRs, given that the bug bit is more serious than I'd originally realised...

@alainfrisch
Copy link
Copy Markdown
Contributor

I find Warning 48 quite reasonable; without it, a simple identifier reference can end up executing code and allocating. It would be interesting to see how performance is degraded for code such as foo |> X.of_seq due to the addition of the optional arguments; does the optimizer removes the closure allocation?

Anyway, the bug fix part should obviously be accepted quickly, so splitting the PR is a good idea.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Dec 17, 2018

I wasn't arguing against warning 48, per se - it was the combination of the two features! In terms of performance, wouldn't we be micro-managing this a bit - of_seq is responsible for allocating an entire hash table with, at present, 16 potentially unused buckets, so are the closures for any optional arguments likely to be a concern?

@dra27 dra27 changed the title Add ?size and ?random to Hashtbl.of_seq and related functions Hashtbl.MakeSeeded.{add,replace,of}_seq use wrong hash function Dec 17, 2018
@alainfrisch
Copy link
Copy Markdown
Contributor

LGTM. One could merge as is if needed, but it would be better if you could add a non-regression test for that bug.

@dra27 dra27 mentioned this pull request Feb 6, 2019
@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 6, 2019

What's the status of this PR? It looks like something we want to have in 4.08 if we have it.

@XVilka
Copy link
Copy Markdown
Contributor

XVilka commented Mar 19, 2019

Will be this a part of the upcoming 4.08 release if I might ask?

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Mar 19, 2019

Rebased.

@gasche - sorry, this slipped off my radar, but both commits should really go into 4.08 (just to clarify, as this PR got split - there are no interface changes in this PR)

@alainfrisch - I just started trying to write a test, but it seems really fake here - the only way this would break would be by actively removing those functions and any regression test wouldn't guard against the additional of some future from_ and of_ functions (cf. #2236). Conversely, your suggestion of refactoring would prevent both regression and any future errors when adding similar functions - I've set myself a reminder to work on this. Are you happy for this to go in without further tests, therefore?

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Mar 19, 2019

Thanks for the ping, @XVilka!

@alainfrisch
Copy link
Copy Markdown
Contributor

Yes, ok @dra27, we really want the fix in 4.08, with or without tests.

@gasche
Copy link
Copy Markdown
Member

gasche commented Mar 19, 2019

I agree with the idea of getting this in 4.08 and I think there is still time for this. (Someone needs to approve the PR.) If we don't have tests, I would personally be interested in seeing the suggested refactoring done -- I'm assuming that @alainfrisch approves now and would be willing to review the refactoring as an extra commit.

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as well.

@xavierleroy xavierleroy merged commit 9474860 into ocaml:trunk Mar 29, 2019
@xavierleroy
Copy link
Copy Markdown
Contributor

Merged in trunk (commits fc8be50 and 9474860) and cherry-picked to 4.08 (commits b98f075 and d84c3aa).

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Mar 29, 2019

@xavierleroy - we have another snap! 🙂 I was just about to push the refactor commit, but I'll put in a separate PR (it's not for 4.08 anyway)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants