Fix FuzzySet#getEstimatedNumberUniqueValuesAllowingForCollisions to account for hashCount#14614
Merged
gsmiller merged 1 commit intoapache:mainfrom May 12, 2025
Merged
Conversation
…roperly account for hashCount
Contributor
Author
|
Related: also opened #14615 to dry up the public API a bit. |
msokolov
approved these changes
May 6, 2025
Contributor
msokolov
left a comment
There was a problem hiding this comment.
so this basically extends the API so hashCount can be supplied rather than always assumed to be 1?
Contributor
Author
That's right. (But I think we should get rid of the version that assumes 1... kept it to be back-compat for this PR) |
gsmiller
added a commit
that referenced
this pull request
May 12, 2025
…roperly account for hashCount (#14614)
weizijun
added a commit
to weizijun/lucene
that referenced
this pull request
May 16, 2025
* main: (31 commits) Fix termination condition in TestStressNRTReplication. (apache#14665) deps(java): bump com.gradle.develocity from 3.19 to 3.19.2 (apache#14662) Build: remove hard-coded Java versions from ecj.javadocs.prefs (apache#14651) Update verifier comment to show label (apache#14658) Catch and re-throw Throwable rather than using a success boolean (apache#14633) Mention label in changelog verifier comment (apache#14656) Enable PR actions in changelog verifier (apache#14644) Fix FuzzySet#getEstimatedNumberUniqueValuesAllowingForCollisions to properly account for hashCount (apache#14614) Don't perform additional KNN querying after timeout, fixes apache#14639 (apache#14640) Add instructions to help/IDEs.txt for VSCode and Neovim (apache#14646) build(deps): bump ruff from 0.11.7 to 0.11.8 in /dev-tools/scripts (apache#14603) deps(java): bump de.jflex:jflex from 1.8.2 to 1.9.1 (apache#14583) Use the preload hint on completion fields and memory terms dictionaries. (apache#14634) Clean up FileTypeHint a bit. (apache#14635) Expressions: Improve test to use a fully private class or method Remove deprecations in expressions (apache#14641) removing constructor with deprecated attribute 'onlyLongestMatch (apache#14356) Moving CHANGES entry for apache#14609 from 11.0 to 10.3 (apache#14638) Overrides rewrite in PointRangeQuery to optimize AllDocs/NoDocs cases (apache#14609) Adding benchmark for histogram collector over point range query (apache#14622) ... # Conflicts: # lucene/CHANGES.txt
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Estimating bloom filter cardinality should account for the number of hash functions used. It appears this method assumes one function is used, which isn't correct.
Note: It also looks like the API surface area of
FuzzySetin general may be due for a cleanup. This method doesn't appear to have any direct usage today, so it's debatable whether-or-not we should deprecate this. I'm in favor of keeping it, but would probably suggest deprecating a few other methods (include the version of this I kept around that assumes one hash function). I'll publish a separate PR where we can discuss this, but I think it's worth fixing this bug for now.Testing
I verified that this method dramatically over-estimates bloom filter cardinality in its current state (by a factor approximately equal to the hash count) and verified this change corrects it. This was in ad hoc testing related to something else I'm working on. I notice there are no unit tests for
FuzzySetso I didn't add any explicit test cases for now. I can create some tests forFuzzySetas part of this if folks have a strong opinion on this. I can also add testing as a separate task when looking into cleaning up the API surface area...