Skip to content

Fix FuzzySet#getEstimatedNumberUniqueValuesAllowingForCollisions to account for hashCount#14614

Merged
gsmiller merged 1 commit intoapache:mainfrom
gsmiller:GH/FuzzySet-estimate-size-fix
May 12, 2025
Merged

Fix FuzzySet#getEstimatedNumberUniqueValuesAllowingForCollisions to account for hashCount#14614
gsmiller merged 1 commit intoapache:mainfrom
gsmiller:GH/FuzzySet-estimate-size-fix

Conversation

@gsmiller
Copy link
Contributor

@gsmiller gsmiller commented May 5, 2025

Description

Estimating bloom filter cardinality should account for the number of hash functions used. It appears this method assumes one function is used, which isn't correct.

Note: It also looks like the API surface area of FuzzySet in general may be due for a cleanup. This method doesn't appear to have any direct usage today, so it's debatable whether-or-not we should deprecate this. I'm in favor of keeping it, but would probably suggest deprecating a few other methods (include the version of this I kept around that assumes one hash function). I'll publish a separate PR where we can discuss this, but I think it's worth fixing this bug for now.

Testing

I verified that this method dramatically over-estimates bloom filter cardinality in its current state (by a factor approximately equal to the hash count) and verified this change corrects it. This was in ad hoc testing related to something else I'm working on. I notice there are no unit tests for FuzzySet so I didn't add any explicit test cases for now. I can create some tests for FuzzySet as part of this if folks have a strong opinion on this. I can also add testing as a separate task when looking into cleaning up the API surface area...

@gsmiller
Copy link
Contributor Author

gsmiller commented May 6, 2025

Related: also opened #14615 to dry up the public API a bit.

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this basically extends the API so hashCount can be supplied rather than always assumed to be 1?

@gsmiller
Copy link
Contributor Author

gsmiller commented May 8, 2025

so this basically extends the API so hashCount can be supplied rather than always assumed to be 1?

That's right. (But I think we should get rid of the version that assumes 1... kept it to be back-compat for this PR)

@gsmiller gsmiller merged commit 36b3577 into apache:main May 12, 2025
7 checks passed
gsmiller added a commit that referenced this pull request May 12, 2025
weizijun added a commit to weizijun/lucene that referenced this pull request May 16, 2025
* main: (31 commits)
  Fix termination condition in TestStressNRTReplication. (apache#14665)
  deps(java): bump com.gradle.develocity from 3.19 to 3.19.2 (apache#14662)
  Build: remove hard-coded Java versions from ecj.javadocs.prefs (apache#14651)
  Update verifier comment to show label (apache#14658)
  Catch and re-throw Throwable rather than using a success boolean (apache#14633)
  Mention label in changelog verifier comment (apache#14656)
  Enable PR actions in changelog verifier (apache#14644)
  Fix FuzzySet#getEstimatedNumberUniqueValuesAllowingForCollisions to properly account for hashCount (apache#14614)
  Don't perform additional KNN querying after timeout, fixes apache#14639 (apache#14640)
  Add instructions to help/IDEs.txt for VSCode and Neovim (apache#14646)
  build(deps): bump ruff from 0.11.7 to 0.11.8 in /dev-tools/scripts (apache#14603)
  deps(java): bump de.jflex:jflex from 1.8.2 to 1.9.1 (apache#14583)
  Use the preload hint on completion fields and memory terms dictionaries. (apache#14634)
  Clean up FileTypeHint a bit. (apache#14635)
  Expressions: Improve test to use a fully private class or method
  Remove deprecations in expressions (apache#14641)
  removing constructor with deprecated attribute 'onlyLongestMatch (apache#14356)
  Moving CHANGES entry for apache#14609 from 11.0 to 10.3 (apache#14638)
  Overrides rewrite in PointRangeQuery to optimize AllDocs/NoDocs cases (apache#14609)
  Adding benchmark for histogram collector over point range query (apache#14622)
  ...

# Conflicts:
#	lucene/CHANGES.txt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants