Skip to content

Add LSHBloom to Datasketch#246

Merged
ekzhu merged 29 commits intoekzhu:masterfrom
123epsilon:lsh_bloom
Nov 5, 2025
Merged

Add LSHBloom to Datasketch#246
ekzhu merged 29 commits intoekzhu:masterfrom
123epsilon:lsh_bloom

Conversation

@123epsilon
Copy link
Copy Markdown
Contributor

This PR adds the LSHBloom algorithm to datasketch along with tests, docs, and a simple benchmark that compares LSH to LSHBloom across various values of the new parameter fp, which controls the false positive rate of the Bloom filter index that LSHBloom uses to save space.

@ekzhu

Copy link
Copy Markdown
Owner

@ekzhu ekzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! For the test failures, could you see if it's due to missing dependencies?

@123epsilon
Copy link
Copy Markdown
Contributor Author

Looks great! For the test failures, could you see if it's due to missing dependencies?

This is a bit confusing, but it seems that there is an issue with CPython and pybloomfiltermmap3 which I haven't been able to reproduce (pip installing on redhat, macos, windows seems to always work for me).

I see that this issue has cropped up for some version fo python before: prashnts/pybloomfiltermmap3#55

There is now another fork that's maintained with 3.12+, so I'll update the deps to use that one instead.

@123epsilon
Copy link
Copy Markdown
Contributor Author

123epsilon commented Jun 4, 2025

Also, @ekzhu there seems to be an issue with the test CI (not sure if this related to this PR):

In build test:

Version 3.7 was not found in the local cache
  Error: The version '3.7' with architecture 'x64' was not found for Ubuntu 24.04.
  The list of all available versions can be found here: https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json

It seems like these tests fail because they are out of date with some github action? Based on the setup-python manifest, I think if we downgrade the runner to Ubuntu 22.04 these tests might execute.

That was also a solution given by someone else who encountered this issue: actions/setup-python#962 (comment)

I can try this and add to my PR so we have a working CI - does it seem like an acceptable solution to diverge from 'ubuntu-latest' in the CI?

@123epsilon
Copy link
Copy Markdown
Contributor Author

There seems to be one final issue with the pyproject.toml for the new bloom filter dependency - will reach out to fix it

@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Jun 6, 2025

Thanks for the updates! The fixes to the GitHub action makes sense. I can upgrade the action in the future when it becomes clear no body is using Python 3.7.

https://devguide.python.org/versions/

@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Jun 19, 2025

Some notes from copilot on the failing job:

The failure is due to an invalid configuration in the pyproject.toml file of the pybloomfilter3 dependency:

Error:
configuration error: project.license must be valid exactly by one definition (2 matches found)

This means the project.license field in pybloomfilter3's pyproject.toml is formatted incorrectly (e.g., both a string and a table are set, or the structure is not as expected by build tools).

Solution options:

  1. Use a Fixed Version (if available)
  • If there is a newer pybloomfilter3 version that fixes the pyproject.toml, update your dependency to use that version.
  1. Patch the Dependency
  • Fork pybloomfilter3, fix the pyproject.toml (make sure project.license is valid—use either:
    project.license = "MIT"
    or
    [project.license]
    text = "MIT"
    depending on the expected format), and use your fork in requirements.
  1. Pin an Older Version
  • If a previously working version exists, pin to that until upstream is fixed.

Example: To patch pyproject.toml in your fork, change:

project.license = { text = "MIT", file = "LICENSE" }

to just:

[project.license]
text = "MIT"

Or if the build expects a string:

project.license = "MIT"

Update your requirements or pyproject.toml to use your fixed version or fork. If you need help editing your workflow or requirements files, let me know which approach you want to use!

@123epsilon
Copy link
Copy Markdown
Contributor Author

@ekzhu This is wierd, python -m pytest test/test_minhash.py passes on my local, but it seems that one of the bBitMinHash tests is failing? Specifically the pickle one:

test/test_lshbloom.py::TestMinHashLSHBloom::test_init
test/test_lshbloom.py::TestMinHashLSHBloom::test_insert
test/test_lshbloom.py::TestMinHashLSHBloom::test_query
  /home/runner/work/datasketch/datasketch/datasketch/lsh_bloom.py:197: RuntimeWarning: Creating LSHBloom index without save directory, this index will not be persisted.
    warnings.warn("Creating LSHBloom index without save directory, this index will not be persisted.", RuntimeWarning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED test/test_minhash.py::TestbBitMinHash::test_pickle - AssertionError: <datasketch.b_bit_minhash.bBitMinHash object at 0x7ff295716c00> != <datasketch.b_bit_minhash.bBitMinHash object at 0x7ff295716940>
============ 1 failed, 154 passed, 31 skipped, 8 warnings in 34.57s ============

@123epsilon
Copy link
Copy Markdown
Contributor Author

Update @ekzhu I finally fixed it! I made a separate PR here: #248 to address the pickling issue. This was a subtle one. It already passes all the tests and its only a few lines so should be straight forward to review :)

That PR directly blocks this one, so once its approved and I rebase here this branch should pass all tests as well.

@123epsilon 123epsilon requested a review from ekzhu September 15, 2025 21:26
@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Nov 4, 2025

Thanks @123epsilon.

Looks like there is still some issue with the pyproject.toml file.

Also, are you interested in becoming a collaborator? As you have probably noticed, I need help with reviewing PR and maintaining the project. See #252

@123epsilon
Copy link
Copy Markdown
Contributor Author

@ekzhu Yes I see, I think some upstream changes to pybloomfilter may have reverted the changes I made to the pyproject.toml - I'll fix those up so we can wrap up this PR.

And yes, I'd love to be a collaborator on this repo - I see that alot of people have expressed interest in the things you'd mentioned in #252 above. When I have some free time I could consider looking into GPU acceleration - particularly for MinHashing which in my experience is the major bottleneck in most of my workflows. Additionally - how would you feel about the possibility of a Rust backend? No promises on timeline there, but I've noticed that Rust tends to be blazing fast and simple to parallelize for the kinds of computations we're doing here.

@123epsilon
Copy link
Copy Markdown
Contributor Author

Ok @ekzhu, they dropped support for <3.9 in the latest version of the Bloom filter library we were using so I pinned to an earlier commit which should be fine for the forseeable future (the incompatibility seems to mainly be in build system rather than code - though there are some Cython considerations).

CI is green :)

@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Nov 4, 2025

how would you feel about the possibility of a Rust backend?

Interesting idea! I am curious how things are going to integrate together. Maybe you can participate in the discussion there.

@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Nov 5, 2025

@123epsilon looks like there is some conflict -- I resolved them

@ekzhu ekzhu merged commit a2a4db2 into ekzhu:master Nov 5, 2025
6 checks passed
@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Nov 27, 2025

@123epsilon do you have an x.com and/ or LinkedIn handle? I would like to post about your contribution.

@123epsilon
Copy link
Copy Markdown
Contributor Author

@123epsilon do you have an x.com and/ or LinkedIn handle? I would like to post about your contribution.

That would be awesome! I see you've found my linkedin: https://www.linkedin.com/in/arhammkhan/

I have an X acount here: https://x.com/arhamk1216

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants