Conversation
ekzhu
left a comment
There was a problem hiding this comment.
Looks great! For the test failures, could you see if it's due to missing dependencies?
This is a bit confusing, but it seems that there is an issue with CPython and I see that this issue has cropped up for some version fo python before: prashnts/pybloomfiltermmap3#55 There is now another fork that's maintained with 3.12+, so I'll update the deps to use that one instead. |
|
Also, @ekzhu there seems to be an issue with the test CI (not sure if this related to this PR): In build test: It seems like these tests fail because they are out of date with some github action? Based on the setup-python manifest, I think if we downgrade the runner to Ubuntu That was also a solution given by someone else who encountered this issue: actions/setup-python#962 (comment) I can try this and add to my PR so we have a working CI - does it seem like an acceptable solution to diverge from 'ubuntu-latest' in the CI? |
|
There seems to be one final issue with the |
|
Thanks for the updates! The fixes to the GitHub action makes sense. I can upgrade the action in the future when it becomes clear no body is using Python 3.7. |
|
Some notes from copilot on the failing job:
|
|
@ekzhu This is wierd, |
|
Update @ekzhu I finally fixed it! I made a separate PR here: #248 to address the pickling issue. This was a subtle one. It already passes all the tests and its only a few lines so should be straight forward to review :) That PR directly blocks this one, so once its approved and I rebase here this branch should pass all tests as well. |
|
Thanks @123epsilon. Looks like there is still some issue with the pyproject.toml file. Also, are you interested in becoming a collaborator? As you have probably noticed, I need help with reviewing PR and maintaining the project. See #252 |
|
@ekzhu Yes I see, I think some upstream changes to pybloomfilter may have reverted the changes I made to the pyproject.toml - I'll fix those up so we can wrap up this PR. And yes, I'd love to be a collaborator on this repo - I see that alot of people have expressed interest in the things you'd mentioned in #252 above. When I have some free time I could consider looking into GPU acceleration - particularly for MinHashing which in my experience is the major bottleneck in most of my workflows. Additionally - how would you feel about the possibility of a Rust backend? No promises on timeline there, but I've noticed that Rust tends to be blazing fast and simple to parallelize for the kinds of computations we're doing here. |
|
Ok @ekzhu, they dropped support for <3.9 in the latest version of the Bloom filter library we were using so I pinned to an earlier commit which should be fine for the forseeable future (the incompatibility seems to mainly be in build system rather than code - though there are some Cython considerations). CI is green :) |
Interesting idea! I am curious how things are going to integrate together. Maybe you can participate in the discussion there. |
|
@123epsilon looks like there is some conflict -- I resolved them |
|
@123epsilon do you have an x.com and/ or LinkedIn handle? I would like to post about your contribution. |
That would be awesome! I see you've found my linkedin: https://www.linkedin.com/in/arhammkhan/ I have an X acount here: https://x.com/arhamk1216 |
This PR adds the LSHBloom algorithm to datasketch along with tests, docs, and a simple benchmark that compares LSH to LSHBloom across various values of the new parameter
fp, which controls the false positive rate of the Bloom filter index that LSHBloom uses to save space.@ekzhu