feat(minhash): optional CuPy backend for MinHash.update_batch#286
ekzhu merged 11 commits into ekzhu:master
Conversation
ekzhu
left a comment
Can we add a benchmark script to show the performance comparison -- just use randomly generated data.
Pull Request Overview
This PR adds an optional GPU backend for MinHash.update_batch() using CuPy, enabling GPU acceleration for the permutation computation and min reduction steps while keeping hashing and permutation generation on the CPU. The implementation preserves backward compatibility by making GPU support opt-in via an enable_gpu() method.
Key Changes
- Added CuPy import with graceful fallback when unavailable
- Implemented enable_gpu() method to opt into GPU acceleration
- Refactored update_batch() to support both CPU and GPU code paths while maintaining identical results
- Added comprehensive GPU tests verifying CPU/GPU output equivalence
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| datasketch/minhash.py | Added optional CuPy import, enable_gpu() method, _use_gpu flag, and dual CPU/GPU paths in update_batch() |
| test/test_minhash_gpu.py | New test suite validating GPU implementation produces identical results to CPU for various scenarios |
Yes, I am adding a benchmark script as well.
ekzhu
left a comment
Can we show the benchmark plot in the PR description?
@dipeshbabu, thanks for the figures! Can we update the docs (https://github.com/ekzhu/datasketch/blob/master/docs/minhash.rst) to include these figures? I think we can have two plots.
I have updated the docs with the mentioned figures. Could you review the PR now? @ekzhu
@dipeshbabu thanks for the update! Can you keep only the two figures that were used in the doc?
@ekzhu Can you review it now? I updated it and also added a code snippet for using GPU with MinHash.
Add an opt-in GPU backend for MinHash.update_batch using CuPy.
The (hv * a + b) % _mersenne_prime & _max_hash permutation computation
and the per-column min reduction are performed on the GPU; hashing and
permutation generation stay on the CPU. Hashvalues remain a NumPy
uint64 array, so downstream code (LSH, LSH Forest, LSH Ensemble,
storage) is unaffected.
Includes optional tests (guarded by pytest.importorskip("cupy"))
that verify CPU and GPU produce identical hashvalues for the same
seed, num_perm, and input batches.