Skip to content

feat(minhash): optional CuPy backend for MinHash.update_batch#286

Merged
ekzhu merged 11 commits intoekzhu:masterfrom
dipeshbabu:feature/gpu-integration
Nov 25, 2025
Merged

feat(minhash): optional CuPy backend for MinHash.update_batch#286
ekzhu merged 11 commits intoekzhu:masterfrom
dipeshbabu:feature/gpu-integration

Conversation

@dipeshbabu
Copy link
Copy Markdown
Contributor

Add an opt-in GPU backend for MinHash.update_batch using CuPy.

  • Hashing and permutation generation remain on CPU and unchanged.
  • When enable_gpu() is called and CuPy is available, the internal
    (hv * a + b) % _mersenne_prime & _max_hash computation and the
    per-column min reduction are performed on the GPU.
  • The CPU code path is preserved exactly as before, and hashvalues
    remain a NumPy uint64 array so downstream code (LSH, LSH Forest,
    LSH Ensemble, storage) is unaffected.

Includes optional tests (guarded by pytest.importorskip("cupy"))
that verify CPU and GPU produce identical hashvalues for the same
seed, num_perm, and input batches.

Copy link
Copy Markdown
Owner

@ekzhu ekzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a benchmark script to show the performance comparison -- just use randomly generated data.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds an optional GPU backend for MinHash.update_batch() using CuPy, enabling GPU acceleration for the permutation computation and min reduction steps while keeping hashing and permutation generation on the CPU. The implementation preserves backward compatibility by making GPU support opt-in via an enable_gpu() method.

Key Changes

  • Added CuPy import with graceful fallback when unavailable
  • Implemented enable_gpu() method to opt into GPU acceleration
  • Refactored update_batch() to support both CPU and GPU code paths while maintaining identical results
  • Added comprehensive GPU tests verifying CPU/GPU output equivalence

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

File Description
datasketch/minhash.py Added optional CuPy import, enable_gpu() method, _use_gpu flag, and dual CPU/GPU paths in update_batch()
test/test_minhash_gpu.py New test suite validating GPU implementation produces identical results to CPU for various scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@dipeshbabu
Copy link
Copy Markdown
Contributor Author

Can we add a benchmark script to show the performance comparison -- just use randomly generated data.

Yes, I am adding benchmark script as well

@dipeshbabu dipeshbabu requested a review from ekzhu November 17, 2025 01:56
Copy link
Copy Markdown
Owner

@ekzhu ekzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we show the benchmark plot in the PR description?

@dipeshbabu
Copy link
Copy Markdown
Contributor Author

dipeshbabu commented Nov 18, 2025

GPU vs CPU Benchmarks (update_batch)

GPU starts to help at larger batches and higher num_perm.

MinHash GPU Overviewminhash_gpu_overview

Per-size breakdowns:

n=1000minhash_gpu_size_1000

n=10000minhash_gpu_size_10000

n=50000minhash_gpu_size_50000

@dipeshbabu dipeshbabu requested a review from ekzhu November 18, 2025 01:25
@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Nov 18, 2025

@dipeshbabu , thanks for the figures! Can we update docs (https://github.com/ekzhu/datasketch/blob/master/docs/minhash.rst) and include these figures? I think we can have two plots:

  1. runtime comparison for CPU vs GPU over different num_perm, with fixed n
  2. runtime comparison for CPU vs GPU over different n, with fixed num_perm.

@dipeshbabu dipeshbabu requested a review from ekzhu November 18, 2025 05:05
@dipeshbabu
Copy link
Copy Markdown
Contributor Author

@dipeshbabu , thanks for the figures! Can we update docs (https://github.com/ekzhu/datasketch/blob/master/docs/minhash.rst) and include these figures? I think we can have two plots:

  1. runtime comparison for CPU vs GPU over different num_perm, with fixed n
  2. runtime comparison for CPU vs GPU over different n, with fixed num_perm.

I have updated the docs with mentioned figures. Could you review the PR now? @ekzhu

@ekzhu
Copy link
Copy Markdown
Owner

ekzhu commented Nov 24, 2025

@dipeshbabu thanks for the update! Can you keep only the two figures that was used in the doc?

@dipeshbabu dipeshbabu requested a review from ekzhu November 24, 2025 04:32
@dipeshbabu
Copy link
Copy Markdown
Contributor Author

@dipeshbabu thanks for the update! Can you keep only the two figures that was used in the doc?

@ekzhu Can you review it now? I updated and also added code snippet for using GPU with MinHash.

@ekzhu ekzhu merged commit a902423 into ekzhu:master Nov 25, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants