feat(minhash): optional CuPy backend for MinHash.update_batch by dipeshbabu · Pull Request #286 · ekzhu/datasketch

dipeshbabu · 2025-11-12T03:19:43Z

Add an opt-in GPU backend for MinHash.update_batch using CuPy.

Hashing and permutation generation remain on CPU and unchanged.
When enable_gpu() is called and CuPy is available, the internal
(hv * a + b) % _mersenne_prime & _max_hash computation and the
per-column min reduction are performed on the GPU.
The CPU code path is preserved exactly as before, and hashvalues
remain a NumPy uint64 array so downstream code (LSH, LSH Forest,
LSH Ensemble, storage) is unaffected.
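
The arithmetic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from this PR: the function name `permute_and_reduce`, the `use_gpu` argument, and the upper-case constants are hypothetical stand-ins, and the GPU branch assumes CuPy mirrors the NumPy calls used here. The result always comes back as a NumPy `uint64` array, as the description requires.

```python
import numpy as np

try:
    import cupy as cp  # optional; only the GPU branch needs it
except ImportError:
    cp = None

# Hypothetical stand-ins for the _mersenne_prime and _max_hash constants.
MERSENNE_PRIME = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)


def permute_and_reduce(hv, a, b, hashvalues, use_gpu=False):
    """Apply (hv * a + b) % p & max_hash, then fold column minima into hashvalues.

    hv: (n,) hashed input values; a, b: (num_perm,) permutation parameters;
    hashvalues: (num_perm,) current signature. Returns a NumPy uint64 array.
    """
    if len(hv) == 0:
        return np.array(hashvalues, dtype=np.uint64)

    # Pick the array module: CuPy only when requested and importable.
    xp = cp if (use_gpu and cp is not None) else np
    hv_x = xp.asarray(hv, dtype=xp.uint64)
    a_x = xp.asarray(a, dtype=xp.uint64)
    b_x = xp.asarray(b, dtype=xp.uint64)

    # Broadcast to shape (n, num_perm); uint64 arithmetic wraps on overflow.
    phv = (hv_x[:, None] * a_x[None, :] + b_x[None, :]) % MERSENNE_PRIME
    phv = phv & MAX_HASH

    # Per-column min reduction, then copy back to the host if needed.
    col_min = phv.min(axis=0)
    if xp is not np:
        col_min = cp.asnumpy(col_min)
    return np.minimum(np.asarray(hashvalues, dtype=np.uint64), col_min)
```

Because only the column minima are copied back, the device-to-host transfer in this sketch is `num_perm` values per call, regardless of batch size.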

Includes optional tests (guarded by pytest.importorskip("cupy"))
that verify CPU and GPU produce identical hashvalues for the same
seed, num_perm, and input batches.

ekzhu

Can we add a benchmark script to show the performance comparison -- just use randomly generated data.
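
A benchmark of the kind requested here could be structured roughly as follows. This is a hedged sketch, not the script added in this PR: `cpu_kernel` and `bench` are hypothetical names, the kernel is a NumPy stand-in for the permutation and min-reduction step rather than a call into datasketch, and the `num_perm` values are assumptions (the batch sizes match the ones reported later in this thread).

```python
import time

import numpy as np

MERSENNE_PRIME = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)


def cpu_kernel(hv, a, b):
    """NumPy stand-in for the permutation and per-column min step."""
    phv = ((hv[:, None] * a[None, :] + b[None, :]) % MERSENNE_PRIME) & MAX_HASH
    return phv.min(axis=0)


def bench(kernel, n, num_perm, repeats=3, seed=1):
    """Return the best wall-clock time (seconds) of kernel on random data."""
    rng = np.random.RandomState(seed)
    # Randomly generated 32-bit hash values and permutation parameters.
    hv = rng.randint(0, 1 << 32, size=n, dtype=np.uint64)
    a = rng.randint(1, int(MERSENNE_PRIME), size=num_perm, dtype=np.uint64)
    b = rng.randint(0, int(MERSENNE_PRIME), size=num_perm, dtype=np.uint64)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(hv, a, b)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    for n in (1000, 10000, 50000):
        for num_perm in (64, 256):
            elapsed = bench(cpu_kernel, n, num_perm)
            print(f"n={n:>6} num_perm={num_perm:>4} best={elapsed:.4f}s")
```

A GPU kernel passed to `bench` would need to wait for the device before the timer stops (for example with `cp.cuda.Stream.null.synchronize()`), otherwise only the kernel launch is measured.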

Copilot

Pull Request Overview

This PR adds an optional GPU backend for MinHash.update_batch() using CuPy, enabling GPU acceleration for the permutation computation and min reduction steps while keeping hashing and permutation generation on the CPU. The implementation preserves backward compatibility by making GPU support opt-in via an enable_gpu() method.

Key Changes

- Added CuPy import with graceful fallback when unavailable
- Implemented enable_gpu() method to opt into GPU acceleration
- Refactored update_batch() to support both CPU and GPU code paths while maintaining identical results
- Added comprehensive GPU tests verifying CPU/GPU output equivalence

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| `datasketch/minhash.py` | Added optional CuPy import, `enable_gpu()` method, `_use_gpu` flag, and dual CPU/GPU paths in `update_batch()` |
| `test/test_minhash_gpu.py` | New test suite validating that the GPU implementation produces results identical to the CPU for various scenarios |

dipeshbabu · 2025-11-16T23:15:38Z

> Can we add a benchmark script to show the performance comparison -- just use randomly generated data.

Yes, I am adding a benchmark script as well.

ekzhu

Can we show the benchmark plot in the PR description?

dipeshbabu · 2025-11-18T01:25:08Z

GPU vs CPU Benchmarks (update_batch)

GPU starts to help at larger batches and higher num_perm.

[Figure: MinHash GPU Overview]

Per-size breakdowns:

[Figures: n=1000, n=10000, n=50000]

ekzhu · 2025-11-18T03:23:19Z

@dipeshbabu, thanks for the figures! Can we update the docs (https://github.com/ekzhu/datasketch/blob/master/docs/minhash.rst) and include these figures? I think we can have two plots:

- runtime comparison for CPU vs GPU over different num_perm, with fixed n
- runtime comparison for CPU vs GPU over different n, with fixed num_perm

dipeshbabu · 2025-11-23T21:17:28Z

> @dipeshbabu, thanks for the figures! Can we update the docs (https://github.com/ekzhu/datasketch/blob/master/docs/minhash.rst) and include these figures? I think we can have two plots:
>
> - runtime comparison for CPU vs GPU over different num_perm, with fixed n
> - runtime comparison for CPU vs GPU over different n, with fixed num_perm

I have updated the docs with the mentioned figures. Could you review the PR now? @ekzhu

ekzhu · 2025-11-24T04:06:41Z

@dipeshbabu thanks for the update! Can you keep only the two figures that were used in the doc?

dipeshbabu · 2025-11-25T22:39:37Z

> @dipeshbabu thanks for the update! Can you keep only the two figures that were used in the doc?

@ekzhu Can you review it now? I updated it and also added a code snippet for using the GPU with MinHash.

dipeshbabu and others added 4 commits November 11, 2025 19:53

fix: python version in readme

14551a0

feat: add optional CuPy backend for update_batch

b838e67

Merge branch 'master' into feature/gpu-integration

5c2882c

Merge branch 'master' into feature/gpu-integration

352e973

ekzhu requested a review from Copilot November 16, 2025 22:41

Copilot started reviewing on behalf of ekzhu November 16, 2025 22:41

Copilot finished reviewing on behalf of ekzhu November 16, 2025 22:44

ekzhu reviewed Nov 16, 2025

Copilot AI reviewed Nov 16, 2025

feat: add benchmark for cpu vs gpu, docs, and all

0512c4a

dipeshbabu requested a review from ekzhu November 17, 2025 01:56

ekzhu reviewed Nov 17, 2025

dipeshbabu added 2 commits November 17, 2025 20:15

feat: update API doc string, different GPU settings

563936a

fix: per call override

b4d83ba

dipeshbabu requested a review from ekzhu November 18, 2025 01:25

ekzhu reviewed Nov 18, 2025

feat: skip benchmark code block, add results with discussion, update …

d74c9ed

…docs

dipeshbabu requested a review from ekzhu November 18, 2025 05:05

Merge branch 'master' into feature/gpu-integration

c07f37b

ekzhu reviewed Nov 24, 2025

dipeshbabu added 2 commits November 23, 2025 23:30

feat: keeping only used plots and tiny code block for cpu/gpu usage

a032cb1

Merge branch 'feature/gpu-integration' of https://github.com/dipeshba…

566e5e9

…bu/datasketch into feature/gpu-integration

dipeshbabu requested a review from ekzhu November 24, 2025 04:32

ekzhu approved these changes Nov 25, 2025

ekzhu merged commit a902423 into ekzhu:master Nov 25, 2025
8 checks passed
