Skip to content

Merging (Identically Specified) MinHashLSH objects #205

@hsicsa

Description

@hsicsa

Background:

When using a dataflow pipeline, the input stream (e.g. of documents) can be split to increase parallelism & throughput across multiple machines.
In order to do so, when computing the MinHashLSH of the stream, the MinHashLSH objects in two or more sub-streams must be merged.

Generally speaking, LSH implementations do not support merging two or more LSH states into a single LSH state.

However, MinHashLSH (as implemented in datasketch) does appear to be capable of doing so when all the parameters of the objects to be merged are identical. Beyond the threshold, # permutations of the minHash values, ad weights, when the number of hash tables and the associated set of hashvalue ranges are identical, and it seems reasonable to support the merging of the underlying hash tables.

Proposal:

Create a mergeMinHashLSH function that does the following:

  • Takes a list of MinHashLSH objects
  • Asserts that the initialization parameters are identical for all objects
  • Performs a binary merge (i.e. O(log N)) of the objects underlying hash tables, while updating the keys and hashvalues.
  • returns a newly instantiated MinHashLSH object

Additionally, a mergeMinHashLSH method could be added to the MinHashLSH class to perform merging in place.

Metadata

Metadata

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions