Skip to content

Implementing MinHash retrieval from keys for MinHashLSHForest #233

@123epsilon

Description

@123epsilon

Hi, I've been using the MinHashLSHForest class to do some deduplication, and part of that pipeline is to retrieve the top-k items and then estimate the Jaccard similarities with each of those items, obviously this requires reconstructing the MinHash object related to the given key returned by MinHashLSHForest.query().

This seems like a decently common use-case since we often screen the results of LSH Forest using the Jaccard Similarity, my question is do you feel that this is common enough to implement such a function as a part of the MinHashLSHForest class?

I've implemented a simple way to recompute the original hashvalues array from the keys dictionary in MinHashLSHForest as follows, I'd be happy to submit a PR but just wanted to know how this aligned with the vision for this package

"""
Takes the list of bytes-like generated by the LSH Forest 
that corresponds to some given key and recovers the hashvalues
which can be converted back into a MinHash to compute Jaccard Similarity

Given a number of prefix trees, L, when we insert a (key, MinHash) pair 
the LSH Forest creates L byteslike items each corresponding to a range of 
hashvalues from the original MinHash object for a given key. Each range is of
size num_perm / L. Therefore here we convert these items from byteslikes back into
arrays of unsigned integers and then concatenate them so that they are in a representation
that we can build a MinHash object with. Namely, we return an array of unsigned integers
of length num_perm that represent hashvalues from each of num_perm hash functions
chosen during the MinHash creation.
"""
def byteslist_to_hashvalues(byteslist):
    hashvalues = np.array([], dtype=np.uint64)
    for item in byteslist:
        # unswap the bytes, as their representation is flipped during storage
        hv_segment = np.frombuffer(item, dtype=np.uint64).byteswap()
        hashvalues = np.append(hashvalues, hv_segment)
    
    return hashvalues

where we might call this by using

lsh = MinHashLSHForest(...)
hv = byteslist_to_hashvalues(lsh.keys[mykey])
mh = MinHash(hashvalues=hv)

# now use mh.jaccard() ...
...

A unit test would involve inserting a MinHash into MinHashLSHForest and then reconstructing it and checking for jaccard_sim == 1.0.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions