-
Notifications
You must be signed in to change notification settings - Fork 315
Implementing MinHash retrieval from keys for MinHashLSHForest #233
Description
Hi, I've been using the MinHashLSHForest class to do some deduplication, and part of that pipeline is to retrieve the top-k items and then estimate the Jaccard similarities with each of those items, obviously this requires reconstructing the MinHash object related to the given key returned by MinHashLSHForest.query().
This seems like a decently common use-case since we often screen the results of LSH Forest using the Jaccard Similarity, my question is do you feel that this is common enough to implement such a function as a part of the MinHashLSHForest class?
I've implemented a simple way to recompute the original hashvalues array from the keys dictionary in MinHashLSHForest as follows, I'd be happy to submit a PR but just wanted to know how this aligned with the vision for this package
"""
Takes the list of bytes-like generated by the LSH Forest
that corresponds to some given key and recovers the hashvalues
which can be converted back into a MinHash to compute Jaccard Similarity
Given a number of prefix trees, L, when we insert a (key, MinHash) pair
the LSH Forest creates L byteslike items each corresponding to a range of
hashvalues from the original MinHash object for a given key. Each range is of
size num_perm / L. Therefore here we convert these items from byteslikes back into
arrays of unsigned integers and then concatenate them so that they are in a representation
that we can build a MinHash object with. Namely, we return an array of unsigned integers
of length num_perm that represent hashvalues from each of num_perm hash functions
chosen during the MinHash creation.
"""
def byteslist_to_hashvalues(byteslist):
hashvalues = np.array([], dtype=np.uint64)
for item in byteslist:
# unswap the bytes, as their representation is flipped during storage
hv_segment = np.frombuffer(item, dtype=np.uint64).byteswap()
hashvalues = np.append(hashvalues, hv_segment)
return hashvalueswhere we might call this by using
lsh = MinHashLSHForest(...)
hv = byteslist_to_hashvalues(lsh.keys[mykey])
mh = MinHash(hashvalues=hv)
# now use mh.jaccard() ...
...A unit test would involve inserting a MinHash into MinHashLSHForest and then reconstructing it and checking for jaccard_sim == 1.0.