weighted min hash - minhash_many function #195
Open
Hey, thanks for this great project.
I want to use MinHash on my text embedding vectors, which contain both negative and positive numbers.
I searched the issues and found that weighted MinHash can be used for that.
I tried it and it actually works well.
My problem is with the minhash_many function: its result is different from that of the minhash function on the same vectors.
Below is a minimal script to reproduce the problem, plus a screenshot of the output so you can see it without running the code.
I want to use minhash_many because it is faster than calling minhash in a for loop.
So is this difference expected, or is it a bug?
Thanks.
from time import perf_counter as pc
import numpy as np
from datasketch import WeightedMinHashGenerator

# 20,000 random vectors of dimension 100 with values in [-1, 1)
vectors = np.random.uniform(-1, 1, (20000, 100))
mg = WeightedMinHashGenerator(vectors.shape[1], 32)

# Batch version: hash all vectors with a single minhash_many call
t0 = pc()
many_result = np.array(list(map(lambda x: x.digest(), mg.minhash_many(vectors))))
print(f'shape many: {many_result.shape}')
print(f'time many: {pc()-t0:.3f}')
print(f'many_result[0][:10]:\n{many_result[0][:10]}\n')

# Loop version: hash each vector separately with minhash
t0 = pc()
for_result = np.array(list(map(lambda x: mg.minhash(x).digest(), vectors)))
print(f'shape for: {for_result.shape}')
print(f'time for: {pc()-t0:.3f}')
print(f'for_result[0][:10]:\n{for_result[0][:10]}')
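Rather than comparing only the first ten values by eye, the two digest arrays could be compared row by row. The helper below is a minimal sketch using plain NumPy; `mismatch_summary` is a hypothetical name chosen for illustration and is not part of datasketch. It assumes both arrays have the same shape, with one digest per entry along the first axis.

```python
import numpy as np

def mismatch_summary(a, b):
    """Return (number of differing rows, index of the first differing row or None)."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # Flatten everything after the first axis so that each row is one digest.
    flat_a = a.reshape(a.shape[0], -1)
    flat_b = b.reshape(b.shape[0], -1)
    # A row differs if any of its entries differ.
    differs = (flat_a != flat_b).any(axis=1)
    n_diff = int(differs.sum())
    first = int(np.argmax(differs)) if n_diff else None
    return n_diff, first
```

With the script above, calling `mismatch_summary(many_result, for_result)` would show how many of the 20,000 digests disagree and where the first disagreement occurs.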
