weighted min hash - minhash_many function #195

@dopc

Hey, thanks for this great project.
I want to use MinHash for my text embedding vectors, which contain both negative and positive numbers.
I searched the issues and found that weighted MinHash can be used for that.
I tried it and it actually works well.

My problem is with the minhash_many function: its result is different from the result of the minhash function.
Below is minimal code to reproduce it, plus a screenshot showing the output so you don't have to run the code.

I want to use minhash_many because it is faster than a for loop.
Is this difference expected, or is it a bug?
Thanks.

from time import perf_counter as pc

import numpy as np
from datasketch import WeightedMinHashGenerator

# 20000 random vectors of dimension 100, with values in [-1, 1)
vectors = np.random.uniform(-1, 1, (20000, 100))

mg = WeightedMinHashGenerator(vectors.shape[1], 32)

# Batch version: hash all vectors with a single minhash_many call
t0 = pc()
many_result = np.array(list(map(lambda x: x.digest(), mg.minhash_many(vectors))))
print(f'shape many: {many_result.shape}')
print(f'time many: {pc()-t0:.3f}')
print(f'many_result[0][:10]:\n{many_result[0][:10]}\n')

# Loop version: hash the vectors one at a time with minhash
t0 = pc()
for_result = np.array(list(map(lambda x: mg.minhash(x).digest(), vectors)))
print(f'shape for: {for_result.shape}')
print(f'time for: {pc()-t0:.3f}')
print(f'for_result[0][:10]:\n{for_result[0][:10]}')
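Printing the first ten values only shows one digest. To see how many digests actually differ between the two code paths, a small NumPy helper could compare the two result arrays row by row. This is only a sketch: `matching_rows` is a hypothetical helper name, not part of datasketch, and the toy arrays below stand in for `many_result` and `for_result`.

```python
import numpy as np

def matching_rows(a, b):
    """Return a boolean array that is True where entry i of a equals entry i of b."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # Reduce over every axis except the first, so each digest yields one flag.
    return (a == b).all(axis=tuple(range(1, a.ndim)))

# Toy stand-ins (hypothetical data): three digests, each of shape (2, 2).
demo_many = np.array([[[1, 0], [4, 2]], [[3, 1], [0, 0]], [[2, 5], [7, 1]]])
demo_for = demo_many.copy()
demo_for[1, 0, 0] = 9  # make the second digest differ on purpose

same = matching_rows(demo_many, demo_for)
print(f"{same.sum()} of {same.size} digests match")
```

With the real data, the call would be `matching_rows(many_result, for_result)`. If every flag is False, the two functions disagree for all vectors rather than for a few.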

[Screenshot: output of the code above]
