Skip to content

[MRG+1] Locality Sensitive Hashing for approximate nearest neighbor search#3894

Closed
maheshakya wants to merge 131 commits intoscikit-learn:masterfrom
maheshakya:lsh_forest
Closed

[MRG+1] Locality Sensitive Hashing for approximate nearest neighbor search#3894
maheshakya wants to merge 131 commits intoscikit-learn:masterfrom
maheshakya:lsh_forest

Conversation

@maheshakya
Copy link
Copy Markdown
Contributor

New PR created from #3304

Rewritten _bisect_right with numpy searchsorted. Updated examples.
Replaced numpy searchsorted  in _bisect_right with the
previous version.
Insert operation allows to insert new data points into the fitted set of trees.
(Can be used in incremental learning? )

Changed parameter m to n_neighbors.

Changed parameter m to n_neighbors.
For _bisect_right() function, a transformed x is passed. The transformation
will replace the characters after h hash length with '1's.

Used random_projections module.
GuassianRandomProjections in random_projections module is used to perform the
hashing for Random projections LSH method.
GuassianRandomProjections in random_projections module is used to perform the
hashing for Random projections LSH method.
Removed lshashinng in feature extraction and add that funtionality in
the LSHForest class. If other hashing algorithms are to be implemented,
a separate lshashing class may be required.
@maheshakya
Copy link
Copy Markdown
Contributor Author

All right.

@maheshakya
Copy link
Copy Markdown
Contributor Author

Raising the warning with stacklevel=2 doesn't seem to provide any more information about the warning.

@GaelVaroquaux
Copy link
Copy Markdown
Member

Raising the warning with stacklevel=2 doesn't seem to provide any more
information about the warning.

No, that's not what stacklevel does.

@jnothman
Copy link
Copy Markdown
Member

It might give you a better indication of which function call results in the warning, though.

@maheshakya
Copy link
Copy Markdown
Contributor Author

Yes. it's _fast_dot in extmath where numpy version is less than 1.7.2. Failure occurs in the configuration where numpy version = 1.6.2. The dot product is done in the transformation of query to hashed query using GaussianRandomProjectionHash.transform. In this, transformed queries have dtype=uint32. But _fast_dot only allows dtypes of 32 and 64 bit float. This is the reason for this warning. But dtype needs to be uint32 for the algorithm to work. What can we do about this?

@maheshakya
Copy link
Copy Markdown
Contributor Author

And this is happened before the expected warning in _get_candidates method.

@jnothman
Copy link
Copy Markdown
Member

Surely the dot product is happening in the random projection process, not in the transformation to a hash? And even if the LHS of the dot product is integer queries, the RHS should be a gaussian-distributed hyperplane matrix, i.e. float-valued. So this must be a fixed bug in numpy that should cast the input to compatible types before calling BLAS. The correct solution in this case should be to modify assert_warns_message so that its behaviour is like that of assert_warns: it shouldn't care about order.

@maheshakya
Copy link
Copy Markdown
Contributor Author

Seems like that. I'll fix assert_warns_message.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.04%) when pulling 7802214 on maheshakya:lsh_forest into 15157f9 on scikit-learn:master.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be if not message_found:

@maheshakya
Copy link
Copy Markdown
Contributor Author

@jnothman shall I change this to [MRG+1]?

@jnothman
Copy link
Copy Markdown
Member

Yes!

@maheshakya
Copy link
Copy Markdown
Contributor Author

Thanks.

@maheshakya maheshakya changed the title [MRG] Locality Sensitive Hashing for approximate nearest neighbor search [MRG+1] Locality Sensitive Hashing for approximate nearest neighbor search Dec 18, 2014
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no /tmp on Windows machines, probably.

@larsmans
Copy link
Copy Markdown
Member

I'm rebasing this. I figured I could just merge this despite the number of commits, but some of the log messages (e.g. 802ed5f) don't reflect the changes made in the commits.

@larsmans larsmans mentioned this pull request Dec 18, 2014
@larsmans
Copy link
Copy Markdown
Member

Squashed to 62 commits and merged as #3980. Fixed the issues that @GaelVaroquaux and I found. Thanks @maheshakya!

@larsmans larsmans closed this Dec 18, 2014
@maheshakya
Copy link
Copy Markdown
Contributor Author

You are welcome @larsmans

@maheshakya maheshakya deleted the lsh_forest branch December 19, 2014 17:01
@jnothman
Copy link
Copy Markdown
Member

Congrats @maheshakya!!

@maheshakya
Copy link
Copy Markdown
Contributor Author

Thank you @jnothman , I'm really grateful to you for reviewing and suggestions for improvements and for the support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants