
[MRG] Locality Sensitive Hashing for approximate nearest neighbor search (GSoC) #3304

Closed
maheshakya wants to merge 119 commits into scikit-learn:master from maheshakya:lsh_forest

Conversation

@maheshakya
Contributor

No description provided.

Member

Please store all arguments to the constructor just as they are passed in. Please see other uses of check_random_state and where it is called.
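The convention referenced here can be sketched as follows. This is a minimal, hypothetical estimator (not the PR's actual class), with a numpy-only stand-in for `sklearn.utils.check_random_state` so the sketch is self-contained:

```python
import numpy as np

def check_random_state(seed):
    # Minimal stand-in for sklearn.utils.check_random_state: turn
    # None, an int seed, or a RandomState into a RandomState instance.
    if seed is None:
        return np.random.RandomState()
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)

class LSHForestSketch:
    # Hypothetical estimator illustrating the convention.
    def __init__(self, n_estimators=10, random_state=None):
        # Store constructor arguments exactly as passed in --
        # no validation, no check_random_state call here.
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X):
        # Only at fit time is the stored value converted.
        rng = check_random_state(self.random_state)
        self.hyperplanes_ = rng.randn(self.n_estimators, X.shape[1])
        return self
```

Keeping `__init__` free of any conversion means `get_params`/`set_params` and cloning see exactly what the user passed.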

Contributor Author

Done.

Replaced numpy searchsorted in _bisect_right with the previous version.
@maheshakya
Contributor Author

@jnothman, I'm adding an insert operation to this data structure. I suppose that could help with incremental learning.

The insert operation allows new data points to be inserted into the fitted set of trees.
(Can be used in incremental learning?)
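The insert idea could be sketched like this. This is a hypothetical helper: `sorted_hashes` and `indices` stand in for one fitted tree's sorted hash array and the sample indices it points to, not the PR's actual internals:

```python
import numpy as np

def insert_point(sorted_hashes, indices, new_hash, new_index):
    # Find the position that keeps the tree's hash array sorted, so
    # binary-search queries keep working after the insert.
    pos = np.searchsorted(sorted_hashes, new_hash)
    return (np.insert(sorted_hashes, pos, new_hash),
            np.insert(indices, pos, new_index))
```

Each tree in the forest would apply this independently for every new sample; note `np.insert` copies the array, so each insert is O(n).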

Changed parameter m to n_neighbors.
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling 802ed5f on maheshakya:lsh_forest into aaefdbd on scikit-learn:master.

@jnothman
Member

Did I say something about incremental learning?

@maheshakya
Contributor Author

Yes.
Check issue #3175

@jnothman
Member

Ahh =) That was merely a ping.


Member

You might want to have a look at the random projection module.

Member

A random sign projection transformer could have its place in that module.

Member

Try this first: http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html
As a second optimisation, consider how it might be possible to compute all the trees (and so on) in one numpy operation, to get rid of the per-tree loop. Dot products are your friend!

Contributor Author

Yes, if all the hash functions are computed in advance, I think it's possible to get rid of the loop. I'll give it a try.
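Computing every tree's hash bits in a single dot product, as suggested, might look like this (an illustrative sketch, not the PR's implementation; all names are assumptions):

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_features, n_trees, hash_size = 6, 5, 3, 32

X = rng.randn(n_samples, n_features)
# Stack every tree's hyperplanes into one (n_features, n_trees * hash_size)
# matrix so one matrix product replaces the per-tree loop.
W = rng.randn(n_features, n_trees * hash_size)

# Sign of each projection gives one hash bit, for all trees at once.
bits = (np.dot(X, W) > 0).astype(np.uint8)
hashes = bits.reshape(n_samples, n_trees, hash_size)
```

Each `hashes[i, t]` row is then the `hash_size`-bit fingerprint of sample `i` in tree `t`.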

For the _bisect_right() function, a transformed x is passed. The transformation replaces the characters after hash length h with '1's.

Used random_projection module.
GaussianRandomProjection in the random_projection module is used to perform the hashing for the random projections LSH method.
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling d8e521b on maheshakya:lsh_forest into 82611e8 on scikit-learn:master.

Removed lshashing in feature extraction and added that functionality to the LSHForest class. If other hashing algorithms are to be implemented, a separate lshashing class may be required.
@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling 57d9412 on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Nov 5, 2014

Is there an easy way to tell python to prefer the version in the current directory?

Honestly it's better to be explicit and install the version you want (or learn to use virtualenv if you want to switch between many different versions of a project on the same host).

To find the folder of the version that gets picked up when you import sklearn just do:

python -c "import sklearn; print(sklearn.__path__[0])"

Check that this matches what pip sees:

pip show scikit-learn

If not, it means that the pip command that you have in your PATH does not use the python command that you have in your PATH.

Uninstall previous versions of scikit-learn with:

pip uninstall scikit-learn

If you don't have pip installed you can delete the folder returned by python -c "import sklearn; print(sklearn.__path__[0])".

Then do:

pip show scikit-learn
python -c "import sklearn; print(sklearn.__path__[0])"

to check that you no longer have any version of scikit-learn installed for this combo of python & pip.

Then go into the scikit-learn source folder, checkout this branch and do:

python setup.py build_ext --inplace  # builds the compiled extension in the source folder
python setup.py develop

or alternatively:

 pip install --editable /path/to/scikit-learn

In both cases this should install scikit-learn in development mode, meaning that when you import sklearn it should use the live source code from your local git repository.

Again you can check with:

python -c "import sklearn; print(sklearn.__path__[0])"
pip show scikit-learn

@ogrisel
Member

ogrisel commented Nov 5, 2014

In the meantime I can throw out a hypothesis: under the cosine metric, if the blobs are all far from zero, then most hyperplanes do not separate the data within a blob at all, or not much. High dimensions are weird. In this case, many points would clash even on 32 hash bits.

This is what I suspect. Maybe thresholding at np.median(X_projected[:1000], axis=0) instead of 0 might help find better splits. Otherwise, we could use random samples from X as hyperplanes instead of random Gaussian vectors, possibly combined with the median intercept.

In this case an implementation should still only look at a bounded number of candidates, thus drop accuracy, not speed, IMO.

Not sure: if samples all collide in 10 buckets (e.g. uniformly) instead of 2 ** 32, we should get an approximately constant ~10 / n_estimators x speedup w.r.t. brute force, and therefore linear scaling.

Or it could be a plain bug, etc.

This is always a possibility.
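The median-threshold idea above can be sketched as follows (illustrative only; the data and array names are assumptions, not the PR's code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 5) + 10.0      # a blob far from the origin
W = rng.randn(5, 32)               # random Gaussian hyperplanes
X_projected = np.dot(X, W)

# Thresholding at 0 tends to make each bit nearly constant for a
# distant blob, since the projections rarely change sign; thresholding
# at the per-component median of a data sample splits every bit
# roughly 50/50 by construction.
zero_bits = X_projected > 0
median_bits = X_projected > np.median(X_projected[:1000], axis=0)
```

The median split maximises the entropy of each bit on the sampled data, at the cost of making the hash family data-dependent.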

@daniel-vainsencher

Sorry, I cannot help with the practical investigation.

Defining data-dependent hash families is, to my knowledge, more of an art than a science, and out of scope for this PR. Unless practical testing turns up a bug, I propose to document that implementation details limit the speed-up to an application-dependent constant, so YMMV.

Another relatively easy option is to go to 64 hash bits (but I would still not do it for this PR).

Daniel


Member

*dimension

@ogrisel
Member

ogrisel commented Nov 14, 2014

@maheshakya Please consider the following doc improvement: maheshakya#10

@ogrisel
Member

ogrisel commented Nov 14, 2014

Defining data dependent hash families, to my knowledge, is more of an art than a science and out of scope for this PR.

Agreed. Although it might be interesting to consider the refactoring proposed by @jnothman as a means to make it possible for users to implement such data-dependent hash families themselves. In particular I would like to experiment with combining RandomTreesEmbedding and sparse random projections.

Unless practical testing turns up a bug, I propose to document that implementation details limit the speed up to an application dependent constant, so YMMV.

Yes after more testing with tweaking the parameters of the scalability example I think that the observed profile is ok. I updated the doc to reflect my findings in the changes I proposed in maheshakya#10.

Another relatively easy option is to go to 64 hash bits (but I would still not do it for this PR).

I have tried (this is easy with a2fe63a, just replace '>u4' by '>u8'): it does not seem to impact the scalability profile significantly but using 64 bit integers has a performance overhead. Let's stick to 32 bit for now.

Member

You need an arg y=None here and in transform to stop Travis complaining

Contributor Author

Thanks.

A test still fails because the fitted X does not have a dimensionality that is a multiple of 8. Perhaps we should remove the ValueError, since we are using the predefined hash size MAX_HASH_SIZE = np.dtype(HASH_DTYPE).itemsize * 8. But it may be a problem if this is being used somewhere else.
Is there a workaround for this?

Member

X need not have a dimensionality of a multiple of 8. GaussianProjections().fit_transform(X) must have a multiple of 8 features. I don't understand why the error could be firing if n_components is defined to be a multiple of 8.

Do not remove the exception. The usage of this will change over time, and it is an important assertion, without which packbits will give us something useless (because hash boundaries will be mid-byte).
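The constraint being discussed can be demonstrated directly with a minimal sketch: `np.packbits` packs bits into whole bytes, so unless the number of hash bits is a multiple of 8, hash boundaries fall mid-byte and the packed output is useless.

```python
import numpy as np

# 16 hash bits (a multiple of 8) pack into exactly two clean bytes.
bits = np.array([1, 0, 1, 1, 0, 0, 1, 0,
                 1, 1, 1, 1, 0, 0, 0, 0], dtype=np.uint8)
packed = np.packbits(bits)   # 0b10110010 -> 178, 0b11110000 -> 240
```

With, say, 12 bits instead, `np.packbits` would zero-pad the last byte, so consecutive hashes in a flat array would no longer start on byte boundaries.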

Member

Oh, I get it. It's failing an invariance test. The easiest option is to add it to DONT_TEST in sklearn.utils.testing which may be reasonable in this case.

Member

the other option is to change the default n_components in GaussianRandomProjectionsHash

Member

i.e. by defining __init__

Contributor Author

I did this:

class GaussianRandomProjectionHash(ProjectionToHashMixin,
                                   GaussianRandomProjection):
    """Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
    def __init__(self,
                 n_components=8,
                 random_state=None):
        super(GaussianRandomProjectionHash, self).__init__(
            n_components=n_components,
            random_state=random_state)

But still getting the same error.

Member

If you commit the code, I could see the error... or you could paste the error in more detail. Which test is failing?

Contributor Author

I think we should use the other solution you mentioned, because this method fails anyway: HASH_DTYPE, which is specific to this module, is used in _to_hash.

@maheshakya
Contributor Author

Is there anything else that needs to be done except applying LSH forest in regression and classification?

@jnothman
Member

I think I need to give this another look over, then I expect it will have
my support for merge. Then you need a second supporter. In my opinion,
there are a number of extensions to this contribution, both internal to
LSHForest and in terms of its integration in
classification/regression/clustering, which can be done in later PRs, by
other contributors. If we're happy with the API, this PR should have a
feature freeze.

In any case I think this PR needs to end. It's a fairly hefty page just to
download and render in my web browser or switch tabs. If we're going to
have any more conversation, I'd consider closing this PR and opening a new
one.


@maheshakya
Contributor Author

I have made a small modification to the description of the hyperparameters example, as it didn't reflect the latest changes.

Please have a look at those descriptions and the documentation as well.

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 2edabc8 on maheshakya:lsh_forest into 3f49cee on scikit-learn:master.

@maheshakya
Contributor Author

New PR created #3894
