
[MRG] Locality Sensitive Hashing for approximate nearest neighbor search (GSoC) #3304

Closed
maheshakya wants to merge 119 commits into scikit-learn:master from maheshakya:lsh_forest

Conversation

@maheshakya
Contributor

No description provided.

Member

Please store all arguments to the constructor just as they are passed in. Please see other uses of check_random_state and where it is called.
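The convention referenced here can be sketched as follows. This is a minimal, hypothetical estimator (not the PR's actual class), with a numpy-only stand-in for `sklearn.utils.check_random_state` so the sketch is self-contained:

```python
import numpy as np

def check_random_state(seed):
    # Minimal stand-in for sklearn.utils.check_random_state: turn
    # None, an int seed, or a RandomState into a RandomState instance.
    if seed is None:
        return np.random.RandomState()
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)

class LSHForestSketch:
    # Hypothetical estimator illustrating the convention.
    def __init__(self, n_estimators=10, random_state=None):
        # Store constructor arguments exactly as passed in --
        # no validation, no check_random_state call here.
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X):
        # Only at fit time is the stored value converted.
        rng = check_random_state(self.random_state)
        self.hyperplanes_ = rng.randn(self.n_estimators, X.shape[1])
        return self
```

Keeping `__init__` free of any conversion means `get_params`/`set_params` and cloning see exactly what the user passed.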

Contributor Author

Done.

Replaced numpy searchsorted in _bisect_right with the previous version.
@maheshakya
Contributor Author

@jnothman, I'm adding an insert operation to this data structure. I suppose that could help with incremental learning.

The insert operation allows new data points to be inserted into the fitted set of trees.
(Can be used in incremental learning?)
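The insert idea could be sketched like this. This is a hypothetical helper: `sorted_hashes` and `indices` stand in for one fitted tree's sorted hash array and the sample indices it points to, not the PR's actual internals:

```python
import numpy as np

def insert_point(sorted_hashes, indices, new_hash, new_index):
    # Find the position that keeps the tree's hash array sorted, so
    # binary-search queries keep working after the insert.
    pos = np.searchsorted(sorted_hashes, new_hash)
    return (np.insert(sorted_hashes, pos, new_hash),
            np.insert(indices, pos, new_index))
```

Each tree in the forest would apply this independently for every new sample; note `np.insert` copies the array, so each insert is O(n).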

Changed parameter m to n_neighbors.
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling 802ed5f on maheshakya:lsh_forest into aaefdbd on scikit-learn:master.

@jnothman
Member

Did I say something about incremental learning?

@maheshakya
Contributor Author

Yes.
Check issue #3175

@jnothman
Member

Ahh =) That was merely a ping.


Member

You might want to have a look at the random projection module.

Member

A random sign projection transformer could have its place in that module.

Member

Try this first: http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html
As a second optimisation, consider how it might be possible to compute all the trees (and so on) in one numpy operation, to get rid of the per-tree loop. Dot products are your friend!

Contributor Author

Yes, if all the hash functions are computed in advance, I think it's possible to get rid of the loop. I'll give it a try.
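Computing every tree's hash bits in a single dot product, as suggested, might look like this (an illustrative sketch, not the PR's implementation; all names are assumptions):

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_features, n_trees, hash_size = 6, 5, 3, 32

X = rng.randn(n_samples, n_features)
# Stack every tree's hyperplanes into one (n_features, n_trees * hash_size)
# matrix so one matrix product replaces the per-tree loop.
W = rng.randn(n_features, n_trees * hash_size)

# Sign of each projection gives one hash bit, for all trees at once.
bits = (np.dot(X, W) > 0).astype(np.uint8)
hashes = bits.reshape(n_samples, n_trees, hash_size)
```

Each `hashes[i, t]` row is then the `hash_size`-bit fingerprint of sample `i` in tree `t`.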

For the _bisect_right() function, a transformed x is passed. The transformation replaces the characters after hash length h with '1's.

Used random_projection module.
GaussianRandomProjection in the random_projection module is used to perform the hashing for the random projections LSH method.
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling d8e521b on maheshakya:lsh_forest into 82611e8 on scikit-learn:master.

Removed lshashing in feature extraction and added that functionality to the LSHForest class. If other hashing algorithms are to be implemented, a separate lshashing class may be required.
@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling 57d9412 on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Nov 5, 2014

Is there an easy way to tell python to prefer the version in the current directory?

Honestly it's better to be explicit and install the version you want (or learn to use virtualenv if you want to switch between many different versions of a project on the same host).

To find the folder of the version that gets picked up when you import sklearn just do:

python -c "import sklearn; print(sklearn.__path__[0])"

Check that this matches what pip sees:

pip show scikit-learn

If not, it means that the pip command that you have in your PATH does not use the python command that you have in your PATH.

Uninstall previous versions of scikit-learn with:

pip uninstall scikit-learn

If you don't have pip installed you can delete the folder returned by python -c "import sklearn; print(sklearn.__path__[0])".

Then do:

pip show scikit-learn
python -c "import sklearn; print(sklearn.__path__[0])"

to check that you no longer have any version of scikit-learn installed for this combo of python & pip.

Then go into the scikit-learn source folder, checkout this branch and do:

python setup.py build_ext --inplace  # builds the compiled extension in the source folder
python setup.py develop

or alternatively:

 pip install --editable /path/to/scikit-learn

In both cases this should install scikit-learn in development mode, meaning that when you import sklearn it should use the live source code from your local git repository.

Again you can check with:

python -c "import sklearn; print(sklearn.__path__[0])"
pip show scikit-learn

@ogrisel
Member

ogrisel commented Nov 5, 2014

In the meantime I can throw out a hypothesis: under the cosine metric, if the blobs are all far from zero, then most hyperplanes do not separate the data within a blob at all, or not much. High dimensions are weird. In this case, many points would clash even on 32 hash bits.

This is what I suspect. Maybe thresholding at np.median(X_projected[:1000], axis=0) instead of 0 might help find better splits. Otherwise, we could use random samples from X as hyperplanes instead of random Gaussian vectors, possibly combined with the median intercept.

In this case an implementation should still only look at a bounded number of candidates, thus drop accuracy, not speed, IMO.

Not sure: if samples all collide in 10 buckets (e.g. uniformly) instead of 2 ** 32, we should get an approximately constant ~10 / n_estimators x speedup w.r.t. brute force, and therefore linear scaling.

Or it could be a plain bug, etc.

This is always a possibility.
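The median-threshold idea above can be sketched as follows (illustrative only; the data and array names are assumptions, not the PR's code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 5) + 10.0      # a blob far from the origin
W = rng.randn(5, 32)               # random Gaussian hyperplanes
X_projected = np.dot(X, W)

# Thresholding at 0 tends to make each bit nearly constant for a
# distant blob, since the projections rarely change sign; thresholding
# at the per-component median of a data sample splits every bit
# roughly 50/50 by construction.
zero_bits = X_projected > 0
median_bits = X_projected > np.median(X_projected[:1000], axis=0)
```

The median split maximises the entropy of each bit on the sampled data, at the cost of making the hash family data-dependent.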

@daniel-vainsencher

Sorry, I cannot help with the practical investigation.

Defining data-dependent hash families is, to my knowledge, more of an art than a science, and out of scope for this PR. Unless practical testing turns up a bug, I propose to document that implementation details limit the speed-up to an application-dependent constant, so YMMV.

Another relatively easy option is to go to 64 hash bits (but I would still not do it for this PR).

Daniel


Member

*dimension

@ogrisel
Member

ogrisel commented Nov 14, 2014

@maheshakya Please consider the following doc improvement: maheshakya#10

@ogrisel
Member

ogrisel commented Nov 14, 2014

Defining data dependent hash families, to my knowledge, is more of an art than a science and out of scope for this PR.

Agreed. Although it might be interesting to consider the refactoring proposed by @jnothman as a means to make it possible for users to implement such data-dependent hash families themselves. In particular I would like to experiment with combining RandomTreesEmbedding and sparse random projections.

Unless practical testing turns up a bug, I propose to document that implementation details limit the speed up to an application dependent constant, so YMMV.

Yes after more testing with tweaking the parameters of the scalability example I think that the observed profile is ok. I updated the doc to reflect my findings in the changes I proposed in maheshakya#10.

Another relatively easy option is to go to 64 hash bits (but I would still not do it for this PR).

I have tried (this is easy with a2fe63a, just replace '>u4' by '>u8'): it does not seem to impact the scalability profile significantly but using 64 bit integers has a performance overhead. Let's stick to 32 bit for now.

Member

You need an arg y=None here and in transform to stop Travis complaining

Contributor Author

Thanks.

A test still fails because the fitted X does not have a dimensionality that is a multiple of 8. Perhaps we should remove the ValueError, since we are using the predefined hash size MAX_HASH_SIZE = np.dtype(HASH_DTYPE).itemsize * 8. But it may be a problem if this is being used somewhere else.
Is there a workaround for this?

Member

X need not have a dimensionality of a multiple of 8. GaussianProjections().fit_transform(X) must have a multiple of 8 features. I don't understand why the error could be firing if n_components is defined to be a multiple of 8.

Do not remove the exception. The usage of this will change over time, and it is an important assertion, without which packbits will give us something useless (because hash boundaries will be mid-byte).
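The constraint being discussed can be demonstrated directly with a minimal sketch: `np.packbits` packs bits into whole bytes, so unless the number of hash bits is a multiple of 8, hash boundaries fall mid-byte and the packed output is useless.

```python
import numpy as np

# 16 hash bits (a multiple of 8) pack into exactly two clean bytes.
bits = np.array([1, 0, 1, 1, 0, 0, 1, 0,
                 1, 1, 1, 1, 0, 0, 0, 0], dtype=np.uint8)
packed = np.packbits(bits)   # 0b10110010 -> 178, 0b11110000 -> 240
```

With, say, 12 bits instead, `np.packbits` would zero-pad the last byte, so consecutive hashes in a flat array would no longer start on byte boundaries.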

Member

Oh, I get it. It's failing an invariance test. The easiest option is to add it to DONT_TEST in sklearn.utils.testing which may be reasonable in this case.

Member

the other option is to change the default n_components in GaussianRandomProjectionsHash

Member

i.e. by defining __init__

Contributor Author

I did this:

class GaussianRandomProjectionHash(ProjectionToHashMixin,
                                   GaussianRandomProjection):
    """Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
    def __init__(self,
                 n_components=8,
                 random_state=None):
        super(GaussianRandomProjectionHash, self).__init__(
            n_components=n_components,
            random_state=random_state)

But still getting the same error.

Member

If you commit the code, I could see the error... or you could paste the error in more detail. Which test is failing?

Contributor Author

I think we should use the other solution you mentioned, because this method fails anyway: HASH_DTYPE, which is specific to this module, is used in _to_hash.

@maheshakya
Contributor Author

Is there anything else that needs to be done except applying LSH forest in regression and classification?

@jnothman
Member

I think I need to give this another look over, then I expect it will have
my support for merge. Then you need a second supporter. In my opinion,
there are a number of extensions to this contribution, both internal to
LSHForest and in terms of its integration in
classification/regression/clustering, which can be done in later PRs, by
other contributors. If we're happy with the API, this PR should have a
feature freeze.

In any case I think this PR needs to end. It's a fairly hefty page just to
download and render in my web browser or switch tabs. If we're going to
have any more conversation, I'd consider closing this PR and opening a new
one.


@maheshakya
Contributor Author

I have made a small modification to the description of the hyperparameters example, as it didn't reflect the latest changes.

Please have a look at those descriptions and the documentation as well.

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 2edabc8 on maheshakya:lsh_forest into 3f49cee on scikit-learn:master.

@maheshakya
Contributor Author

New PR created #3894
