[MRG] Locality Sensitive Hashing for approximate nearest neighbor search(GSoC)#3304
maheshakya wants to merge 119 commits into scikit-learn:master from …
Conversation
Rewritten _bisect_right with numpy searchsorted. Updated examples.
sklearn/neighbors/lsh_forest.py
Please store all arguments to the constructor just as they are passed in. Please see other uses of check_random_state and where it is called.
Replaced numpy searchsorted in _bisect_right with the previous version.
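For context, here is a minimal sketch (assumed names and data, not the PR's actual code) of the two approaches being swapped back and forth here. On a sorted array, Python's `bisect` and NumPy's `searchsorted` return the same insertion point:

```python
# Illustrative sketch: locating the insertion point of a query hash
# in a sorted array of hashes, via bisect and via np.searchsorted.
import bisect
import numpy as np

sorted_hashes = np.array([3, 7, 7, 12, 20])
query = 7

# Pure-Python version, analogous to the original _bisect_right:
pos_bisect = bisect.bisect_right(sorted_hashes.tolist(), query)

# NumPy version that was tried and then reverted:
pos_np = int(np.searchsorted(sorted_hashes, query, side='right'))

assert pos_bisect == pos_np == 3
```

For a single scalar query the pure-Python bisect is typically faster, since `np.searchsorted` has per-call overhead; `searchsorted` pays off when many queries are processed in one vectorized call, which is a plausible reason the change was reverted.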
@jnothman, I'm adding an insert operation into this data structure. I suppose that could help in incremental learning.
The insert operation allows inserting new data points into the fitted set of trees. (Can it be used in incremental learning?) Changed parameter m to n_neighbors.
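The insert idea can be sketched roughly as follows (hypothetical names; the PR's actual implementation may differ): each fitted tree keeps its hashed points in sorted order, so a new point can be placed without refitting:

```python
# Hypothetical sketch of inserting a new hashed point into one tree's
# sorted hash array, keeping it sorted for later binary searches.
import numpy as np

tree_hashes = np.array([2, 5, 9, 14])        # sorted hashes of one tree
new_hash = 7                                 # hash of the new data point

pos = np.searchsorted(tree_hashes, new_hash)
tree_hashes = np.insert(tree_hashes, pos, new_hash)
# tree_hashes is now [2, 5, 7, 9, 14], still sorted
```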
Did I say something about incremental learning?

Yes.

Ahh =) That was merely a ping.
You might want to have a look at the random projection module.
A random sign projection transformer could have its place in that module.
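A random sign projection of that kind is just the sign bit of a Gaussian random projection, the standard cosine-LSH construction. A minimal sketch (illustrative shapes, not scikit-learn API):

```python
# Sketch of a random sign projection: project onto random Gaussian
# hyperplanes and keep only the sign of each projection (cosine LSH).
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(5, 10)             # 5 samples, 10 features
hyperplanes = rng.randn(10, 8)   # 8 random directions
bits = (X @ hyperplanes) > 0     # (5, 8) boolean sign matrix
```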
sklearn/neighbors/lsh_forest.py
Try this first: http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html
As a second optimisation, consider how it might be possible to compute all the trees (and so on) in one numpy operation etc, to get rid of the previous loop. Dot products are your friend!
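The suggested vectorization can be sketched like this (illustrative shapes only, not the PR's code): stack the hyperplanes of all trees into one matrix, so hashing every sample against every tree is a single dot product:

```python
# Sketch: hash 100 samples against all trees with one matrix product
# instead of looping over trees in Python.
import numpy as np

n_trees, hash_size, n_features = 10, 32, 64
rng = np.random.RandomState(0)

# All hyperplanes of all trees stacked column-wise:
all_hyperplanes = rng.randn(n_features, n_trees * hash_size)

X = rng.randn(100, n_features)
bits = (X @ all_hyperplanes > 0).reshape(100, n_trees, hash_size)
```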
Yes, if all the hash functions are computed in advance, I think it's possible to get rid of the loop. I'll give it a try.
For the _bisect_right() function, a transformed x is passed. The transformation replaces the characters after the hash length h with '1's. Used the random_projection module: GaussianRandomProjection is used to perform the hashing for the random projection LSH method.
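The padding trick described above can be sketched as follows (toy 4-bit string hashes, hypothetical values): with hashes stored as sorted binary strings, padding the query's first h characters with '1's gives the upper bound of the block sharing that prefix, and padding with '0's gives the lower bound:

```python
# Sketch of the prefix-range trick: find all hashes sharing the first
# h bits of the query by bisecting with padded upper/lower keys.
import bisect

hashes = sorted(['0010', '0100', '0101', '0110', '1011'])
query, h = '0110', 2                     # match on the '01' prefix

hi = bisect.bisect_right(hashes, query[:h] + '1' * (4 - h))
lo = bisect.bisect_left(hashes, query[:h] + '0' * (4 - h))
assert hashes[lo:hi] == ['0100', '0101', '0110']
```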
Removed lshashing from feature extraction and added that functionality to the LSHForest class. If other hashing algorithms are to be implemented, a separate lshashing class may be required.
Honestly it's better to be explicit and install the version you want (or learn to use virtualenv if you want to switch between many different versions of a project on the same host). To find the folder of the version that gets picked up when you import, run: … and check that this matches what pip sees: … If not, it means that … Uninstall previous versions of scikit-learn with: … If you don't have pip installed, you can delete the folder returned by the command above. Then do: … to check that you no longer have any version of scikit-learn installed for this combo of python & pip. Then go into the scikit-learn source folder, checkout this branch and do: … or alternatively: … In both cases this should install scikit-learn in development mode, meaning that the checkout is picked up when you import. Again you can check with: …
This is what I suspect. Maybe thresholding at …

Not sure: if samples all collide in 10 buckets (e.g. uniformly) instead of …

This is always a possibility.
Sorry, I cannot help with the practical investigation. Defining data-dependent hash families, to my knowledge, is more of an … Another relatively easy option is to go to 64 hash bits (but I would …).

Daniel
sklearn/neighbors/approximate.py
@maheshakya Please consider the following doc improvement: maheshakya#10
Agreed. Although it might be interesting to consider the refactoring proposed by @jnothman as a means to make it possible for users to implement such data-dependent hash families by themselves. In particular I would like to experiment with combining RandomTreesEmbedding and sparse random projections.
Yes, after more testing with tweaking the parameters of the scalability example, I think that the observed profile is ok. I updated the doc to reflect my findings in the changes I proposed in maheshakya#10.
I have tried (this is easy with a2fe63a, just replace …
Various improvements for LSHForest
FIX broken doctest
sklearn/neighbors/approximate.py
You need an arg y=None here and in transform to stop Travis complaining
Thanks.
A test still fails because the fitted X does not have a dimensionality that is a multiple of 8. Perhaps we should remove the ValueError, since we are using the predefined hash size MAX_HASH_SIZE = np.dtype(HASH_DTYPE).itemsize * 8. But it may be a problem if this is being used somewhere else.
Is there a workaround for this?
X need not have a dimensionality of a multiple of 8. GaussianRandomProjection().fit_transform(X) must have a multiple of 8 features. I don't understand why the error could be firing if n_components is defined to be a multiple of 8.
Do not remove the exception. The usage of this will change over time, and it is an important assertion, without which packbits will give us something useless (because hash boundaries will be mid-byte).
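To illustrate why the multiple-of-8 assertion matters (a sketch, not the module's code): np.packbits packs bits into whole bytes and silently zero-pads a partial final byte, so without the check, hash boundaries would fall mid-byte:

```python
# np.packbits groups bits into bytes; a non-multiple-of-8 bit count is
# silently zero-padded on the right, corrupting hash boundaries.
import numpy as np

bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
assert np.packbits(bits).tolist() == [178]       # 0b10110010

# Only 5 bits: padded to 0b10110000, with no error raised:
assert np.packbits(bits[:5]).tolist() == [176]
```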
Oh, I get it. It's failing an invariance test. The easiest option is to add it to DONT_TEST in sklearn.utils.testing which may be reasonable in this case.
The other option is to change the default n_components in GaussianRandomProjectionHash.
I did this:
class GaussianRandomProjectionHash(ProjectionToHashMixin,
                                   GaussianRandomProjection):
    """Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
    def __init__(self,
                 n_components=8,
                 random_state=None):
        super(GaussianRandomProjectionHash, self).__init__(
            n_components=n_components,
            random_state=random_state)
But still getting the same error.
If you commit the code, I could see the error... or you could paste the error in more detail. Which test is failing?
I think we should use the other solution you mentioned, because this method fails anyway, as HASH_DTYPE, which is specific to this module, is used in _to_hash.
Is there anything else that needs to be done except applying LSH forest in regression and classification?
I think I need to give this another look over, then I expect it will have … In any case I think this PR needs to end. It's a fairly hefty page just to …
I have made a small modification to the description of the hyper-parameters example, as it didn't reflect the latest changes. Please have a look at those descriptions and the documentation as well.
New PR created: #3894