Skip to content

[WIP] allow nearest neighbors algorithm to be an estimator#3922

Closed
jnothman wants to merge 5 commits intoscikit-learn:masterfrom
jnothman:neighbors
Closed

[WIP] allow nearest neighbors algorithm to be an estimator#3922
jnothman wants to merge 5 commits intoscikit-learn:masterfrom
jnothman:neighbors

Conversation

@jnothman
Copy link
Copy Markdown
Member

@jnothman jnothman commented Dec 1, 2014

This works towards allowing an LSHForest (see #3894) instance as a valid value for the algorithm parameter to nearest neighbors classes (and the neighbors_algorithm parameter elsewhere) while minimising extra support code.

The idea is that it should be possible to pass as algorithm any estimator that has fit and at least one of kneighbors or radius_neighbors implemented. Until this PR, KDTree and BallTree instead take the data upon construction (lacking fit) and have query/query_radius with extremely similar interface, instead of using the public API names. Additionally, query_radius sends its return values in the reverse order for no apparent reason. This PR thus includes backwards-incompatible changes to the KDTree and BallTree semi-public APIs, to make them fit this mould, the main issue being not accepting data upon construction. Does this seem reasonable, or should we use deprecation with some type sniffing in the constructor?

Another change is that previously radius_neighbors would return a 2d array or a 1d array of arrays, depending on whether the number of neighbors within radius for all queries was equal, but only if algorithm='brute'. This PR changes it to alway return an array of arrays.

TODO:

  • placate Travis who is upset by repr differences for arrays of arrays across numpy versions
  • test returning array of objects from radius_neighbors
  • test and document estimator as value of algorithm

(One annoyance of this approach is that the metric parameters to e.g. KNeighborsClassifier go ignored when algorithm is an LSHForest or similar.)

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Dec 1, 2014

Overall +1 with the spirit of this refactoring. @jakevdp you might want to have a look at this.

Another change is that previously radius_neighbors would return a 2d array or a 1d array of arrays, depending on whether the number of neighbors within radius for all queries was equal, but only if algorithm='brute'. This PR changes it to alway return an array of arrays.

Good catch, we got hit by this inconsistency when working on #3919 this WE.

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Dec 1, 2014

The reason query_radius sends outputs in the reverse order is that it was following the convention set by scipy's cKDTree. That seemed to matter at the time (because we were using BallTree as a drop-in replacement for cKDTree) but I agree that now it's pretty silly.

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Dec 1, 2014

One comment: I never really intended KDTree and BallTree to be estimators. I intended them to be work-horses to be used by estimators. That's why they don't have the classic fit, predict, etc. methods, and that's why I created the KNeighbors and RadiusNeighbors classes that use them within the standard sklearn interface. Again, this was because originally the KDTree functionality was supplied by scipy's cKDTree.

I'm not sure it's worth doing a backward-incompatible change to these classes. I know people who use their "raw" version whose code would break with these changes, and the benefit of making the change seems pretty scant. People who want to use the sklearn interface can just use KNeighbors and RadiusNeighbors.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 1, 2014

I realise that BallTree and KDTree aren't intended to be estimators, merely data structures. But something like LSHForest which has a number of parameters etc. needs to be supported in place of BallTree and KDTree, so it makes sense that they have the same interface. Given its need to support set_params etc, it's clear that LSHForest also needs fit(). Additionally, having kneighbors and radius_neighbors on a containing estimator with identical function to query and query_radius is a bit confusing.

I'm happy to make the changes backwards compatible, deprecating the old API.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 1, 2014

Ah. Thanks for the clarification of the consistency intended with scipy.spatial.

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Dec 2, 2014

My experience is that deprecation cycles only make us feel better, not the users. I've seen this in trying to support astroML, which relies on scikit-learn, and therefore must account for a wide, wide range of versions that users have on their system. Every deprecation cycle leads to a LOT of work for package maintainers whose packages rely on sklearn. I guess I'm just not convinced that the benefit of breaking the old behavior here outweighs the costs, deprecation cycle or no.

How about this: we can let *Neighbors* estimators accept either a string or an object for their method argument: if a string, then it can use BallTree or KDTree. If another object, then proceed in the way you have in mind. That gives all the flexibility you're striving for, without adding any API changes to the library. To remove lots of special cases within the code, you could create a lightweight wrapper object for BallTree and KDTree that conforms to the API you have in mind for LSHforest. How does that sound?

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 2, 2014

That's a fair statement. I guess I'm okay with not making changes to KDTree/BallTree. I just think the inconsistency is awkward (particularly with thing like return value reversal) and makes the code unnecessarily complex.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 2, 2014

[Aside:

Btw, the suggestion that the return value order for query_radius was influenced by cKDTree doesn't make sense from at least the current version of scipy.spatial:

  • cKDTree.query returns dist, ind
  • cKDTree.query_ball_point returns ind
  • BinaryTree.query_radius returns ind, dist (iff return_distances)
  • BinaryTree.query returns dist, ind (iff return_distances)
  • NearestNeighbors.{kneighbors,radius_neighbors} returns dist, ind

At least renaming query_radius to radius_neighbors, even without adding fit, would provide an opportunity to remedy this anomaly.

]

@GaelVaroquaux
Copy link
Copy Markdown
Member

My experience is that deprecation cycles only make us feel better, not
the users. I've seen this with trying to support astroML, which relies
on scikit-learn, and therefore must account for a wide, wide range of
versions that users have on their system. Every deprecation cycle leads
to a LOT of work for package maintainers whose packages rely on
sklearn. I guess I'm just not convinced that the benefit of breaking
the old behavior here outweighs the costs, deprecation cycle or no.

Well, I guess that it is on a case by case basis. But I do agree with you
that deprecations are something very very costly for users.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 2, 2014

@maheshakya, we may still be able to offer a string-based initialisation of the LSHForest without any promises that it'll do as good a job as importing from a different module if necessary and constructing your own. In any case, I don't think there's any dispute over needing to support custom approximations to nearest neighbor search as widely as possible.

But I do agree with you that deprecations are something very very costly for users.

This is an important fact to keep in mind. Previously we have excused some API changes on the basis of we're still pre-v 1.0. I had thought it might also be okay to change KDTree and BallTree because they are mostly accessed indirectly in their primary uses in scikit-learn, but I had also underestimated how long they have been around. I am okay with not changing these, but think the calling code could be much simplified by renaming query to kneighbors and query_radius to radius_neighbors and sorting out the return value ordering of the latter.

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Dec 3, 2014

I think the calling code could be much simplified by renaming query to kneighbors and query_radius to radius_neighbors and sorting out the return value ordering of the latter.

That sounds fine to me. We could mark query and query_radius as deprecated in the doc strings.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 3, 2014

Okay, it sounds like we have a happy middle ground. I'll make it happen,
perhaps not for a couple of weeks!

On 4 December 2014 at 03:35, Jake Vanderplas notifications@github.com
wrote:

I think the calling code could be much simplified by renaming query to
kneighbors and query_radius to radius_neighbors and sorting out the return
value ordering of the latter.

That sounds fine to me. We could mark query and query_radius as
deprecated in the doc strings.


Reply to this email directly or view it on GitHub
#3922 (comment)
.

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Dec 5, 2014

One comment that just occurred to me: I've recently been looking through the stats literature, and noticed that people tend to refer to "k neighbors" vs "epsilon neighbors", rather than "radius neighbors". Has anyone else seen that? If that's a common nomenclature, then perhaps the new method should be named accordingly.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Dec 5, 2014

and scipy describes them as ball queries. Still, the point is to provide
internal consistency with the names already provided by the nearest
neighbor estimators. I think changing from radius_neighbors adds very
little.

On 5 December 2014 at 11:49, Jake Vanderplas notifications@github.com
wrote:

One comment that just occurred to me: I've recently been looking through
the stats literature, and noticed that people tend to refer to "k
neighbors" vs "epsilon neighbors", rather than "radius neighbors". Has
anyone else seen that? If that's a common nomenclature, then perhaps the
new method should be named accordingly.


Reply to this email directly or view it on GitHub
#3922 (comment)
.

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Dec 5, 2014

Makes sense.

@GaelVaroquaux
Copy link
Copy Markdown
Member

people tend to refer to "k neighbors" vs "epsilon neighbors", rather
than "radius neighbors". Has anyone else seen that?

I've seen that, but I find that it is way more obscure for someone
outside the field.

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Dec 5, 2014

+1 radius neighbors query is more explicit to me. We should put the alternative names (ball queries and epsilon queries in the docstring and the narrative doc for googlability though).

@jnothman jnothman force-pushed the neighbors branch 3 times, most recently from 5e9d607 to b924ab9 Compare December 21, 2014 04:08
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.01%) when pulling d454eed on jnothman:neighbors into cbf6c7e on scikit-learn:master.

@maheshakya
Copy link
Copy Markdown
Contributor

It seems array_of_arrays method doesn't do its job in NUMPY_VERSION="1.6.2"

@jnothman
Copy link
Copy Markdown
Member Author

No, just change in __repr__

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.01%) when pulling d5a91e8 on jnothman:neighbors into 8a197e1 on scikit-learn:master.

@jnothman
Copy link
Copy Markdown
Member Author

I've got this mostly done... But given that it means the metric and metric_params and p are ignored, maybe this is the wrong design (although the need for consistent naming of kneighbors and radius_neighbors remains).

Instead of allowing algorithm=LSHForest(), we could more explicitly support algorithm='lshforest' (or algorithm='approximate') such that the metric and its parameters are passed onto LSHForest as they are for KDTree and BallTree. In addition, an algorithm_params parameter would be required to tune the approximation.

@jnothman jnothman changed the title [WIP] allow nearest neighbors algorithm to be an estimator [MRG] allow nearest neighbors algorithm to be an estimator Dec 25, 2014
@jnothman jnothman changed the title [MRG] allow nearest neighbors algorithm to be an estimator [WIP] allow nearest neighbors algorithm to be an estimator Dec 25, 2014
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.01%) when pulling 6bddd96 on jnothman:neighbors into d5c72f3 on scikit-learn:master.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Jan 5, 2015

Btw, @jakevdp, a thought: would it be possible for BinaryTree to support incremental updates? In that case, supporting fit makes a lot of sense in anticipation of partial_fit.

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Jan 5, 2015

(But a quick search suggests that partial_fit isn't a real prospect for BinaryTree.)

@jakevdp
Copy link
Copy Markdown
Member

jakevdp commented Jan 5, 2015

No, incremental updates are not really possible with the memory model it uses. It would be more efficient just to re-build the tree on the new dataset.

@nelson-liu
Copy link
Copy Markdown
Contributor

@jnothman what's the status on this PR?

@jnothman
Copy link
Copy Markdown
Member Author

@jnothman what's the status on this PR?

Good question. I think it got a bit disrupted by Real Life, my dissatisfaction with LSHForest performance in practice (particularly given that it only supports cosine sim), etc. I also was not especially happy with how the API looks after these changes. Perhaps it's worth completing soon, whether to enable LSHForest or other ANN implementations.

@rth
Copy link
Copy Markdown
Member

rth commented Apr 4, 2017

@jnothman It is very tempting to try other ANN implementations (cf. https://github.com/erikbern/ann-benchmarks ) as backend of NearestNeighbors...
Are you still planning to work on this PR or could I give it a try?

@jnothman
Copy link
Copy Markdown
Member Author

jnothman commented Apr 4, 2017

Go ahead, please @rth! Sorry I've not got to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Superseded PR has been replace by a newer PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants