[WIP] allow nearest neighbors algorithm to be an estimator by jnothman · Pull Request #3922 · scikit-learn/scikit-learn

jnothman · 2014-12-01T04:56:07Z

This works towards allowing an LSHForest (see #3894) instance as a valid value for the algorithm parameter to nearest neighbors classes (and the neighbors_algorithm parameter elsewhere) while minimising extra support code.

The idea is that it should be possible to pass as algorithm any estimator that has fit and at least one of kneighbors or radius_neighbors implemented. Until this PR, KDTree and BallTree ~~instead take the data upon construction (lacking fit) and~~ have query/query_radius with extremely similar interface, instead of using the public API names. Additionally, query_radius sends its return values in the reverse order for no apparent reason. This PR thus includes backwards-incompatible changes to the KDTree and BallTree semi-public APIs, to make them fit this mould, the main issue being not accepting data upon construction. Does this seem reasonable, or should we use deprecation with some type sniffing in the constructor?
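
A minimal sketch of the duck-typing rule described above: an object qualifies if it has `fit` plus at least one of `kneighbors` or `radius_neighbors`. The helper name `is_neighbors_algorithm` and both toy classes are hypothetical and are not part of this PR's diff.

```python
def is_neighbors_algorithm(obj):
    """Return True if obj looks like a usable nearest-neighbors backend.

    Hypothetical helper: requires a callable ``fit`` and at least one of
    ``kneighbors`` or ``radius_neighbors``.
    """
    has_fit = callable(getattr(obj, "fit", None))
    has_query = any(
        callable(getattr(obj, name, None))
        for name in ("kneighbors", "radius_neighbors")
    )
    return has_fit and has_query


class OnlyKNeighbors:
    """Toy backend implementing fit and kneighbors only."""

    def fit(self, X):
        self.X_ = list(X)
        return self

    def kneighbors(self, X, n_neighbors=1):
        raise NotImplementedError("illustration only")


class DataAtConstruction:
    """Toy stand-in for the older style: data in __init__, query() method."""

    def __init__(self, X):
        self.X = list(X)

    def query(self, X, k=1):
        raise NotImplementedError("illustration only")
```

Under this check `OnlyKNeighbors()` would be accepted, while an object that only takes data at construction and exposes `query` would not, which is the gap the proposed renaming is meant to close.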

Another change is that previously radius_neighbors would return a 2d array or a 1d array of arrays, depending on whether the number of neighbors within radius for all queries was equal, but only if algorithm='brute'. This PR changes it to always return an array of arrays.
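
To illustrate the consistency being aimed for, here is a brute-force sketch in plain Python, where a list of lists stands in for a NumPy object array of arrays (that substitution is an assumption made to keep the example dependency-free). The function is illustrative and is not the code in this PR.

```python
import math


def radius_neighbors(train, queries, radius):
    """Brute-force radius search over points given as tuples.

    Always returns one list of indices per query, whether or not every
    query happens to have the same number of neighbors.
    """
    results = []
    for q in queries:
        hits = [
            i
            for i, p in enumerate(train)
            if math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) <= radius
        ]
        results.append(hits)
    return results


train = [(0.0,), (1.0,), (2.0,), (10.0,)]
# Every query has two neighbors: still one inner list per query.
equal = radius_neighbors(train, [(0.5,), (1.5,)], 0.6)
# Queries have three and one neighbors: same outer container type.
ragged = radius_neighbors(train, [(1.0,), (10.0,)], 1.0)
```

With NumPy, the equivalent would typically be an array pre-allocated with `dtype=object` and filled one query at a time, so that equal-length results are not silently stacked into a 2d array.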

TODO:

placate Travis who is upset by repr differences for arrays of arrays across numpy versions
test returning array of objects from radius_neighbors
test and document estimator as value of algorithm

(One annoyance of this approach is that the metric parameters to e.g. KNeighborsClassifier go ignored when algorithm is an LSHForest or similar.)

ogrisel · 2014-12-01T08:14:58Z

Overall +1 with the spirit of this refactoring. @jakevdp you might want to have a look at this.

> Another change is that previously radius_neighbors would return a 2d array or a 1d array of arrays, depending on whether the number of neighbors within radius for all queries was equal, but only if algorithm='brute'. This PR changes it to always return an array of arrays.

Good catch, we got hit by this inconsistency when working on #3919 this WE.

jakevdp · 2014-12-01T15:07:54Z

The reason query_radius sends outputs in the reverse order is that it was following the convention set by scipy's cKDTree. That seemed to matter at the time (because we were using BallTree as a drop-in replacement for cKDTree) but I agree that now it's pretty silly.

jakevdp · 2014-12-01T15:11:40Z

One comment: I never really intended KDTree and BallTree to be estimators. I intended them to be work-horses to be used by estimators. That's why they don't have the classic fit, predict, etc. methods, and that's why I created the KNeighbors and RadiusNeighbors classes that use them within the standard sklearn interface. Again, this was because originally the KDTree functionality was supplied by scipy's cKDTree.

I'm not sure it's worth doing a backward-incompatible change to these classes. I know people who use their "raw" version whose code would break with these changes, and the benefit of making the change seems pretty scant. People who want to use the sklearn interface can just use KNeighbors and RadiusNeighbors.

jnothman · 2014-12-01T21:21:53Z

I realise that BallTree and KDTree aren't intended to be estimators, merely data structures. But something like LSHForest which has a number of parameters etc. needs to be supported in place of BallTree and KDTree, so it makes sense that they have the same interface. Given its need to support set_params etc, it's clear that LSHForest also needs fit(). Additionally, having kneighbors and radius_neighbors on a containing estimator with identical function to query and query_radius is a bit confusing.

I'm happy to make the changes backwards compatible, deprecating the old API.

jnothman · 2014-12-01T21:23:32Z

Ah. Thanks for the clarification of the consistency intended with scipy.spatial.

jakevdp · 2014-12-02T00:42:34Z

My experience is that deprecation cycles only make us feel better, not the users. I've seen this in trying to support astroML, which relies on scikit-learn, and therefore must account for a wide, wide range of versions that users have on their system. Every deprecation cycle leads to a LOT of work for package maintainers whose packages rely on sklearn. I guess I'm just not convinced that the benefit of breaking the old behavior here outweighs the costs, deprecation cycle or no.

How about this: we can let *Neighbors* estimators accept either a string or an object for their method argument: if a string, then it can use BallTree or KDTree. If another object, then proceed in the way you have in mind. That gives all the flexibility you're striving for, without adding any API changes to the library. To remove lots of special cases within the code, you could create a lightweight wrapper object for BallTree and KDTree that conforms to the API you have in mind for LSHforest. How does that sound?
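
A rough sketch of what such a lightweight wrapper could look like. `ToyTree` is a hypothetical brute-force stand-in for BallTree/KDTree (data at construction, `query` returning `(dist, ind)`, `query_radius` returning `(ind, dist)`), and `TreeAdapter` and `resolve_algorithm` are invented names for illustration, not scikit-learn API.

```python
import math


def _euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyTree:
    """Hypothetical tree-like structure; NOT the scikit-learn implementation."""

    def __init__(self, data):
        self.data = [tuple(p) for p in data]

    def _ranked(self, q):
        # (distance, index) pairs sorted by increasing distance
        return sorted((_euclidean(p, q), i) for i, p in enumerate(self.data))

    def query(self, X, k=1):
        ranked = [self._ranked(q)[:k] for q in X]
        dist = [[d for d, _ in row] for row in ranked]
        ind = [[i for _, i in row] for row in ranked]
        return dist, ind

    def query_radius(self, X, r):
        # Always returns distances here, for brevity; note the reversed order.
        ranked = [[(d, i) for d, i in self._ranked(q) if d <= r] for q in X]
        ind = [[i for _, i in row] for row in ranked]
        dist = [[d for d, _ in row] for row in ranked]
        return ind, dist


class TreeAdapter:
    """Wrapper exposing fit / kneighbors / radius_neighbors over a tree class."""

    def __init__(self, tree_class):
        self.tree_class = tree_class

    def fit(self, X):
        self.tree_ = self.tree_class(X)
        return self

    def kneighbors(self, X, n_neighbors=1):
        return self.tree_.query(X, k=n_neighbors)

    def radius_neighbors(self, X, radius):
        ind, dist = self.tree_.query_radius(X, radius)
        return dist, ind  # normalise to (dist, ind)


def resolve_algorithm(algorithm):
    """Strings map to wrapped trees; any other object is used as given."""
    named = {"toy_tree": ToyTree}
    if isinstance(algorithm, str):
        return TreeAdapter(named[algorithm])
    return algorithm
```

The tree classes themselves stay untouched in this arrangement; only the adapter knows about the differing method names and return order.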

jnothman · 2014-12-02T02:16:18Z

That's a fair statement. I guess I'm okay with not making changes to KDTree/BallTree. I just think the inconsistency is awkward (particularly with things like return value reversal) and makes the code unnecessarily complex.

jnothman · 2014-12-02T05:50:18Z

[Aside:

Btw, the suggestion that the return value order for query_radius was influenced by cKDTree doesn't make sense from at least the current version of scipy.spatial:

cKDTree.query returns dist, ind
cKDTree.query_ball_point returns ind
BinaryTree.query_radius returns ind, dist (iff return_distances)
BinaryTree.query returns dist, ind (iff return_distances)
NearestNeighbors.{kneighbors,radius_neighbors} returns dist, ind

At least renaming query_radius to radius_neighbors, even without adding fit, would provide an opportunity to remedy this anomaly.

]

GaelVaroquaux · 2014-12-02T05:57:56Z

> My experience is that deprecation cycles only make us feel better, not
> the users. I've seen this with trying to support astroML, which relies
> on scikit-learn, and therefore must account for a wide, wide range of
> versions that users have on their system. Every deprecation cycle leads
> to a LOT of work for package maintainers whose packages rely on
> sklearn. I guess I'm just not convinced that the benefit of breaking
> the old behavior here outweighs the costs, deprecation cycle or no.

Well, I guess that it is on a case by case basis. But I do agree with you
that deprecations are something very very costly for users.

jnothman · 2014-12-02T09:41:04Z

@maheshakya, we may still be able to offer a string-based initialisation of the LSHForest without any promises that it'll do as good a job as importing from a different module if necessary and constructing your own. In any case, I don't think there's any dispute over needing to support custom approximations to nearest neighbor search as widely as possible.

> But I do agree with you that deprecations are something very very costly for users.

This is an important fact to keep in mind. Previously we have excused some API changes on the basis that we're still pre-v1.0. I had thought it might also be okay to change KDTree and BallTree because they are mostly accessed indirectly in their primary uses in scikit-learn, but I had also underestimated how long they have been around. I am okay with not changing these, but think the calling code could be much simplified by renaming query to kneighbors and query_radius to radius_neighbors and sorting out the return value ordering of the latter.

jakevdp · 2014-12-03T16:35:06Z

> I think the calling code could be much simplified by renaming query to kneighbors and query_radius to radius_neighbors and sorting out the return value ordering of the latter.

That sounds fine to me. We could mark query and query_radius as deprecated in the doc strings.
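
One way the rename could be sketched is shown below. Note that it goes a step further than the docstring-only deprecation suggested here by also emitting a runtime `DeprecationWarning`; that extra step is an assumption for illustration. `ToyNeighbors` is hypothetical and returns placeholder values rather than real search results.

```python
import warnings


class ToyNeighbors:
    """Hypothetical class; only the renaming pattern matters here."""

    def kneighbors(self, X, n_neighbors=1):
        # Placeholder result so the example runs without a real index.
        return ("dist", "ind")

    def radius_neighbors(self, X, radius):
        # New name, with the (dist, ind) ordering used by kneighbors.
        return ("dist", "ind")

    def query(self, X, k=1):
        """Deprecated alias of kneighbors."""
        warnings.warn(
            "query is deprecated; use kneighbors instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.kneighbors(X, n_neighbors=k)

    def query_radius(self, X, r):
        """Deprecated alias of radius_neighbors, keeping the old order."""
        warnings.warn(
            "query_radius is deprecated; use radius_neighbors instead",
            DeprecationWarning,
            stacklevel=2,
        )
        dist, ind = self.radius_neighbors(X, radius=r)
        return ind, dist  # old (ind, dist) ordering kept for existing callers
```

Existing callers of `query_radius` keep receiving `(ind, dist)`, while new code can rely on `(dist, ind)` from both new methods.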

jnothman · 2014-12-03T23:16:45Z

Okay, it sounds like we have a happy middle ground. I'll make it happen,
perhaps not for a couple of weeks!


jakevdp · 2014-12-05T00:49:34Z

One comment that just occurred to me: I've recently been looking through the stats literature, and noticed that people tend to refer to "k neighbors" vs "epsilon neighbors", rather than "radius neighbors". Has anyone else seen that? If that's a common nomenclature, then perhaps the new method should be named accordingly.

jnothman · 2014-12-05T00:56:48Z

And scipy describes them as ball queries. Still, the point is to provide
internal consistency with the names already provided by the nearest
neighbor estimators. I think changing from radius_neighbors adds very
little.


jakevdp · 2014-12-05T06:21:53Z

Makes sense.

GaelVaroquaux · 2014-12-05T07:27:54Z

> people tend to refer to "k neighbors" vs "epsilon neighbors", rather
> than "radius neighbors". Has anyone else seen that?

I've seen that, but I find that it is way more obscure for someone
outside the field.

ogrisel · 2014-12-05T09:50:10Z

+1, "radius neighbors query" is more explicit to me. We should put the alternative names (ball queries and epsilon queries) in the docstring and the narrative doc for googlability, though.

coveralls · 2014-12-21T11:29:23Z

Coverage increased (+0.01%) when pulling d454eed on jnothman:neighbors into cbf6c7e on scikit-learn:master.

maheshakya · 2014-12-21T11:50:11Z

It seems array_of_arrays method doesn't do its job in NUMPY_VERSION="1.6.2"

jnothman · 2014-12-21T11:56:51Z

No, it's just a change in __repr__.

coveralls · 2014-12-21T12:57:57Z

Coverage increased (+0.01%) when pulling d5a91e8 on jnothman:neighbors into 8a197e1 on scikit-learn:master.

jnothman · 2014-12-22T02:43:20Z

I've got this mostly done... But given that it means the metric and metric_params and p are ignored, maybe this is the wrong design (although the need for consistent naming of kneighbors and radius_neighbors remains).

Instead of allowing algorithm=LSHForest(), we could more explicitly support algorithm='lshforest' (or algorithm='approximate') such that the metric and its parameters are passed onto LSHForest as they are for KDTree and BallTree. In addition, an algorithm_params parameter would be required to tune the approximation.
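
A sketch of how this string-based alternative might dispatch, forwarding the metric settings and a separate `algorithm_params` dict to the chosen backend. The backend classes, the `BACKENDS` registry and `build_backend` are all hypothetical; `n_estimators` in the usage note is only an example tuning parameter.

```python
class ToyBackend:
    """Hypothetical backend recording the arguments it was built with."""

    def __init__(self, metric="minkowski", p=2, **params):
        self.metric = metric
        self.p = p
        self.params = params


class ToyApproximateBackend(ToyBackend):
    """Stand-in for an approximate method such as LSHForest."""


# Hypothetical registry of algorithm names.
BACKENDS = {"exact": ToyBackend, "approximate": ToyApproximateBackend}


def build_backend(algorithm, metric="minkowski", p=2, algorithm_params=None):
    """Construct the named backend, passing on metric and tuning params."""
    if algorithm not in BACKENDS:
        raise ValueError("unknown algorithm: %r" % (algorithm,))
    extra = dict(algorithm_params or {})
    return BACKENDS[algorithm](metric=metric, p=p, **extra)
```

For example, `build_backend("approximate", metric="cosine", algorithm_params={"n_estimators": 10})` would hand both the metric and the approximation settings to the backend, so neither is silently ignored.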

Provide methods kneighbors and radius_neighbors on BinaryTree classes. Also, ensure array of arrays returned from radius_neighbors.

coveralls · 2015-01-01T10:34:22Z

Coverage increased (+0.01%) when pulling 6bddd96 on jnothman:neighbors into d5c72f3 on scikit-learn:master.

jnothman · 2015-01-05T03:36:37Z

Btw, @jakevdp, a thought: would it be possible for BinaryTree to support incremental updates? In that case, supporting fit makes a lot of sense in anticipation of partial_fit.

jnothman · 2015-01-05T03:45:14Z

(But a quick search suggests that partial_fit isn't a real prospect for BinaryTree.)

jakevdp · 2015-01-05T08:48:53Z

No, incremental updates are not really possible with the memory model it uses. It would be more efficient just to re-build the tree on the new dataset.

nelson-liu · 2016-03-17T18:56:24Z

@jnothman what's the status on this PR?

jnothman · 2016-03-18T00:20:34Z

> @jnothman what's the status on this PR?

Good question. I think it got a bit disrupted by Real Life, my dissatisfaction with LSHForest performance in practice (particularly given that it only supports cosine sim), etc. I also was not especially happy with how the API looks after these changes. Perhaps it's worth completing soon, whether to enable LSHForest or other ANN implementations.

rth · 2017-04-04T20:55:03Z

@jnothman It is very tempting to try other ANN implementations (cf. https://github.com/erikbern/ann-benchmarks ) as backend of NearestNeighbors...
Are you still planning to work on this PR or could I give it a try?

jnothman · 2017-04-04T23:08:34Z

Go ahead, please @rth! Sorry I've not got to it.

jnothman mentioned this pull request Dec 1, 2014

[MRG+1] Locality Sensitive Hashing for approximate nearest neighbor search #3894

Closed

jnothman force-pushed the neighbors branch 3 times, most recently from 5e9d607 to b924ab9 Compare December 21, 2014 04:08

jnothman force-pushed the neighbors branch from a0ef83e to d5a91e8 Compare December 21, 2014 12:48

jnothman changed the title ~~[WIP] allow nearest neighbors algorithm to be an estimator~~ [MRG] allow nearest neighbors algorithm to be an estimator Dec 25, 2014

jnothman force-pushed the neighbors branch from 2e7d58a to eb1d4d6 Compare December 25, 2014 13:30

jnothman changed the title ~~[MRG] allow nearest neighbors algorithm to be an estimator~~ [WIP] allow nearest neighbors algorithm to be an estimator Dec 25, 2014

jnothman force-pushed the neighbors branch from eb1d4d6 to cdbb99f Compare January 1, 2015 10:20

jnothman added 5 commits January 1, 2015 21:24

FIX/MAINT towards custom estimator as nearest neighbors algorithm

debee30

Provide methods kneighbors and radius_neighbors on BinaryTree classes. Also, ensure array of arrays returned from radius_neighbors.

DOC/TST neighbor estimator allowed as algorithm

d7edda3

FIX doctests failing on Numpy 1.6

bba4ce7

DOC/TST custom algorithm in neighbors and dbscan

2e8c0b2

Correct sparse restriction

6bddd96

jnothman force-pushed the neighbors branch from cdbb99f to 6bddd96 Compare January 1, 2015 10:24

This was referenced Jun 6, 2017

Deprecate LSHForest #8996

Closed

[WIP] allow nearest neighbors algorithm to be an estimator (v2) #8999

Closed

TomDLT mentioned this pull request Jan 12, 2018

Toward a consistent API for NearestNeighbors & co #10463

Closed

TomDLT mentioned this pull request May 23, 2019

FEA Generalize the use of precomputed sparse distance matr… #10482

Merged

amueller added the Superseded (PR has been replaced by a newer PR) label Aug 5, 2019

thomasjpfan closed this in #10482 Sep 18, 2019


9 participants