[WIP] allow nearest neighbors algorithm to be an estimator#3922
[WIP] allow nearest neighbors algorithm to be an estimator#3922jnothman wants to merge 5 commits intoscikit-learn:masterfrom
Conversation
|
Overall +1 with the spirit of this refactoring. @jakevdp you might want to have a look at this.
Good catch, we got hit by this inconsistency when working on #3919 this WE. |
|
The reason |
|
One comment: I never really intended I'm not sure it's worth doing a backward-incompatible change to these classes. I know people who use their "raw" version whose code would break with these changes, and the benefit of making the change seems pretty scant. People who want to use the sklearn interface can just use |
|
I realise that I'm happy to make the changes backwards compatible, deprecating the old API. |
|
Ah. Thanks for the clarification of the consistency intended with |
|
My experience is that deprecation cycles only make us feel better, not the users. I've seen this in trying to support astroML, which relies on scikit-learn, and therefore must account for a wide, wide range of versions that users have on their system. Every deprecation cycle leads to a LOT of work for package maintainers whose packages rely on sklearn. I guess I'm just not convinced that the benefit of breaking the old behavior here outweighs the costs, deprecation cycle or no. How about this: we can let |
|
That's a fair statement. I guess I'm okay with not making changes to |
|
[Aside: Btw, the suggestion that the return value order for
At least renaming ] |
Well, I guess that it is on a case by case basis. But I do agree with you |
|
@maheshakya, we may still be able to offer a string-based initialisation of the LSHForest without any promises that it'll do as good a job as importing from a different module if necessary and constructing your own. In any case, I don't think there's any dispute over needing to support custom approximations to nearest neighbor search as widely as possible.
This is an important fact to keep in mind. Previously we have excused some API changes on the basis of we're still pre-v 1.0. I had thought it might also be okay to change |
That sounds fine to me. We could mark |
|
Okay, it sounds like we have a happy middle ground. I'll make it happen, On 4 December 2014 at 03:35, Jake Vanderplas notifications@github.com
|
|
One comment that just occurred to me: I've recently been looking through the stats literature, and noticed that people tend to refer to "k neighbors" vs "epsilon neighbors", rather than "radius neighbors". Has anyone else seen that? If that's a common nomenclature, then perhaps the new method should be named accordingly. |
|
and scipy describes them as ball queries. Still, the point is to provide On 5 December 2014 at 11:49, Jake Vanderplas notifications@github.com
|
|
Makes sense. |
I've seen that, but I find that it is way more obscure for someone |
|
+1 radius neighbors query is more explicit to me. We should put the alternative names (ball queries and epsilon queries in the docstring and the narrative doc for googlability though). |
5e9d607 to
b924ab9
Compare
|
It seems |
|
No, just change in |
|
I've got this mostly done... But given that it means the Instead of allowing |
Provide methods kneighbors and radius_neighbors on BinaryTree classes. Also, ensure array of arrays returned from radius_neighbors.
|
Btw, @jakevdp, a thought: would it be possible for |
|
(But a quick search suggests that |
|
No, incremental updates are not really possible with the memory model it uses. It would be more efficient just to re-build the tree on the new dataset. |
|
@jnothman what's the status on this PR? |
Good question. I think it got a bit disrupted by Real Life, my dissatisfaction with |
|
@jnothman It is very tempting to try other ANN implementations (cf. https://github.com/erikbern/ann-benchmarks ) as backend of NearestNeighbors... |
|
Go ahead, please @rth! Sorry I've not got to it. |
This works towards allowing an
LSHForest(see #3894) instance as a valid value for thealgorithmparameter to nearest neighbors classes (and theneighbors_algorithmparameter elsewhere) while minimising extra support code.The idea is that it should be possible to pass as
algorithmany estimator that hasfitand at least one ofkneighborsorradius_neighborsimplemented. Until this PR,KDTreeandBallTreeinstead take the data upon construction (lackinghavefit) andquery/query_radiuswith extremely similar interface, instead of using the public API names. Additionally,query_radiussends its return values in the reverse order for no apparent reason.This PR thus includes backwards-incompatible changes to theKDTreeandBallTreesemi-public APIs, to make them fit this mould, the main issue being not accepting data upon construction. Does this seem reasonable, or should we use deprecation with some type sniffing in the constructor?Another change is that previously
radius_neighborswould return a 2d array or a 1d array of arrays, depending on whether the number of neighbors within radius for all queries was equal, but only ifalgorithm='brute'. This PR changes it to alway return an array of arrays.TODO:
reprdifferences for arrays of arrays across numpy versionsradius_neighborsalgorithm(One annoyance of this approach is that the metric parameters to e.g.
KNeighborsClassifiergo ignored whenalgorithmis anLSHForestor similar.)