[MRG+1] Parallel radius neighbors by recamshak · Pull Request #10887 · scikit-learn/scikit-learn

recamshak · 2018-03-28T22:49:02Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This makes RadiusNeighborsMixin.radius_neighbors honor the n_jobs argument and split the queries among processors. It also makes query_radius GIL-free, so that the parallel version is actually faster than a single thread.

TODO:

fix the memory leak by making the indices and distances arrays own their data
fix the 32-bit Windows issue
write tests
run some benchmarks

Any other comments?

jnothman · 2018-03-28T22:52:13Z

We found in benchmarks that this did not improve runtime. Does your mileage vary?

recamshak · 2018-03-28T23:00:13Z

I haven't done a proper benchmark yet but I saw much better runtime on my laptop. I'll do a benchmark today and post the results here.

recamshak · 2018-03-29T05:24:56Z

I ran the following benchmark 10 times on a Google Cloud instance with 64 vCPUs:

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.externals.joblib import cpu_count
import time

d = make_blobs(100000, 100)[0]
nn = NearestNeighbors().fit(d)

for n_jobs in range(1, cpu_count() + 1):
    nn.n_jobs = n_jobs
    start = time.time()
    nn.radius_neighbors()
    end = time.time()
    print('{},{}'.format(n_jobs, end - start))

Although the scaling is not linear, it definitely runs faster.
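
For reference, the repeat-and-time procedure described above could be wrapped in a small helper along these lines. This is only a sketch: time_call and fake_query are hypothetical names, and fake_query is a stand-in workload, not the radius_neighbors call from the benchmark.

```python
import statistics
import time


def time_call(fn, repeats=10):
    """Return (best, median) wall-clock seconds over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


def fake_query(n_jobs):
    # Hypothetical stand-in workload. In the benchmark above this would be
    # nn.radius_neighbors() after setting nn.n_jobs = n_jobs.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    for n_jobs in (1, 2, 4):
        best, median = time_call(lambda: fake_query(n_jobs))
        print('{},{:.6f},{:.6f}'.format(n_jobs, best, median))
```

Reporting both the best and the median of the repeated runs makes it easier to see whether a speed-up is stable or just noise from a shared cloud instance.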

jnothman · 2018-03-29T06:37:42Z

Maybe I'm not recalling correctly, and we never got as far as completely removing the GIL. The results look great! The implementation looks good at a glance but will need a proper review. Please add a test that the results are unchanged.
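
A minimal sketch of the kind of check requested here, using only the standard library rather than scikit-learn. The function names are hypothetical and the brute-force search is an assumption made for illustration; it is not the tree-based code in this PR. Pure-Python threads will not run faster because of the GIL, so the point is only the comparison between split and unsplit queries.

```python
from concurrent.futures import ThreadPoolExecutor
import math
import random


def radius_neighbors_serial(points, queries, radius):
    """Brute force: for each query, indices of points within `radius`."""
    return [
        [i for i, p in enumerate(points) if math.dist(p, q) <= radius]
        for q in queries
    ]


def radius_neighbors_parallel(points, queries, radius, n_jobs):
    """Split the queries into at most n_jobs contiguous chunks, one per worker."""
    chunk = max(1, math.ceil(len(queries) / n_jobs))
    chunks = [queries[i:i + chunk] for i in range(0, len(queries), chunk)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() preserves chunk order, so concatenation restores query order.
        parts = pool.map(
            lambda qs: radius_neighbors_serial(points, qs, radius), chunks
        )
        return [row for part in parts for row in part]


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
expected = radius_neighbors_serial(points, points, 0.1)
for n_jobs in (1, 2, 3, 8):
    # The parallel result must match the serial one exactly.
    assert radius_neighbors_parallel(points, points, 0.1, n_jobs) == expected
```

Trying several n_jobs values, including ones that do not divide the number of queries evenly, exercises the chunk boundaries where ordering bugs would show up.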

recamshak · 2018-03-30T02:05:11Z

One thing that was not obvious to me, but made a huge difference: in _query_radius_single, doing this:

-                        raise ValueError("Fatal: count out of range. "
-                                         "This should never happen.")
+                        with gil:
+                            raise ValueError("Fatal: count out of range. "
+                                             "This should never happen.")

instead of this:

-                        raise ValueError("Fatal: count out of range. "
-                                         "This should never happen.")
+                        return -1

results in completely different scaling behavior.

In both cases the function is nogil, and I expected that in the first case the GIL would be acquired only if the with gil: statement is reached. But, as explained here, any function containing a with gil: statement has to acquire the GIL on return, regardless of whether the with gil: block was reached.

recamshak · 2018-04-02T07:44:31Z

@jnothman Thank you for your comments. I think this is ready for review.

jnothman · 2018-04-02T13:05:21Z

And I wish I were available to review it, but will not be for a few weeks at least...

TomDLT · 2018-04-05T22:21:12Z

Nice work! The code seems reasonable, though I am a novice in C memory allocation.

As a general comment, your code could benefit from more comments, especially near non-standard functions such as np.PyArray_SimpleNewFromData, np.PyArray_UpdateFlags, or memcpy.

For instance, you could add comments with the equivalent python expressions:
# equivalent to: distances[i] = np_dist_arr[:counts[i]].copy()
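
To illustrate why the .copy() in that equivalent expression matters, here is a stdlib-only analogy. It is not the PR's Cython code and does not use NumPy: the scratch buffer below merely plays the role of np_dist_arr, and all names are hypothetical.

```python
from array import array

# Scratch buffer reused for every query (plays the role of np_dist_arr).
scratch = array('d', [0.0] * 4)
views, copies = [], []

for query_result in ([1.0, 2.0], [9.0, 8.0, 7.0]):
    count = len(query_result)
    for j, value in enumerate(query_result):
        scratch[j] = value
    views.append(memoryview(scratch)[:count])  # no copy: aliases scratch
    copies.append(scratch[:count])             # slicing an array copies

print(views[0].tolist())   # first result, overwritten by the second query
print(copies[0].tolist())  # first result, preserved
```

Without the copy, every stored result would alias the same scratch memory and be overwritten by later queries.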

recamshak · 2018-04-09T07:38:51Z

@TomDLT Thank you for your comment. I added some comments as you suggested.

TomDLT

LGTM
Disclaimer: I am a novice in C memory allocation.

Could you add an entry in doc/whats_new/v0.20.rst?

recamshak · 2018-04-15T23:28:38Z

@TomDLT thank you for the review. I updated the release notes.

jnothman

LGTM! Nice work, thanks

jnothman · 2018-04-16T00:07:14Z

Merge when green

Joel Billaud added 2 commits March 29, 2018 07:28

parallel radius neighbors

d0d5e68

GIL-free query_radius for faster parallel radius_neighbors

780d4f0

Joel Billaud added 9 commits March 30, 2018 13:15

Fix windows 32bits error

5595197

Fix tests on Windows

4979f19

Make the numpy arrays own the data

ccd630f

Fix memory leak

e9a5fae

Fix inappropriate

68d1c0b

Make RadiusNeighborsClassifier and RadiusNeighborsRegressor parallel

b903b9b

Add test for parallel radius neighbors

f3e5c13

Update documentation

bd1ec1e

Fix count_only case

9646491

recamshak changed the title ~~[WIP] Parallel radius neighbors~~ [MRG] Parallel radius neighbors Apr 2, 2018

Add comments around non-standard functions and memory management

2644323

TomDLT approved these changes Apr 13, 2018

TomDLT changed the title ~~[MRG] Parallel radius neighbors~~ [MRG+1] Parallel radius neighbors Apr 13, 2018

Update release notes

b6ab19e

jnothman approved these changes Apr 16, 2018

jnothman merged commit 4335199 into scikit-learn:master Apr 16, 2018

rth mentioned this pull request May 3, 2018

NearestNeighbors radius_neighbors memory leaking #11051

Closed

TomDLT mentioned this pull request May 22, 2018

[MRG] Implementation of OPTICS #1984

Closed

rth mentioned this pull request Jun 4, 2018

Test too slow: test_mean_shift.py::test_parallel #11146

Closed

TomDLT mentioned this pull request Jun 11, 2018

FEA Generalize the use of precomputed sparse distance matr… #10482

Merged

jnothman mentioned this pull request May 11, 2020

[MRG+1] ENH: Parallelize kneighbors method with multithreading #4009

Merged

thomasjpfan mentioned this pull request Jul 26, 2022

Segmentation Fault on neighbors.Balltree query method with large 'low-entropy' datasets #7192

Closed
