
ENH Release GIL in DistanceMetric when validating data #17038

Merged
thomasjpfan merged 12 commits into scikit-learn:master from webber26232:haversineGIL
May 23, 2020

Conversation

@webber26232
Contributor

@webber26232 webber26232 commented Apr 25, 2020

Reference Issues/PRs

Fixes #16970

What does this implement/fix? Explain your changes.

  • Remove GIL acquisition in HaversineDistance (sklearn.neighbors._dist_metrics.HaversineDistance) to enable parallel computation.
  • Move input dimension validation into BinaryTree.

Any other comments?

Not sure if hard-coding the input dimension in BinaryTree is the best solution. Maybe every DistanceMetric needs a validate_input() function.

@rth
Member

rth commented Apr 25, 2020

Thanks @webber26232! Do you actually see a difference in performance with multithreading after this change? The with gil block should never be entered as long as size == 2, and otherwise it would just raise an error.

@webber26232
Contributor Author

Thanks @webber26232! Do you actually see a difference in performance with multithreading after this change? The with gil block should never be entered as long as size == 2, and otherwise it would just raise an error.

Yes, I saw a noticeable parallel performance improvement after moving the with gil: and raise ValueError statements out of the cdef functions. I'm not sure why Cython doesn't handle the GIL release appropriately here.

And it seems the validation has to be performed here, so we may need another solution.
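To make the structural change concrete, here is a minimal pure-Python sketch (function names are hypothetical; the real code is Cython) of the idea under discussion: validate the inputs once up front, so the per-pair distance function contains no branch that could need the GIL.

```python
import math

def haversine_core(x1, x2):
    # Hot-path distance computation; assumes inputs were already
    # validated to have exactly 2 coordinates. The Cython analog can
    # then be a pure nogil function with no conditional raise.
    sin_0 = math.sin(0.5 * (x1[0] - x2[0]))
    sin_1 = math.sin(0.5 * (x1[1] - x2[1]))
    return sin_0 * sin_0 + math.cos(x1[0]) * math.cos(x2[0]) * sin_1 * sin_1

def paired_haversine(X1, X2):
    # Validation hoisted out of the loop: runs once, not once per pair.
    for X in (X1, X2):
        if any(len(row) != 2 for row in X):
            raise ValueError("Haversine distance requires 2 features")
    return [haversine_core(a, b) for a, b in zip(X1, X2)]
```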

@rth
Member

rth commented Apr 25, 2020

Hmm, could you please provide an example script where you see the impact? If Cython indeed doesn't handle this well, it would have an impact elsewhere too, since we frequently use conditional with gil blocks in our Cython code.

@webber26232
Contributor Author

webber26232 commented Apr 26, 2020

Hmm, could you please provide an example script where you see the impact? If Cython indeed doesn't handle this well, it would have an impact elsewhere too, since we frequently use conditional with gil blocks in our Cython code.

You can try the scripts below.

I got the output on my machine:

Conditionally acquire GIL (n_jobs=1) takes 1.292s
Conditionally acquire GIL (n_jobs=2) takes 6.811s
Conditionally acquire GIL (n_jobs=3) takes 17.586s
Conditionally acquire GIL (n_jobs=4) takes 22.653s
No GIL (n_jobs=1) takes 0.433s
No GIL (n_jobs=2) takes 0.241s
No GIL (n_jobs=3) takes 0.201s
No GIL (n_jobs=4) takes 0.141s
# _gil_test.pyx

import numpy as np
cimport numpy as np
from libc.math cimport sqrt, cos, sin, asin

cdef np.float64_t haversine_gil(np.float64_t[:] x1, np.float64_t[:] x2,
                                np.int64_t size) nogil except -1:
    if size != 2:
        with gil:
            raise ValueError()
    cdef np.float64_t sin_0 = sin(0.5 * (x1[0] - x2[0]))
    cdef np.float64_t sin_1 = sin(0.5 * (x1[1] - x2[1]))
    return (sin_0 * sin_0 + cos(x1[0]) * cos(x2[0]) * sin_1 * sin_1)


cdef np.float64_t haversine_nogil(np.float64_t[:] x1, np.float64_t[:] x2,
                                  np.int64_t size) nogil except -1:
    if size != 2:
        pass
    cdef np.float64_t sin_0 = sin(0.5 * (x1[0] - x2[0]))
    cdef np.float64_t sin_1 = sin(0.5 * (x1[1] - x2[1]))
    return (sin_0 * sin_0 + cos(x1[0]) * cos(x2[0]) * sin_1 * sin_1)


cpdef np.float64_t test_gil(np.float64_t[:,:] x1,
                              np.float64_t[:,:] x2):
    cdef np.float64_t out_sum = 0
    cdef int i = 0
    cdef int size = x1.shape[0]
    with nogil:
        while i < size:
            out_sum += haversine_gil(x1[i], x2[i], 2)
            i += 1
    return out_sum

cpdef np.float64_t test_nogil(np.float64_t[:,:] x1,
                              np.float64_t[:,:] x2):
    cdef np.float64_t out_sum = 0
    cdef int i = 0
    cdef int size = x1.shape[0]
    with nogil:
        while i < size:
            out_sum += haversine_nogil(x1[i], x2[i], 2)
            i += 1
    return out_sum
# setup.py

from setuptools import setup
from Cython.Build import cythonize
import numpy as np
setup(
    ext_modules = cythonize("_gil_test.pyx"),
    include_dirs=[np.get_include()]
)
# gil_test.py

import time
import numpy as np
from sklearn.utils import gen_even_slices
from joblib import Parallel, delayed
from _gil_test import test_gil, test_nogil

n_samples = 10000000
n_features = 2
np.random.seed(0)
x1 = np.random.randn(n_samples, n_features).astype(np.float64)
x2 = np.random.randn(n_samples, n_features).astype(np.float64)

n_iter = 5

for n_jobs in range(1, 5):
    start = time.time()
    for i in range(n_iter):
        Parallel(n_jobs=n_jobs, backend='threading')(
           delayed(test_gil)(x1[s], x2[s])
               for s in gen_even_slices(x1.shape[0], n_jobs))
    print("Conditionally acquire GIL (n_jobs={}) takes "
          "{:.3f}s".format(n_jobs, (time.time() - start) / n_iter))

for n_jobs in range(1, 5):
    start = time.time()
    for i in range(n_iter):
        Parallel(n_jobs=n_jobs, backend='threading')(
            delayed(test_nogil)(x1[s], x2[s])
                for s in gen_even_slices(x1.shape[0], n_jobs))
    print("No GIL (n_jobs={}) takes "
          "{:.3f}s".format(n_jobs, (time.time() - start) / n_iter))

@rth
Member

rth commented Apr 26, 2020

Thanks, you are right, it's pretty bad. Using your scripts with

n_samples = 2000000
n_features = 200

I get

Conditionally acquire GIL (n_jobs=1) takes 0.171s
Conditionally acquire GIL (n_jobs=2) takes 0.775s
Conditionally acquire GIL (n_jobs=3) takes 2.186s
Conditionally acquire GIL (n_jobs=4) takes 3.021s
No GIL (n_jobs=1) takes 0.063s
No GIL (n_jobs=2) takes 0.035s
No GIL (n_jobs=3) takes 0.026s
No GIL (n_jobs=4) takes 0.021s

Opened cython/cython#3554. It would be nice to also fix the other affected metrics: SEuclideanDistance, WMinkowskiDistance, MahalanobisDistance.

@rth
Member

rth commented Apr 26, 2020

For CI, I think you also need to validate input in BinaryTree.query and query_radius, as well as in DistanceMetric.pairwise. Maybe we could add a private method DistanceMetric._validate_data, specialize it in child classes as needed, and use it wherever input validation is necessary. This would also be more consistent with SLEP 010.
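Such a hook might look roughly like this: a hypothetical pure-Python sketch of the proposed design (the real classes live in sklearn.neighbors._dist_metrics, are written in Cython, and differ in detail).

```python
import math

class DistanceMetric:
    # Hypothetical base-class hook: subclasses override _validate_data
    # with metric-specific checks.
    def _validate_data(self, X):
        pass  # default: accept any number of features

    def pairwise(self, X):
        self._validate_data(X)  # validate once, before the hot loop
        n = len(X)
        return [[self.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

class HaversineDistance(DistanceMetric):
    def _validate_data(self, X):
        # Specialized check: haversine is only defined on (lat, lon) pairs.
        if any(len(row) != 2 for row in X):
            raise ValueError("Haversine distance only valid in 2 dimensions")

    def dist(self, x1, x2):
        sin_0 = math.sin(0.5 * (x1[0] - x2[0]))
        sin_1 = math.sin(0.5 * (x1[1] - x2[1]))
        return sin_0 ** 2 + math.cos(x1[0]) * math.cos(x2[0]) * sin_1 ** 2
```

With this layout, the dist inner loop carries no error-handling branch, so its Cython counterpart can run entirely without the GIL.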

@webber26232
Contributor Author

For CI, I think you also need to validate input in BinaryTree.query and query_radius, as well as in DistanceMetric.pairwise. Maybe we could add a private method DistanceMetric._validate_data, specialize it in child classes as needed, and use it wherever input validation is necessary. This would also be more consistent with SLEP 010.

Yes, I think that's a good idea.

@rth
Member

rth commented May 2, 2020

Given that this PR is unlikely to make it into v0.23 (the release candidate is due in the following days), and that this is fixed in Cython 3.0a3, maybe we can just wait for the stable Cython release and rebuild the scikit-learn wheels with it.

@rth rth mentioned this pull request May 2, 2020
@rth
Member

rth commented May 2, 2020

That said, doing input validation in a DistanceMetric._validate_data method, instead of in the rdist or dist functions that are called in a loop over samples, would be an improvement in any case. So this PR would still be welcome.

@jnothman
Member

jnothman commented May 2, 2020 via email

@rth
Member

rth commented May 2, 2020

Is n_features=200 realistic for haversine?

No, my mistake. But the initial benchmarks in #17038 (comment) are still relevant (and correct). I was just double-checking.

@webber26232
Contributor Author

That said, doing input validation in a DistanceMetric._validate_data method, instead of in the rdist or dist functions that are called in a loop over samples, would be an improvement in any case. So this PR would still be welcome.

Sounds good, I'll start to work on it.

@webber26232 webber26232 changed the title Release GIL in HaversineDistance for BallTree parallalization Release GIL in DistanceMetric for parallelization when validating data is required May 4, 2020
Member

@rth rth left a comment


Thanks @webber26232 and @scoder! LGTM.

Member

@thomasjpfan thomasjpfan left a comment


Thank you @webber26232

This is a really nice improvement!

@thomasjpfan
Member

Please add an entry to the change log at doc/whats_new/v0.24.rst with the tag |Efficiency|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
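For reference, scikit-learn changelog entries follow a fixed reStructuredText pattern; a sketch of what such an entry might look like (the final wording is up to the author):

```rst
- |Efficiency| :class:`neighbors.DistanceMetric` now validates data up front
  so that distance computations can release the GIL, speeding up
  multithreaded use of :class:`neighbors.BallTree` and
  :class:`neighbors.KDTree`. :pr:`17038` by :user:`webber26232`.
```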

webber26232 and others added 4 commits May 10, 2020 22:38
Merge branch 'master' of github.com:webber26232/scikit-learn into haversineGIL
@webber26232
Contributor Author

@rth @thomasjpfan Thanks for your reviews. I updated the What's New entry. It should be good to merge.

@rth
Member

rth commented May 11, 2020

Thanks! WMinkowskiDistance etc. and _validate_data are actually private, so we can't refer to them in the changelog. Could you please reformulate it in terms of the effect this has on public classes, e.g. NearestNeighbors, KNeighborsRegressor, KNeighborsClassifier when used with metric=".." and n_jobs>1 (or in terms of BallTree/KDTree releasing the GIL with those metrics)?

@webber26232
Contributor Author

webber26232 commented May 11, 2020

Thanks! WMinkowskiDistance etc. and _validate_data are actually private, so we can't refer to them in the changelog. Could you please reformulate it in terms of the effect this has on public classes, e.g. NearestNeighbors, KNeighborsRegressor, KNeighborsClassifier when used with metric=".." and n_jobs>1 (or in terms of BallTree/KDTree releasing the GIL with those metrics)?

I also updated it to refer to neighbors.DistanceMetric.

@thomasjpfan thomasjpfan changed the title Release GIL in DistanceMetric for parallelization when validating data is required ENH Release GIL in DistanceMetric when validating data May 23, 2020
@thomasjpfan thomasjpfan merged commit fc474d6 into scikit-learn:master May 23, 2020
@thomasjpfan
Member

Thank you @webber26232!



Development

Successfully merging this pull request may close these issues.

BallTree haversine metric is not paralleled

5 participants