BallTree haversine metric is not paralleled #16970

@webber26232

Description

When using the ball_tree algorithm in KNeighborsRegressor, I found that most metrics benefit from parallel computing when the n_jobs argument is set at inference time, except for the haversine metric.

The table below shows inference time in seconds for different values of n_jobs with different metrics. haversine is the only metric whose inference time increases when utilizing parallel computing.

Inference time (seconds):

metric           n_jobs=1   n_jobs=2   n_jobs=3   n_jobs=4
braycurtis           24.9       14.7       10.9        9.5
canberra             46.9       27.6       20.8       18.4
chebyshev            11.1        6.9        5.7        4.9
cityblock            11.0        6.9        5.5        4.8
dice                537.9      341.9      277.2      248.5
euclidean            10.2        6.6        5.2        4.6
hamming             528.8      335.7      277.2      244.0
haversine            46.3      123.7      465.3      901.4
infinity             11.3        7.0        5.5        4.8
jaccard             542.1      345.5      275.7      243.5
kulsinski           546.0      343.5      278.4      253.5
l1                   11.0        7.0        5.4        4.8
l2                   10.2        6.3        5.1        4.5
manhattan            11.1        7.0        5.5        4.6
matching            524.8      324.3      263.3      238.5
minkowski            10.3        6.5        5.1        4.5
p                    10.1        6.4        5.1        4.7
rogerstanimoto      515.9      323.1      269.9      242.2
russellrao          524.6      330.4      265.1      242.8
sokalmichener       515.3      322.5      266.6      247.8
sokalsneath         530.7      335.6      276.4      241.4
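
For what it's worth, the results themselves look correct; only the timing is affected. A minimal sanity check (small n so it runs quickly) comparing BallTree haversine queries against brute-force sklearn.metrics.pairwise.haversine_distances:

```python
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.metrics.pairwise import haversine_distances

# Small sample of random [lat, lon] points in radians.
rng = np.random.RandomState(0)
X = np.column_stack([
    np.deg2rad(rng.uniform(-90, 90, size=200)),
    np.deg2rad(rng.uniform(-180, 180, size=200)),
])

tree = BallTree(X, metric='haversine')
dist, ind = tree.query(X[:10], k=5)

# Brute-force distances for comparison.
full = haversine_distances(X[:10], X)
brute = np.sort(full, axis=1)[:, :5]

assert np.allclose(dist, brute)            # same distances as brute force
assert np.all(ind[:, 0] == np.arange(10))  # each point's nearest neighbour is itself
```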

I ran the benchmark as follows:

import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

n_samples = 1000000
n_test_samples = 10000

# Random points on the sphere; haversine expects [lat, lon] in radians.
np.random.seed(0)
lat = np.deg2rad(np.random.uniform(-90, 90, size=n_samples))[:, np.newaxis]
lon = np.deg2rad(np.random.uniform(-180, 180, size=n_samples))[:, np.newaxis]

X = np.concatenate([lat, lon], axis=1)
y = np.random.randn(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=n_test_samples)

ball_tree_metrics = ['euclidean', 'l2', 'minkowski', 'p', 'manhattan',
                     'cityblock', 'l1', 'chebyshev', 'infinity', 'hamming',
                     'canberra', 'braycurtis', 'matching', 'jaccard', 'dice',
                     'kulsinski', 'rogerstanimoto', 'russellrao',
                     'sokalmichener', 'sokalsneath', 'haversine']

plt.figure(figsize=(8, 8))
for metric in ball_tree_metrics:
    plot_x = []
    plot_y = []
    for n_jobs in [1, 2, 3, 4]:
        try:
            knr = KNeighborsRegressor(n_neighbors=2000,
                                      weights='uniform',
                                      algorithm='ball_tree',
                                      metric=metric,
                                      n_jobs=n_jobs)
            knr.fit(X_tr, y_tr)
            # Time only the inference step.
            te_start = time.perf_counter()
            pred = knr.predict(X_te)
            plot_x.append(n_jobs)
            plot_y.append(time.perf_counter() - te_start)
        except ValueError:
            # Skip metrics that reject this data.
            pass
    plt.plot(plot_x, plot_y, label=metric)
plt.title('testing time')
plt.xlabel('n_jobs')
plt.xticks([1, 2, 3, 4])
plt.ylabel('seconds')
plt.legend()
plt.show()

My Environment:

OS: Windows Server 2016 Standard
CPU: Intel Xeon Gold 6140 (8 processors)
RAM: 32 GB
python: 3.8.2
numpy: 1.18.1
mkl: 2020.0
scipy: 1.3.2
scikit-learn: 0.22.2.post

Any thoughts? Is it possible that the GIL is not appropriately released in sklearn.neighbors._dist_metrics.HaversineDistance?
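
One way to test the GIL hypothesis independently of KNeighborsRegressor: query the tree from plain Python threads (my understanding is that n_jobs splits the query set over a joblib thread pool, which this approximates with concurrent.futures). If the metric holds the GIL during queries, wall time should not improve with more threads. A rough sketch:

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.neighbors import BallTree

# Random [lat, lon] points in radians.
rng = np.random.RandomState(0)
X = np.column_stack([
    np.deg2rad(rng.uniform(-90, 90, size=50000)),
    np.deg2rad(rng.uniform(-180, 180, size=50000)),
])
queries = X[:2000]

def timed_query(metric, n_threads):
    """Wall time to answer all queries, split across n_threads threads."""
    tree = BallTree(X, metric=metric)
    chunks = np.array_split(queries, n_threads)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(lambda q: tree.query(q, k=100), chunks))
    return time.perf_counter() - start

for metric in ['euclidean', 'haversine']:
    t1 = timed_query(metric, 1)
    t4 = timed_query(metric, 4)
    print(f'{metric}: 1 thread {t1:.2f}s, 4 threads {t4:.2f}s')
```

If euclidean shows a near-linear speedup while haversine does not, that would point at the GIL (or a lock) inside the haversine distance code rather than at the joblib dispatch overhead.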
