BallTree haversine metric is not paralleled #16970
When I use `algorithm='ball_tree'` in `KNeighborsRegressor`, I found that most metrics benefit from parallel computing by setting the `n_jobs` argument during inference, except for the haversine metric.
The table below shows the inference time in seconds for different values of `n_jobs` with each metric. haversine is the only one whose inference time increases when utilizing parallel computing.
| metric | n_jobs=1 | n_jobs=2 | n_jobs=3 | n_jobs=4 |
|---|---|---|---|---|
| braycurtis | 24.9 | 14.7 | 10.9 | 9.5 |
| canberra | 46.9 | 27.6 | 20.8 | 18.4 |
| chebyshev | 11.1 | 6.9 | 5.7 | 4.9 |
| cityblock | 11.0 | 6.9 | 5.5 | 4.8 |
| dice | 537.9 | 341.9 | 277.2 | 248.5 |
| euclidean | 10.2 | 6.6 | 5.2 | 4.6 |
| hamming | 528.8 | 335.7 | 277.2 | 244.0 |
| haversine | 46.3 | 123.7 | 465.3 | 901.4 |
| infinity | 11.3 | 7.0 | 5.5 | 4.8 |
| jaccard | 542.1 | 345.5 | 275.7 | 243.5 |
| kulsinski | 546.0 | 343.5 | 278.4 | 253.5 |
| l1 | 11.0 | 7.0 | 5.4 | 4.8 |
| l2 | 10.2 | 6.3 | 5.1 | 4.5 |
| manhattan | 11.1 | 7.0 | 5.5 | 4.6 |
| matching | 524.8 | 324.3 | 263.3 | 238.5 |
| minkowski | 10.3 | 6.5 | 5.1 | 4.5 |
| p | 10.1 | 6.4 | 5.1 | 4.7 |
| rogerstanimoto | 515.9 | 323.1 | 269.9 | 242.2 |
| russellrao | 524.6 | 330.4 | 265.1 | 242.8 |
| sokalmichener | 515.3 | 322.5 | 266.6 | 247.8 |
| sokalsneath | 530.7 | 335.6 | 276.4 | 241.4 |
I ran the test as follows:
```python
import time

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

n_samples = 1000000
n_test_samples = 10000

# random lat/lon pairs in radians, as expected by the haversine metric
np.random.seed(0)
lat = np.deg2rad(np.random.uniform(-90, 90, size=n_samples))[:, np.newaxis]
lon = np.deg2rad(np.random.uniform(-180, 180, size=n_samples))[:, np.newaxis]
X = np.concatenate([lat, lon], axis=1)
y = np.random.randn(n_samples)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=n_test_samples)

ball_tree_metrics = ['euclidean', 'l2', 'minkowski', 'p', 'manhattan',
                     'cityblock', 'l1', 'chebyshev', 'infinity', 'hamming',
                     'canberra', 'braycurtis', 'matching', 'jaccard', 'dice',
                     'kulsinski', 'rogerstanimoto', 'russellrao',
                     'sokalmichener', 'sokalsneath', 'haversine']

plt.figure(figsize=(8, 8))
for metric in ball_tree_metrics:
    plot_x = []
    plot_y = []
    for n_jobs in [1, 2, 3, 4]:
        try:
            knr = KNeighborsRegressor(n_neighbors=2000,
                                      weights='uniform',
                                      algorithm='ball_tree',
                                      metric=metric,
                                      n_jobs=n_jobs)
            knr.fit(X_tr, y_tr)
            te_start = time.time()
            pred = knr.predict(X_te)
            plot_x.append(n_jobs)
            plot_y.append(time.time() - te_start)
        except Exception:
            # some metrics may not accept this data; skip them
            pass
    plt.plot(plot_x, plot_y, label=metric)
plt.title('testing time')
plt.xlabel('n_jobs')
plt.xticks([1, 2, 3, 4])
plt.ylabel('time (s)')
plt.legend()
plt.show()
```
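A much smaller reproduction isolates the effect without the full benchmark. The sizes below are arbitrary (chosen only so it runs quickly) and only euclidean and haversine are compared; on an affected machine the haversine timings grow with `n_jobs` while the euclidean ones shrink:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# small, arbitrary sizes just to make the comparison quick
rng = np.random.RandomState(0)
n_samples, n_queries = 20000, 500
lat = np.deg2rad(rng.uniform(-90, 90, n_samples))
lon = np.deg2rad(rng.uniform(-180, 180, n_samples))
X = np.c_[lat, lon]
y = rng.randn(n_samples)

timings = {}
for metric in ('euclidean', 'haversine'):
    for n_jobs in (1, 2):
        knr = KNeighborsRegressor(n_neighbors=50, algorithm='ball_tree',
                                  metric=metric, n_jobs=n_jobs)
        knr.fit(X, y)
        start = time.time()
        knr.predict(X[:n_queries])
        timings[(metric, n_jobs)] = time.time() - start

print(timings)  # compare (metric, 1) vs (metric, 2) for each metric
```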
My Environment:
- OS: Windows Server 2016 Standard
- CPU: Intel Xeon Gold 6140 (8 processors)
- RAM: 32 GB
- python: 3.8.2
- numpy: 1.18.1
- mkl: 2020.0
- scipy: 1.3.2
- scikit-learn: 0.22.2.post
Any thoughts? Is it possible that the GIL is not appropriately released in `sklearn.neighbors._dist_metrics.HaversineDistance`?
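The GIL hypothesis can be probed directly with threads, without going through `KNeighborsRegressor`. The sketch below (sizes arbitrary, and only an approximation of what `n_jobs` does internally) splits the query set across two threads calling `BallTree.query`: if the query releases the GIL for a given metric, the two threads can overlap; if not, they serialize and contend:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.neighbors import BallTree

# small random lat/lon data in radians (arbitrary sizes)
rng = np.random.RandomState(0)
X = np.deg2rad(rng.uniform(-1, 1, size=(20000, 2)))
queries = np.deg2rad(rng.uniform(-1, 1, size=(1000, 2)))

def threaded_query_time(metric, n_threads=2):
    """Wall time for querying the tree with the queries split across threads."""
    tree = BallTree(X, metric=metric)
    chunks = np.array_split(queries, n_threads)
    start = time.time()
    with ThreadPoolExecutor(n_threads) as ex:
        list(ex.map(lambda c: tree.query(c, k=10), chunks))
    return time.time() - start

times = {m: threaded_query_time(m) for m in ('euclidean', 'haversine')}
print(times)
```

If haversine is markedly slower here than euclidean on comparable data, that would be consistent with the GIL (or a per-call lock) being held during the haversine distance computation.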