Heuristics for algorithm='auto' in NearestNeighbors #8213
Description
When running NearestNeighbor on a dense array with n_samples ~ 5000, n_features ~ 100 and n_neighbors=1, the algorithm='auto' option selects kd_tree algorithm, which is in this case the slowest algorithm of those available.
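The selection can be checked directly by fitting with `algorithm='auto'` and inspecting the private `_fit_method` attribute (private, so it may change between versions); a minimal sketch of that check:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Dense data with the shape described above: n_samples ~ 5000, n_features ~ 100.
X = np.random.RandomState(0).randn(5000, 100)

mod = NearestNeighbors(n_neighbors=1, algorithm='auto').fit(X)
# The issue reports 'kd_tree' being selected on 0.18.1; newer versions may differ.
print(mod._fit_method)
```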
Steps/Code to Reproduce
The benchmark script nn_benchmark.py can be found below,

```python
import os
from time import time

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

np.random.seed(999999)

n_features = 150
algorithm = os.environ['NN_ALGORITHM']


def _create_X(n_samples):
    X = np.random.randn(n_samples, n_features)
    normalize(X, norm='l2', copy=False)
    return X


for n_train, n_test in [(1000, 10000),
                        (4000, 4000),
                        (10000, 2000)]:
    X_train = _create_X(n_train)
    X_test = _create_X(n_test)

    mod = NearestNeighbors(n_neighbors=1,
                           algorithm=algorithm,
                           n_jobs=1)
    t0 = time()
    mod.fit(X_train)
    t1 = time()
    mod.kneighbors(X_test, return_distance=False)
    t2 = time()

    if algorithm == 'auto':
        print("Actual method:", mod._fit_method)
    print('Training set: n_train={}, n_features={}'.format(n_train, n_features))
    print('Test set: n_test={}, t_train={:.2f} s, t_test={:.2f} s'.format(
        n_test, t1 - t0, t2 - t1))
```

which can be run (on Linux) with,

```
NN_ALGORITHM=auto /usr/bin/time -v python nn_benchmark.py
```

where `/usr/bin/time -v` is used to track the peak RAM usage via the reported "Maximum resident set size (kbytes)" field.
Expected Results
One could expect that the nearest neighbor algorithm selection heuristics would pick a reasonable choice for this fairly standard dataset. In addition, one could also have expected for ball_tree to outperform brute force search.
Actual Results
The results of this benchmark are given below,

| Time (s) | auto | kd_tree | ball_tree | brute |
|---|---|---|---|---|
| n_train=1000, n_test=10000 | 3.9 | 3.9 | 2.69 | 0.25 |
| n_train=4000, n_test=4000 | 7.5 | 7.5 | 5.55 | 0.34 |
| n_train=10000, n_test=1000 | 9.6 | 9.5 | 7.48 | 0.41 |

The automatically chosen kd_tree algorithm is consistently 10-20 times slower than brute force search.
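Until the heuristic is improved, one workaround (my own, not an official recommendation) is to bypass `auto` and request brute force explicitly; a minimal sketch on data shaped like the benchmark above:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X_train = normalize(rng.randn(4000, 150))
X_test = normalize(rng.randn(4000, 150))

# Explicitly request brute force instead of relying on algorithm='auto',
# which on 0.18.1 selects kd_tree for this dense, high-dimensional data.
mod = NearestNeighbors(n_neighbors=1, algorithm='brute')
mod.fit(X_train)
ind = mod.kneighbors(X_test, return_distance=False)
print(ind.shape)  # (4000, 1)
```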
The memory usage is another issue: brute force search uses up to 5x more RAM,

| | auto | kd_tree | ball_tree | brute |
|---|---|---|---|---|
| Peak RAM (MB) | 82 | 82 | 81 | 398 |
but this could be addressed by chunking the pairwise_distances calculations (issue #7287), implemented in PR #7979.
Am I missing something?
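The brute-force memory overhead comes from materialising the full (n_test, n_train) distance matrix at once; pending a chunked implementation in scikit-learn itself, a caller can bound the peak by splitting the query set manually. A minimal sketch of that workaround (my own, not scikit-learn API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X_train = normalize(rng.randn(10000, 150))
X_test = normalize(rng.randn(2000, 150))

mod = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(X_train)

# Query in chunks so only a (chunk_size, n_train) distance block is
# materialised at a time, instead of the full (n_test, n_train) matrix.
chunks = [mod.kneighbors(chunk, return_distance=False)
          for chunk in np.array_split(X_test, 10)]
ind = np.vstack(chunks)
print(ind.shape)  # (2000, 1)
```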
Versions
Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1