
Heuristics for algorithm='auto' in NearestNeighbors #8213

@rth

Description


When running NearestNeighbors on a dense array with n_samples ~ 5000, n_features ~ 150 and n_neighbors=1, the algorithm='auto' option selects the kd_tree algorithm, which is in this case the slowest of the available algorithms.

Steps/Code to Reproduce

The benchmark script nn_benchmark.py can be found below,

import os
from time import time

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

np.random.seed(999999)
n_features = 150
algorithm = os.environ['NN_ALGORITHM']

def _create_X(n_samples):
    X = np.random.randn(n_samples, n_features)
    normalize(X, norm='l2', copy=False)
    return X

for n_train, n_test in [(1000, 10000),
                        (4000, 4000),
                        (10000, 2000)]:
    X_train = _create_X(n_train)
    X_test = _create_X(n_test)
    mod = NearestNeighbors(n_neighbors=1,
                           algorithm=algorithm,
                           n_jobs=1)

    t0 = time()
    mod.fit(X_train)
    t1 = time()
    mod.kneighbors(X_test, return_distance=False)
    t2 = time()
    if algorithm == 'auto':
        print("Actual method:", mod._fit_method)
    print('Training set: n_train={}, n_features={}'.format(n_train, n_features))
    print('Test set: n_test={}, t_train= {:.2f} s, t_test = {:.2f}s'.format(n_test, t1-t0, t2-t1))

which can be run (on Linux) with,

NN_ALGORITHM=auto /usr/bin/time -v python nn_benchmark.py

where /usr/bin/time -v is used to track the peak RAM usage in the returned "Maximum resident set size (kbytes)" field.
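For reference, the same peak-RSS figure can also be read from within Python via the stdlib `resource` module (Unix only; a cross-check for the `/usr/bin/time -v` output, not part of the benchmark script above):

```python
import resource

# Peak resident set size of the current process. On Linux ru_maxrss is
# reported in kilobytes (on macOS it is in bytes).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Maximum resident set size: {} kB".format(peak_kb))
```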

Expected Results

One would expect the nearest neighbor algorithm selection heuristic to pick a reasonable choice for this fairly standard dataset. In addition, one might also have expected ball_tree to outperform brute force search.

Actual Results

The results of this benchmark are given below,

| Time (s)                   | auto | kd_tree | ball_tree | brute |
|----------------------------|------|---------|-----------|-------|
| n_train=1000, n_test=10000 | 3.9  | 3.9     | 2.69      | 0.25  |
| n_train=4000, n_test=4000  | 7.5  | 7.5     | 5.55      | 0.34  |
| n_train=10000, n_test=2000 | 9.6  | 9.5     | 7.48      | 0.41  |

the automatically chosen kd_tree algorithm is consistently 10-20 times slower than the brute force search.
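Until the heuristic is improved, the slowdown can be avoided by requesting brute force explicitly instead of relying on 'auto'. A minimal sketch (the `_fit_method` attribute is private API, used here only to confirm which algorithm was selected):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.randn(1000, 150)
X_test = rng.randn(100, 150)

# Bypass the 'auto' heuristic and force brute force search.
mod = NearestNeighbors(n_neighbors=1, algorithm='brute')
mod.fit(X_train)
ind = mod.kneighbors(X_test, return_distance=False)
print(mod._fit_method)
```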

Memory usage is another issue: brute force search uses up to 5x more RAM,

|               | auto | kd_tree | ball_tree | brute |
|---------------|------|---------|-----------|-------|
| Peak RAM (MB) | 82   | 82      | 81        | 398   |

but this could be addressed by chunking the pairwise_distances calculations (issue #7287), implemented in PR #7979.
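The chunking idea can be sketched as follows: compute pairwise distances one block of test samples at a time, so that only a (chunk_size, n_train) distance block is ever in memory. This is a hypothetical 1-NN helper for illustration, not the actual PR #7979 implementation:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def kneighbors_chunked(X_train, X_test, chunk_size=1000):
    """Brute-force 1-NN, computing distances chunk by chunk.

    Peak memory is O(chunk_size * n_train) instead of O(n_test * n_train).
    """
    indices = np.empty(X_test.shape[0], dtype=np.intp)
    for start in range(0, X_test.shape[0], chunk_size):
        stop = start + chunk_size
        # Distance block for this chunk of test points only.
        D = pairwise_distances(X_test[start:stop], X_train)
        indices[start:stop] = D.argmin(axis=1)
    return indices
```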

Am I missing something?

Versions

Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1
