-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
Batch pairwise_distances in neighbors to reduce memory consumption #7287
Description
The neighbors subpackage uses sklearn.metrics.pairwise.pairwise_distances in a number of cases where KDTree and BallTree cannot be applied. it calculates distances between all pairs but then retains/outputs only information about the nearest neighbours. Rather than requiring intermediate memory for storing the complete O(n^2) pairs' distances, we can limit it to a fixed maximum intermediate storage requirement. 64MiB works well in the similar case in silhouette_score (proposed in #7177), and we use a similar approach in pairwise_distances_argmin_min.
The general solution would add a block_size parameter or similar to pairwise_distances which turns it into a generator of blocks; or a pairwise_distances_blockwise if we'd rather a separate function. This would then be used by neighbors, silhouette_score, pairwise_distances_argmin_min, collecting sufficient statistics from each block and releasing other intermediate scores.
I think block_size specified in MiB or MB is more useful than specifying slice size directly.
pairwise_distances_argmin_min iterates over square blocks of the pairwise distance matrix. I think it may be more practical to iterate over row slices of the output, although this means there may be a minimum valid block size.