Skip to content

Batch pairwise_distances in neighbors to reduce memory consumption #7287

@jnothman

Description

@jnothman

The neighbors subpackage uses sklearn.metrics.pairwise.pairwise_distances in a number of cases where KDTree and BallTree cannot be applied. it calculates distances between all pairs but then retains/outputs only information about the nearest neighbours. Rather than requiring intermediate memory for storing the complete O(n^2) pairs' distances, we can limit it to a fixed maximum intermediate storage requirement. 64MiB works well in the similar case in silhouette_score (proposed in #7177), and we use a similar approach in pairwise_distances_argmin_min.

The general solution would add a block_size parameter or similar to pairwise_distances which turns it into a generator of blocks; or a pairwise_distances_blockwise if we'd rather a separate function. This would then be used by neighbors, silhouette_score, pairwise_distances_argmin_min, collecting sufficient statistics from each block and releasing other intermediate scores.

I think block_size specified in MiB or MB is more useful than specifying slice size directly.

pairwise_distances_argmin_min iterates over square blocks of the pairwise distance matrix. I think it may be more practical to iterate over row slices of the output, although this means there may be a minimum valid block size.

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementLarge ScaleModerateAnything that requires some knowledge of conventions and best practices

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions