[MRG+2] ENH use pairwise_distances_chunked in silhouette_score #11135
jnothman merged 7 commits into scikit-learn:master
Conversation
Note that I believe no tests are needed: the functionality of ...
doc/whats_new/v0.20.rst
Outdated
- :func:`metrics.cluster.silhouette_score` and
  :func:`metrics.cluster.silhouette_samples` are more memory efficient,
  causing them to run faster. :issue:`11135` by `Joel Nothman`_.
As previously discussed, it would run faster only assuming that swapping occurs, meaning that the full distance matrix takes more than the RAM size but less than RAM + swap disk size (typical sizes here), which is IMO too strong an assumption.
Given the reported freezes...?
It's definitely significantly faster in light of the benchmarks below - but I think because of a better implementation, not just due to chunking.
Maybe "are more memory efficient, and faster"? or "are significantly more ..."
# accumulate distances from each sample to each cluster
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
np.add.at(clust_dists.T, labels, D_chunk.T)
There are some reports about np.add.at being slow (numpy/numpy#5922, numpy/numpy#11156) and it's currently almost unused in the code base; do you have a different experience with it?
I guess there are not many alternatives, since clust_dists.T[labels] += D_chunk.T and np.bincount (as suggested in the first issue) wouldn't work here, as far as I understand?
I'm unaware of those reports. I'll have a quick look.
clust_dists.T[labels] += D_chunk.T will certainly not work if labels has duplicates, which it does. No, bincount won't work either, given that the output is 2d.
We could do this in a for loop. Is it worth trying that, or should we assume it's not the bottleneck?
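To illustrate the duplicate-label issue (a tiny sketch of my own, not from the PR; the toy arrays are made up):

import numpy as np

labels = np.array([0, 0, 1])        # label 0 occurs twice
D_chunk = np.array([[1., 2., 3.]])  # one chunk row: distances to 3 samples

# in-place fancy indexing: duplicate indices collide, only the last write survives
clust_dists = np.zeros((1, 2))
clust_dists.T[labels] += D_chunk.T
print(clust_dists)  # [[2. 3.]] -- the 1. contribution for label 0 is lost

# np.add.at performs unbuffered accumulation, so duplicate indices are summed
clust_dists = np.zeros((1, 2))
np.add.at(clust_dists.T, labels, D_chunk.T)
print(clust_dists)  # [[3. 3.]]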
Yes, I'm getting better mileage with a for loop (assuming I've correctly implemented it)
import numpy as np

n_samples = 10000
n_samples_chunk = 100
n_clusters = 10

D_chunk = np.random.rand(n_samples_chunk, n_samples)
labels = np.random.randint(n_clusters, size=n_samples)
label_freqs = np.bincount(labels)

clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit np.add.at(clust_dists.T, labels, D_chunk.T)

clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit for i in range(len(D_chunk)): clust_dists[i] += np.bincount(labels, weights=D_chunk[i], minlength=len(label_freqs))
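For what it's worth, a quick sanity check (my own addition, not part of the benchmark above) that the bincount loop accumulates the same values as np.add.at:

import numpy as np

rng = np.random.RandomState(0)
D_chunk = rng.rand(100, 10000)
labels = rng.randint(10, size=10000)
label_freqs = np.bincount(labels)

out_add_at = np.zeros((len(D_chunk), len(label_freqs)), dtype=D_chunk.dtype)
np.add.at(out_add_at.T, labels, D_chunk.T)

out_loop = np.zeros_like(out_add_at)
for i in range(len(D_chunk)):
    out_loop[i] += np.bincount(labels, weights=D_chunk[i],
                               minlength=len(label_freqs))

print(np.allclose(out_add_at, out_loop))  # True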
I made sure that there is a test with data from the original publication replicated perfectly. Thanks for the review!
Adapting earlier benchmarks from #7177 (comment), where ...:

import numpy as np
from sklearn.metrics.cluster import silhouette_score
from neurtu import delayed, Benchmark

n_features = 100

def runs():
    for n_samples in [100, 1000, 5000, 10000, 20000, 40000]:
        for n_labels in [4, 10, 100]:
            tags = {'n_samples': n_samples, 'n_labels': n_labels}
            rng = np.random.RandomState(42)
            X = rng.rand(n_samples, n_features)
            y = rng.randint(n_labels, size=n_samples)
            yield delayed(silhouette_score, tags=tags)(X, y)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_labels', 'n_samples'], inplace=True)
df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})
df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)

I get [results table omitted], which corresponds to a 2.5x speedup for large ... (Ignore cells with zero peak_memory - it's a measurement artifact.)
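For context, a minimal sketch of the chunked-reduction idea (my own illustration of how pairwise_distances_chunked with a reduce_func can accumulate per-cluster distance sums; the helper cluster_distance_sums is hypothetical and this is not the PR's exact code):

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def cluster_distance_sums(X, labels, n_clusters, working_memory=64):
    # sum of distances from every sample to each cluster, computed chunk by chunk
    def reduce_func(D_chunk, start):
        clust_dists = np.zeros((len(D_chunk), n_clusters), dtype=D_chunk.dtype)
        for i in range(len(D_chunk)):
            clust_dists[i] = np.bincount(labels, weights=D_chunk[i],
                                         minlength=n_clusters)
        return clust_dists

    chunks = pairwise_distances_chunked(X, reduce_func=reduce_func,
                                        working_memory=working_memory)
    return np.vstack(list(chunks))  # shape (n_samples, n_clusters)

rng = np.random.RandomState(0)
X = rng.rand(500, 20)
labels = rng.randint(3, size=500)
print(cluster_distance_sums(X, labels, n_clusters=3).shape)  # (500, 3)

Each chunk's distance block is reduced immediately, so only an (n_chunk_samples, n_clusters) array is retained per chunk instead of the full (n_samples, n_samples) distance matrix.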
rth left a comment:
LGTM, really great work!
P.S.: could you please partially revert the what's new wording changes from #11135 (comment) - sorry about that.
BTW, if I read #7177 (comment) right, looking at the benchmarks above, this PR should result in faster calculations than #7177 too.
The speed improvement since #7177 might be due to larger working memory, or due to np.add.at being removed...
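(Side note, hedged: the chunk size used by pairwise_distances_chunked is governed by scikit-learn's working_memory setting, adjustable via the global config introduced in 0.20; the data below is made up.)

import numpy as np
import sklearn
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
X = rng.rand(20000, 100)
y = rng.randint(10, size=20000)

# raise the per-chunk working memory budget (the default is 1024 MiB)
with sklearn.config_context(working_memory=2048):
    score = silhouette_score(X, y)
print(round(score, 4))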
Thanks for the reviews!
Used to be part of #10280 and elsewhere
Closes #7175
Closes #10279
Closes #7177