[MRG+1] Adding support for sample weights to K-Means #10933
TomDLT merged 12 commits into scikit-learn:master from the weighted_k_means branch
Conversation
This is exciting! Before a proper review, can you please benchmark the effect on runtime when passed no weights?

Yes, will do.
TomDLT left a comment

Just a minor comment. I need to do a second pass, but this is already very nice.
sklearn/cluster/k_means_.py (Outdated)

```python
sample_weights = _check_sample_weights(X, sample_weights)

# verify that the number of samples is equal to the number of weights
if _num_samples(X) != len(sample_weights):
```
You can also use sklearn.utils.validation.check_consistent_length. Also, this check does not seem to be tested.

I could use sklearn.utils.validation.check_consistent_length, but then the error message would be very generic. Would you still prefer that? I'll add tests for the check method.

Right, I have no strong feelings about it; you can keep it this way.
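For context, a minimal sketch of the two validation styles under discussion (the array shapes are illustrative; the custom message mirrors the one adopted later in this PR):

```python
import numpy as np
from sklearn.utils.validation import check_consistent_length

X = np.zeros((10, 3))
sample_weights = np.ones(9)

# Generic error: "Found input variables with inconsistent numbers of samples: ..."
# check_consistent_length(X, sample_weights)

# Specific error message, as preferred in this PR:
if X.shape[0] != len(sample_weights):
    raise ValueError("n_samples=%d should be == len(sample_weights)=%d"
                     % (X.shape[0], len(sample_weights)))
```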
Here's a first idea of the change in performance. I have this in benchmark_k_means.py:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42)
```

On branch weighted_k_means:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.8 msec per loop
```

On branch master:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 36.4 msec per loop
```

As expected, adding this feature slows down fitting very slightly when no sample weights are used.
```cython
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)

for i in range(n_samples):
```
Don't know if this is slow since Cython is compiled, yet you may try whether np.add.at is faster:

```python
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
np.add.at(weights_sum_in_cluster, labels, sample_weights)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
```

Unfortunately, that is considerably slower:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 57.5 msec per loop
```
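As an aside (not part of this PR, and not benchmarked here), np.bincount is the usual fast NumPy idiom for this kind of per-label accumulation and tends to outperform np.add.at; a minimal sketch with made-up data:

```python
import numpy as np

n_clusters = 3
labels = np.array([0, 1, 1, 2, 0])
sample_weights = np.array([1.0, 2.0, 0.5, 1.5, 1.0])

# Accumulate each sample's weight into its cluster's bin, in C.
weights_sum_in_cluster = np.bincount(labels, weights=sample_weights,
                                     minlength=n_clusters)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
```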
Given that the performance drop isn't all that noticeable, I'd probably leave it as is. Any additional improvement for the case where sample_weights=None …
I did a bit of line profiling on _kmeans_single_elkan:

```python
checked_sample_weights = _check_sample_weights(X, sample_weights)
centers, labels, n_iter = k_means_elkan(X, checked_sample_weights,
                                        n_clusters, centers, tol=tol,
                                        max_iter=max_iter, verbose=verbose)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= np.expand_dims(checked_sample_weights, axis=-1)
inertia = np.sum(sq_distances, dtype=np.float64)
```

I did not look at …
Good catch. After implementing your suggestion I cannot make out any statistically significant difference. For …
sklearn/cluster/k_means_.py (Outdated)

```diff
-inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
+sq_distances = (X - centers[labels]) ** 2
+if sample_weights is not None:
+    sq_distances *= checked_sample_weights[:, np.newaxis]
```
Actually, on second thought, the operation is expensive because of useless broadcasting. Instead of n_samples * n_features multiplications, we can do only n_samples multiplications with:

```python
if sample_weights is not None:
    sq_distances = np.sum(sq_distances, axis=1, dtype=np.float64)
    sq_distances *= checked_sample_weights
```

You are right that this makes the computation faster if sample_weights is not None. It obviously doesn't make a difference when no sample weights are passed.

I should clarify that when I said above that "I cannot make out any statistically significant difference" I was referring to the difference between the weighted_k_means and master branches, not to before and after the change in _kmeans_single_elkan.
Sure, but removing the unnecessary broadcasting should speed up the sample_weights is not None case.

Definitely, I'm testing it right now.
Okay, new benchmarks, adding:

```python
w = np.ones(X.shape[0])
```

On branch weighted_k_means:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop
```

On branch master:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop
```

All subject to ~1 msec variation.
In case it gets lost in the collapsed comments above, here are again the current benchmarks. In benchmark_k_means.py:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
w = np.ones(X.shape[0])
km = KMeans(n_clusters=3, random_state=42)
```

On branch weighted_k_means:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop
```

On branch master:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop
```

All subject to ~1 msec variation between benchmarks.
jnothman left a comment

Nice work! I hope I'm able to review this soon, but the next couple of weeks are full up!
```python
return centers[sort_index, :], sorted_labels


def test_k_means_weighted_vs_repeated():
```
Can we please parametrize or loop these tests to avoid repeating code for KMeans and MBKMeans?
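A minimal sketch of such a parametrized test (the fixture data and assertions are illustrative, and the sample_weights keyword follows this PR's spelling before the later rename to sample_weight):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_weighted_vs_repeated(Estimator):
    # Fitting with integer weights should match fitting on repeated samples.
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
    sample_weights = np.random.RandomState(42).randint(1, 5, size=X.shape[0])
    X_repeat = np.repeat(X, sample_weights, axis=0)

    est_weighted = Estimator(n_clusters=3, random_state=42).fit(
        X, sample_weights=sample_weights)
    est_repeated = Estimator(n_clusters=3, random_state=42).fit(X_repeat)

    # Expand the weighted labels so both labelings cover the repeated samples.
    labels_weighted = np.repeat(est_weighted.labels_, sample_weights)
    assert v_measure_score(labels_weighted,
                           est_repeated.labels_) == pytest.approx(1.0)
```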
```python
    est_weighted.cluster_centers_, np.repeat(est_weighted.labels_,
                                             sample_weights))
assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
if not isinstance(estimator, MiniBatchKMeans):
```
In case you are wondering why this test is not done for MiniBatchKMeans: that's because it fails. This is not due to the changes in this PR, however. I did a quick test on the master branch, comparing the cluster_centers_ of KMeans and MiniBatchKMeans, and they are not the same.
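For reference, a quick way to reproduce that observation (illustrative script; the exact centers depend on the data, seeds, and versions used):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, random_state=42).fit(X)

# Mini-batch k-means optimizes the same objective stochastically, so its
# centers generally differ slightly from full-batch k-means.
print(np.sort(km.cluster_centers_, axis=0))
print(np.sort(mbk.cluster_centers_, axis=0))
```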
I have no idea why the latest commit fails the lgtm check. I only updated the tests and they all check out.

Looks like a temporary outage to me.

Hi guys, sorry about this - it seems that this particular base commit couldn't be built due to an outdated pip cache. We're working on fixing the "retry analysis" link to provide a way to overcome this in the future.

Hi guys, has anyone had a chance to review this yet?
jnothman left a comment

Basically cosmetic. Good work!
sklearn/cluster/_k_means.pyx (Outdated)

```cython
@cython.wraparound(False)
@cython.cdivision(True)
cpdef DOUBLE _assign_labels_array(np.ndarray[floating, ndim=2] X,
                                  np.ndarray[floating, ndim=1] sample_weights,
```
Please use the singular sample_weight for local and global consistency.

So just to confirm, sample_weight instead of sample_weights throughout? I had been consistently using the plural everywhere.

Yes, I think so. That is consistent with the rest of the library.

Alright, will do. Although personally, the singular makes me think that a scalar is requested.

It's quite common to use singular nomenclature for vectors... though we are very inconsistent!
sklearn/cluster/_k_means.pyx (Outdated)

```diff
 @cython.wraparound(False)
 @cython.cdivision(True)
-def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
+def _mini_batch_update_csr(X, np.ndarray[floating, ndim=1] sample_weights,
```

(argh! Plurals! Do what you like here, I suppose)
sklearn/cluster/_k_means.pyx (Outdated)

```cython
dtype = np.float32 if floating is float else np.float64
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
```

weight_in_cluster would be a sufficient name.
sklearn/cluster/_k_means.pyx (Outdated)

```cython
@cython.wraparound(False)
@cython.cdivision(True)
def _centers_dense(np.ndarray[floating, ndim=2] X,
                   np.ndarray[floating, ndim=1] sample_weights,
```

In general, though, we should prefer the singular.
sklearn/cluster/_k_means_elkan.pyx (Outdated)

```diff
-def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_, int n_clusters,
+def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_,
+                  np.ndarray[floating, ndim=1, mode='c'] sample_weights,
```
sklearn/cluster/k_means_.py (Outdated)

```python
    return np.ones(n_samples, dtype=X.dtype)
else:
    # verify that the number of samples is equal to the number of weights
    if n_samples != len(sample_weights):
```

We have a helper called check_consistent_length.

Tom made the same remark earlier, and I replied that the error message would then be very generic. I'd prefer the more specific error message, but I'm happy to change it.
sklearn/cluster/k_means_.py (Outdated)

```diff
     return self

-def fit_predict(self, X, y=None):
+def fit_predict(self, X, y=None, sample_weights=None):
```
```python
    est_1.cluster_centers_, est_1.labels_)
centers_2, labels_2 = _sort_cluster_centers_and_labels(
    est_2.cluster_centers_, est_2.labels_)
assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
```

You should be able to get the perfect v_measure even without sorting. This makes for a red herring when reading the code.

You're right; the sorting of the labels predates my discovery of v_measure_score and is now redundant.
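Indeed, v_measure_score is invariant to permutations of the label ids, so the sort adds nothing; a quick illustration with made-up labelings:

```python
from sklearn.metrics import v_measure_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same partition, permuted label ids

print(v_measure_score(labels_a, labels_b))  # 1.0
```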
```python
sample_weights = None
checked_sample_weights = _check_sample_weights(X, sample_weights)
assert_equal(_num_samples(X), _num_samples(checked_sample_weights))
assert_equal(X.dtype, checked_sample_weights.dtype)
```

If you're doing these checks, why not also check that the output sums to 1?
```python
# repetition of the sample
sample_weights = np.random.randint(1, 5, size=n_samples)
X_repeat = np.repeat(X, sample_weights, axis=0)
for estimator in [KMeans(n_clusters=n_clusters, random_state=42),
```

Need to test the different init approaches as well.
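A sketch of how the init schemes could be covered too (KMeans accepts init='k-means++', 'random', or an array of initial centers; the test body and the sample_weights keyword from this PR are illustrative):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

@pytest.mark.parametrize("init", ["k-means++", "random"])
def test_weighted_kmeans_init(init):
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
    rng = np.random.RandomState(42)
    sample_weights = rng.randint(1, 5, size=X.shape[0])

    km = KMeans(n_clusters=3, init=init, random_state=42)
    km.fit(X, sample_weights=sample_weights)
    # Smoke check: the fitted centers have the expected shape.
    assert km.cluster_centers_.shape == (3, X.shape[1])
```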
```python
# verify that the number of samples is equal to the number of weights
if n_samples != len(sample_weight):
    raise ValueError("n_samples=%d should be == len(sample_weight)=%d"
                     % (n_samples, len(sample_weight)))
```

I didn't change this to check_consistent_length, because that would result in a very generic error message. Let me know if you'd still like me to change it.
Nice work! I am a bit busy this week; I'll take a closer look next week.

Oops, I forgot the whats_new.

Sure, I'll do that later today.
Reference Issues/PRs

See also #3998.

What does this implement/fix? Explain your changes.

This branch adds support for sample weights to the K-Means algorithm (as well as Minibatch K-Means). This is done by adding the optional parameter sample_weights to KMeans.fit, KMeans.partial_fit, KMeans.predict, KMeans.fit_predict, KMeans.fit_transform, as well as k_means. Full backwards compatibility of the public methods of the classes KMeans and MiniBatchKMeans is maintained.

Any other comments?
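For illustration, a minimal usage sketch of the new API (using the singular sample_weight keyword that the review above settled on; data and parameters are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
# Give each sample a positive weight; heavier samples pull centers harder.
w = np.random.RandomState(0).uniform(0.5, 2.0, size=X.shape[0])

km = KMeans(n_clusters=3, random_state=42)
km.fit(X, sample_weight=w)
labels = km.predict(X)
print(km.cluster_centers_)
```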