[MRG+1] Adding support for sample weights to K-Means #10933
TomDLT merged 12 commits into scikit-learn:master from the weighted_k_means branch
Conversation
This is exciting! Before a proper review, can you please benchmark the effect on runtime when passed no weights?

Yes, will do.
TomDLT left a comment

Just a minor comment. I need to do a second pass, but this is already very nice.
sklearn/cluster/k_means_.py (Outdated)

```python
sample_weights = _check_sample_weights(X, sample_weights)

# verify that the number of samples is equal to the number of weights
if _num_samples(X) != len(sample_weights):
```
You can also use sklearn.utils.validation.check_consistent_length. Also, this check does not seem to be tested.

I could use sklearn.utils.validation.check_consistent_length, but then the error message would be very generic. Would you still prefer that? I'll add tests for the check method.

Right, I have no strong feelings about it; you can keep it this way.
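For context, a minimal sketch of the two validation styles under discussion (the array shapes are illustrative; the custom message mirrors the one adopted later in this PR):

```python
import numpy as np
from sklearn.utils.validation import check_consistent_length

X = np.zeros((10, 3))
sample_weights = np.ones(9)

# Generic error: "Found input variables with inconsistent numbers of samples: ..."
# check_consistent_length(X, sample_weights)

# Specific error message, as preferred in this PR:
if X.shape[0] != len(sample_weights):
    raise ValueError("n_samples=%d should be == len(sample_weights)=%d"
                     % (X.shape[0], len(sample_weights)))
```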
Here's a first idea of the change in performance. I have this in benchmark_k_means.py:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42)
```

On branch weighted_k_means:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.8 msec per loop
```

On branch master:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 36.4 msec per loop
```

As expected, adding this feature slows down fitting very slightly when no sample weights are used.
```cython
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)

for i in range(n_samples):
```
Don't know if this is slow since Cython is compiled, yet you may try whether np.add.at is faster:

```python
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
np.add.at(weights_sum_in_cluster, labels, sample_weights)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
```

Unfortunately, that is considerably slower:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 57.5 msec per loop
```
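As an aside (not part of this PR, and not benchmarked here), np.bincount is the usual fast NumPy idiom for this kind of per-label accumulation and tends to outperform np.add.at; a minimal sketch with made-up data:

```python
import numpy as np

n_clusters = 3
labels = np.array([0, 1, 1, 2, 0])
sample_weights = np.array([1.0, 2.0, 0.5, 1.5, 1.0])

# Accumulate each sample's weight into its cluster's bin, in C.
weights_sum_in_cluster = np.bincount(labels, weights=sample_weights,
                                     minlength=n_clusters)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
```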
Given that the performance drop isn't all that noticeable, I'd probably leave it as is. Any additional improvement for the case where sample_weights=None …
I did a bit of line profiling on _kmeans_single_elkan:

```python
checked_sample_weights = _check_sample_weights(X, sample_weights)
centers, labels, n_iter = k_means_elkan(X, checked_sample_weights,
                                        n_clusters, centers, tol=tol,
                                        max_iter=max_iter, verbose=verbose)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= np.expand_dims(checked_sample_weights, axis=-1)
inertia = np.sum(sq_distances, dtype=np.float64)
```

I did not look at …
Good catch. After implementing your suggestion I cannot make out any statistically significant difference. For …
sklearn/cluster/k_means_.py (Outdated)

```diff
-inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
+sq_distances = (X - centers[labels]) ** 2
+if sample_weights is not None:
+    sq_distances *= checked_sample_weights[:, np.newaxis]
```
Actually, on second thought, the operation is expensive because of useless broadcasting. Instead of n_samples * n_features multiplications, we can do only n_samples multiplications with:

```python
if sample_weights is not None:
    sq_distances = np.sum(sq_distances, axis=1, dtype=np.float64)
    sq_distances *= checked_sample_weights
```

You are right that this makes the computation faster if sample_weights is not None. It obviously doesn't make a difference when no sample weights are passed.

I should clarify that when I said above that "I cannot make out any statistically significant difference" I was referring to the difference between the weighted_k_means and master branches, not to before and after the change in _kmeans_single_elkan.
Sure, but removing the unnecessary broadcasting should speed up the sample_weights is not None case.

Definitely, I'm testing it right now.
Okay, new benchmarks, adding:

```python
w = np.ones(X.shape[0])
```

On branch weighted_k_means:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop
```

On branch master:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop
```

All subject to ~1 msec variation.
In case it gets lost in the collapsed comments above, here are again the current benchmarks. In benchmark_k_means.py:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
w = np.ones(X.shape[0])
km = KMeans(n_clusters=3, random_state=42)
```

On branch weighted_k_means:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop
```

On branch master:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop
```

All subject to ~1 msec variation between benchmarks.
jnothman left a comment

Nice work! I hope I'm able to review this soon, but the next couple of weeks are full up!
```python
return centers[sort_index, :], sorted_labels


def test_k_means_weighted_vs_repeated():
```
Can we please parametrize or loop these tests to avoid repeating code for KMeans and MBKMeans?
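A minimal sketch of such a parametrized test (the fixture data and assertions are illustrative, and the sample_weights keyword follows this PR's spelling before the later rename to sample_weight):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_weighted_vs_repeated(Estimator):
    # Fitting with integer weights should match fitting on repeated samples.
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
    sample_weights = np.random.RandomState(42).randint(1, 5, size=X.shape[0])
    X_repeat = np.repeat(X, sample_weights, axis=0)

    est_weighted = Estimator(n_clusters=3, random_state=42).fit(
        X, sample_weights=sample_weights)
    est_repeated = Estimator(n_clusters=3, random_state=42).fit(X_repeat)

    # Expand the weighted labels so both labelings cover the repeated samples.
    labels_weighted = np.repeat(est_weighted.labels_, sample_weights)
    assert v_measure_score(labels_weighted,
                           est_repeated.labels_) == pytest.approx(1.0)
```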
```python
    est_weighted.cluster_centers_, np.repeat(est_weighted.labels_,
                                             sample_weights))
assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
if not isinstance(estimator, MiniBatchKMeans):
```
In case you are wondering why this test is not done for MiniBatchKMeans: that's because it fails. This is not due to the changes in this PR, however. I did a quick test on the master branch, comparing the cluster_centers_ of KMeans and MiniBatchKMeans, and they are not the same.
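For reference, a quick way to reproduce that observation (illustrative script; the exact centers depend on the data, seeds, and versions used):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, random_state=42).fit(X)

# Mini-batch k-means optimizes the same objective stochastically, so its
# centers generally differ slightly from full-batch k-means.
print(np.sort(km.cluster_centers_, axis=0))
print(np.sort(mbk.cluster_centers_, axis=0))
```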
I have no idea why the latest commit fails the lgtm check. I only updated the tests and they all check out.

Looks like a temporary outage to me.

Hi guys, sorry about this - it seems that this particular base commit couldn't be built due to an outdated pip cache. We're working on fixing the "retry analysis" link to provide a way to overcome this in the future.

Hi guys, has anyone had a chance to review this yet?
jnothman left a comment

Basically cosmetic. Good work!
sklearn/cluster/_k_means.pyx (Outdated)

```cython
@cython.wraparound(False)
@cython.cdivision(True)
cpdef DOUBLE _assign_labels_array(np.ndarray[floating, ndim=2] X,
                                  np.ndarray[floating, ndim=1] sample_weights,
```
Please use the singular sample_weight for local and global consistency.

So just to confirm, sample_weight instead of sample_weights throughout? I had been consistently using the plural everywhere.

Yes, I think so. That is consistent with the rest of the library.

Alright, will do. Although personally, the singular makes me think that a scalar is requested.

It's quite common to use singular nomenclature for vectors... though we are very inconsistent!
sklearn/cluster/_k_means.pyx (Outdated)

```diff
 @cython.wraparound(False)
 @cython.cdivision(True)
-def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
+def _mini_batch_update_csr(X, np.ndarray[floating, ndim=1] sample_weights,
```

(argh! Plurals! Do what you like here, I suppose)
sklearn/cluster/_k_means.pyx (Outdated)

```cython
dtype = np.float32 if floating is float else np.float64
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
```

weight_in_cluster would be a sufficient name.
sklearn/cluster/_k_means.pyx (Outdated)

```cython
@cython.wraparound(False)
@cython.cdivision(True)
def _centers_dense(np.ndarray[floating, ndim=2] X,
                   np.ndarray[floating, ndim=1] sample_weights,
```

In general, though, we should prefer the singular.
sklearn/cluster/_k_means_elkan.pyx (Outdated)

```diff
-def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_, int n_clusters,
+def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_,
+                  np.ndarray[floating, ndim=1, mode='c'] sample_weights,
```
sklearn/cluster/k_means_.py (Outdated)

```python
    return np.ones(n_samples, dtype=X.dtype)
else:
    # verify that the number of samples is equal to the number of weights
    if n_samples != len(sample_weights):
```

We have a helper called check_consistent_length.

Tom made the same remark earlier, and I replied that the error message would then be very generic. I'd prefer the more specific error message, but I'm happy to change it.
sklearn/cluster/k_means_.py (Outdated)

```diff
     return self

-def fit_predict(self, X, y=None):
+def fit_predict(self, X, y=None, sample_weights=None):
```
```python
    est_1.cluster_centers_, est_1.labels_)
centers_2, labels_2 = _sort_cluster_centers_and_labels(
    est_2.cluster_centers_, est_2.labels_)
assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
```

You should be able to get the perfect v_measure even without sorting. This makes for a red herring when reading the code.

You're right; the sorting of the labels predates my discovery of v_measure_score and is now redundant.
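Indeed, v_measure_score is invariant to permutations of the label ids, so the sort adds nothing; a quick illustration with made-up labelings:

```python
from sklearn.metrics import v_measure_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same partition, permuted label ids

print(v_measure_score(labels_a, labels_b))  # 1.0
```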
```python
sample_weights = None
checked_sample_weights = _check_sample_weights(X, sample_weights)
assert_equal(_num_samples(X), _num_samples(checked_sample_weights))
assert_equal(X.dtype, checked_sample_weights.dtype)
```

If you're doing these checks, why not also check that the output sums to 1?
```python
# repetition of the sample
sample_weights = np.random.randint(1, 5, size=n_samples)
X_repeat = np.repeat(X, sample_weights, axis=0)
for estimator in [KMeans(n_clusters=n_clusters, random_state=42),
```

Need to test the different init approaches as well.
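A sketch of how the init schemes could be covered too (KMeans accepts init='k-means++', 'random', or an array of initial centers; the test body and the sample_weights keyword from this PR are illustrative):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

@pytest.mark.parametrize("init", ["k-means++", "random"])
def test_weighted_kmeans_init(init):
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
    rng = np.random.RandomState(42)
    sample_weights = rng.randint(1, 5, size=X.shape[0])

    km = KMeans(n_clusters=3, init=init, random_state=42)
    km.fit(X, sample_weights=sample_weights)
    # Smoke check: the fitted centers have the expected shape.
    assert km.cluster_centers_.shape == (3, X.shape[1])
```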
```python
# verify that the number of samples is equal to the number of weights
if n_samples != len(sample_weight):
    raise ValueError("n_samples=%d should be == len(sample_weight)=%d"
                     % (n_samples, len(sample_weight)))
```

I didn't change this to check_consistent_length, because that would result in a very generic error message. Let me know if you'd still like me to change it.
Nice work! I am a bit busy this week; I'll take a closer look next week.

Oops, I forgot the whats_new.

Sure, I'll do that later today.
Reference Issues/PRs

See also #3998.

What does this implement/fix? Explain your changes.

This branch adds support for sample weights to the K-Means algorithm (as well as Minibatch K-Means). This is done by adding the optional parameter sample_weights to KMeans.fit, KMeans.partial_fit, KMeans.predict, KMeans.fit_predict, KMeans.fit_transform, as well as k_means. Full backwards compatibility of the public methods of the classes KMeans and MiniBatchKMeans is maintained.

Any other comments?
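For illustration, a minimal usage sketch of the new API (using the singular sample_weight keyword that the review above settled on; data and parameters are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
# Give each sample a positive weight; heavier samples pull centers harder.
w = np.random.RandomState(0).uniform(0.5, 2.0, size=X.shape[0])

km = KMeans(n_clusters=3, random_state=42)
km.fit(X, sample_weight=w)
labels = km.predict(X)
print(km.cluster_centers_)
```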