[MRG+1] Add sample weights support to kernel density estimation (fix #4394) #10803
jnothman merged 19 commits into scikit-learn:master from
Conversation
Thanks! Please add tests, for example to show the equivalence of weighting and repetition.
Thanks, I just added such tests.
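The equivalence being tested can be sketched without scikit-learn, using a toy one-dimensional Gaussian KDE written directly in NumPy (the function and variable names here are illustrative, not the actual test code): a density fitted on weighted points with positive integer weights should match the density fitted on the same points repeated weight-many times.

```python
import numpy as np

def gaussian_kde(X, query, h, weights=None):
    """Toy weighted Gaussian KDE on 1-D data, evaluated at `query`."""
    X = np.asarray(X, dtype=float)
    if weights is None:
        weights = np.ones_like(X)
    weights = np.asarray(weights, dtype=float)
    # kernel values for every (query point, sample) pair
    K = np.exp(-0.5 * ((query[:, None] - X[None, :]) / h) ** 2)
    return (K * weights).sum(axis=1) / (weights.sum() * h * np.sqrt(2 * np.pi))

rng = np.random.RandomState(0)
X = rng.rand(50)
w = 1 + rng.randint(5, size=50)   # positive integer weights
X_rep = np.repeat(X, w)           # each point repeated w times
query = rng.rand(10)

dens_weighted = gaussian_kde(X, query, h=0.1, weights=w)
dens_repeated = gaussian_kde(X_rep, query, h=0.1)
assert np.allclose(dens_weighted, dens_repeated)
```

Both estimates normalize by the total weight (which equals the repeated sample size), so the two densities agree term by term.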
I ended up raising a …
…tion of the index in the non-weighted case
I don't have enough time available to review at the moment, sorry. I do feel like tests along those lines are sufficient from a correctness perspective. I think you've got the middle point covered.
@jnothman no rush, I'll add more tests along the lines you suggested, thanks!
TomDLT left a comment
Thanks for this work.
I have some small comments, and I still need to go through the Cython code.
Also don't forget to add the suggested tests.
sklearn/neighbors/tests/test_kde.py
Outdated

def test_kde_sample_weights():
    N = 10000

Please make sure the test is not too long, reducing N and T if necessary.
You can check with pytest --durations=10 sklearn/neighbors/tests/test_kde.py::test_kde_sample_weights.
sklearn/neighbors/tests/test_kde.py
Outdated

def test_kde_sample_weights():
    N = 10000
    T = 100

Please avoid using single-letter variables (except X).
For example, change N to n_samples.
sklearn/neighbors/tests/test_kde.py
Outdated

for d in [1, 2, 10]:
    rng = np.random.RandomState(0)
    X = rng.rand(N, d)
    W = 1 + (10 * X.sum(axis=1)).astype(np.int8)

W = rng.randint(10, size=N) is probably clearer.
Indeed; however, the goal is not to have uniformly distributed weights, since that would be (asymptotically) equivalent to having uniform weights. So I picked weights that are entirely determined by the (L1) norm of the vector, to get a simple pattern where weights are positive integers.
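A minimal sketch of that weight construction (illustrative values, not the test itself): each weight is a positive integer that is a deterministic function of its sample's L1 norm.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, d = 8, 2
X = rng.rand(n_samples, d)
# weight = 1 + floor(10 * L1-norm of the row): positive, integer, non-uniform
W = 1 + (10 * X.sum(axis=1)).astype(np.int8)

assert (W >= 1).all()        # all weights are strictly positive integers
# identical rows always receive identical weights (weights are a function of X)
assert np.array_equal(W, 1 + (10 * X.sum(axis=1)).astype(np.int8))
```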
sklearn/neighbors/tests/test_kde.py
Outdated

for _ in range(w):
    repetitions.append(x.tolist())
X_repetitions = np.array(repetitions)
Y = rng.rand(T // d, d)

We want the first argument (the number of test points) to be an integer, both in Python 2 and Python 3.
Yes, but why would you divide T by d? It is just another integer that you could name n_samples_2, isn't it?

Yes, you're right, I'll rename this to make it more expressive. I do this mainly because I want to keep the number of operations under control while increasing the dimension of the space, so I decrease the cardinality of the test sample.
sklearn/neighbors/kde.py
Outdated

if sample_weight is not None:
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if len(sample_weight.shape) != 1:
sample_weight.ndim is more appropriate.
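The suggested change can be sketched as a small validation helper using `ndim` instead of `len(shape)` (a hypothetical stand-alone function, not scikit-learn's actual code):

```python
import numpy as np

def _check_sample_weight(sample_weight, n_samples):
    """Validate a 1-D sample_weight array of length n_samples (sketch)."""
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if sample_weight.ndim != 1:          # clearer than len(sample_weight.shape)
        raise ValueError("sample_weight must be 1-dimensional")
    if sample_weight.shape[0] != n_samples:
        raise ValueError("X and sample_weight have incompatible lengths")
    return sample_weight

w = _check_sample_weight([1.0, 2.0, 3.0], n_samples=3)
assert w.shape == (3,)
try:
    _check_sample_weight([[1.0], [2.0]], n_samples=2)
except ValueError:
    pass
else:
    raise AssertionError("2-D weights should be rejected")
```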
@TomDLT thanks for your comments! I just updated the branch accordingly.

@jnothman I added tests for "weighting has some effect" and "weight effect is scale invariant".

@TomDLT did you get any chance to look at the Cython?

Sorry for the slow review. I don't think the Cython needs comments.
sklearn/neighbors/binary_tree.pxi
Outdated

cdef ITYPE_t n_samples = self.data.shape[0]
cdef DTYPE_t Z
if self.sample_weight is not None:
    Z = self.sum_weight

Why do you need both Z and sum_weight?
Z is either sum_weight or n_samples, so in this context sum_weight is just as useful as n_samples, although much costlier to produce (hence better to compute it just once at __init__ time).

Yes, but why not store sum_weight = n_samples in __init__ and use it in all cases here, rather than introduce a new variable?

That would make sense indeed, thanks for the suggestion!
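The suggestion amounts to computing the normalization constant once at construction time, falling back to the sample count when no weights are given, so later code needs only one variable. A minimal Python sketch (hypothetical class and attribute names, not the actual BinaryTree):

```python
import numpy as np

class ToyTree:
    """Sketch: precompute the normalization constant at build time."""
    def __init__(self, data, sample_weight=None):
        self.data = np.asarray(data, dtype=float)
        n_samples = self.data.shape[0]
        if sample_weight is None:
            # unweighted case: the total weight is just the sample count,
            # so downstream code can use sum_weight unconditionally
            self.sum_weight = float(n_samples)
        else:
            self.sum_weight = float(np.sum(sample_weight))

tree = ToyTree(np.zeros((5, 2)))
assert tree.sum_weight == 5.0
tree_w = ToyTree(np.zeros((5, 2)), sample_weight=[1, 2, 3, 4, 5])
assert tree_w.sum_weight == 15.0
```

This removes the `Z = sum_weight or n_samples` branch from every query-time code path.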
sklearn/neighbors/binary_tree.pxi
Outdated

log_density = compute_log_kernel(dist_pt, h, kernel)
global_log_min_bound = logaddexp(global_log_min_bound,
                                 log_density)
if with_sample_weight:

Does all this branching provide much benefit?
I did not compare it to

for i in range(node_info.idx_start, node_info.idx_end):
    dist_pt = self.dist(pt, data + n_features * idx_array[i],
                        n_features)
    log_density = compute_log_kernel(dist_pt, h, kernel)
    if with_sample_weight:
        log_weight = np.log(sample_weight[idx_array[i]])
    else:
        log_weight = 0
    global_log_min_bound = logaddexp(global_log_min_bound,
                                     log_density + log_weight)

but that did not look much simpler to me. Do you have another suggestion that I am missing?
It looks more maintainable to me, which is my concern. Near-duplicate code is hard to see differences in.

I see your point, I'll do some benchmarking to see if this branching is actually improving anything.

Removing the branching slows down the fit method (both with and without sample weights, unfortunately) by about 10-15%, for 10^5 samples with 20 columns on my local machine.

I committed the simplified code anyway. Happy to revert the commit if we want to avoid the consequent slowdown.
sklearn/neighbors/kde.py
Outdated

                                 " but was {1}".format(X.shape[0],
                                                       sample_weight.shape))
if sample_weight.shape[0] != X.shape[0]:
    raise ValueError("X and sample_weight have incompatible "

Usually we use check_consistent_length
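For illustration, here is a minimal stand-in for what such a helper verifies: all given arrays must share the same number of samples (a simplified sketch, not the actual sklearn.utils implementation):

```python
def check_consistent_length(*arrays):
    """Raise ValueError unless all non-None arguments have equal length."""
    lengths = {len(a) for a in arrays if a is not None}
    if len(lengths) > 1:
        raise ValueError("Found input variables with inconsistent "
                         "numbers of samples: %r" % sorted(lengths))

check_consistent_length([1, 2, 3], [0.5, 1.0, 2.0])   # same length: no error
try:
    check_consistent_length([1, 2, 3], [0.5, 1.0])
except ValueError:
    pass
else:
    raise AssertionError("inconsistent lengths should raise")
```

Using a shared helper keeps the error message consistent across estimators, which is why the reviewer prefers it over an ad hoc shape comparison.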
…performance in the fit method (might want to revert)

I suppose 15% is quite substantial...

jnothman left a comment

Apart from nitpicks, the tests look good, so I trust the code (which also looks good, but I've only looked at it briefly so far).
sklearn/neighbors/tests/test_kde.py
Outdated

def test_kde_sample_weights():
    n_samples = 2500

How long does this test take to run? This seems larger than necessary to prove the point.

5 seconds on my laptop. Indeed, I can decrease this to a few hundred samples, which would take ~1 sec.
sklearn/neighbors/tests/test_kde.py
Outdated

X = rng.rand(n_samples, d)
weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
repetitions = []
for x, w in zip(X, weights):

n_samples_test = size_test // d
test_points = rng.rand(n_samples_test, d)
for algorithm in ['auto', 'ball_tree', 'kd_tree']:
    for metric in ['euclidean', 'minkowski', 'manhattan',

Any reason to believe minkowski without parameter would work differently?
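The reviewer's point is that the Minkowski metric with its default parameter p=2 reduces exactly to the Euclidean metric, so testing 'minkowski' without a parameter duplicates the 'euclidean' case. A quick numerical check (toy distance functions, not scikit-learn's metric classes):

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance: (sum |a_i - b_i|^p)^(1/p)."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())

rng = np.random.RandomState(0)
a, b = rng.rand(5), rng.rand(5)
assert np.isclose(minkowski(a, b), euclidean(a, b))           # p=2: identical
assert not np.isclose(minkowski(a, b, p=1), euclidean(a, b))  # p=1 differs
```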
sklearn/neighbors/kde.py
Outdated

X = check_array(X, order='C', dtype=DTYPE)

if sample_weight is not None:
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if sample_weight.ndim != 1:
        raise ValueError("the shape of sample_weight must be ({0},),"
sklearn/neighbors/binary_tree.pxi
Outdated

global_log_min_bound[0] = logaddexp(global_log_min_bound[0],
                                    log_dens_contribution +
                                    log_weight)
I think we'd usually not want this extra indentation. You could do:
(log_dens_contribution +
log_weight)
sklearn/neighbors/binary_tree.pxi
Outdated

N2 = node_data[i2].idx_end - node_data[i2].idx_start
if with_sample_weight:
    N1 = 0.0
    for i in range(node_data[i1].idx_start, node_data[i1].idx_end):
Please create a cdef inline to avoid repetition of this idiom.
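The repeated idiom is "effective number of points in a node": a plain count without weights, the sum of the node's weights with them. A Python sketch of the helper the reviewer asks for (the real version would be a `cdef inline` in binary_tree.pxi; all names here are illustrative):

```python
import numpy as np

def node_weight(idx_start, idx_end, idx_array, sample_weight=None):
    """Effective size of the node covering idx_array[idx_start:idx_end]."""
    if sample_weight is None:
        return float(idx_end - idx_start)   # unweighted: just the point count
    # weighted: sum the weights of the samples the node indexes
    return float(sample_weight[idx_array[idx_start:idx_end]].sum())

idx_array = np.array([3, 0, 2, 1])          # tree's permutation of samples
weights = np.array([1.0, 2.0, 3.0, 4.0])
assert node_weight(0, 2, idx_array) == 2.0           # two points, no weights
assert node_weight(0, 2, idx_array, weights) == 5.0  # w[3] + w[0] = 4 + 1
```

Factoring this out removes the `if with_sample_weight:` loop from every call site that needs N1 or N2.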
So would it be better to revert commit 8ff81e5?

If the 15% was also more than a few seconds, then yes, I think so.
…On 4 June 2018 at 19:16, Samuel O. Ronsin wrote:
I suppose 15% is quite substantial... So would it be better to revert commit 8ff81e5?
On a 10^5 x 20 sample, the difference is ~100ms.

So a few seconds on 1e6 might be no big deal. But others' opinions may differ.

Is there something else I can do? Should the title be [MRG+1] now?

It can be MRG+1 if you'd like... but we're sort of phasing that out in preference for GitHub's Approved thing.

OK, I didn't know that, thanks!
TomDLT left a comment

LGTM
Please also add an entry in doc/whats_new/v0.20.rst
                         sample_weight.shape))
check_consistent_length(X, sample_weight)
if sample_weight.min() <= 0:
    raise ValueError("sample_weight must have positive values")

This is not covered by the tests.
Well spotted! Added a test for that as well.
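A test for that check can be sketched with a stand-in validator, so the snippet stays self-contained rather than fitting KernelDensity (hypothetical function name, not the actual scikit-learn test):

```python
def validate_sample_weight(sample_weight):
    """Reject zero or negative weights, mirroring the check under review."""
    if min(sample_weight) <= 0:
        raise ValueError("sample_weight must have positive values")
    return sample_weight

# strictly positive weights pass through unchanged
assert validate_sample_weight([0.5, 1.0, 2.0]) == [0.5, 1.0, 2.0]

# both a negative weight and a zero weight must raise
for bad in ([1.0, -1.0], [0.0, 2.0]):
    try:
        validate_sample_weight(bad)
    except ValueError:
        pass
    else:
        raise AssertionError("non-positive weights should raise ValueError")
```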
Thank you for a very nice contribution!

Hi! Is this feature already available? I don't see it in the documentation of sklearn.

It is not in the stable doc, but it is in the dev doc.
Fix #4394
This PR adds sample weights support to kernel density estimation by adding a sample_weight array to the BinaryTree object and updating the traversal of the tree whenever sample weights are detected, all in the binary_tree.pxi Cython code.
This PR also updates the centroids of the BallTree to take the sample weights into account, in the ball_tree.pyx code.
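The centroid change described above amounts to replacing a node's unweighted mean with a weighted mean of its points. An illustrative NumPy version (not the actual ball_tree.pyx code):

```python
import numpy as np

def node_centroid(points, sample_weight=None):
    """Centroid of a ball-tree node: weighted mean of its points."""
    return np.average(points, axis=0, weights=sample_weight)

points = np.array([[0.0, 0.0],
                   [1.0, 0.0]])
# unweighted: plain midpoint
assert np.allclose(node_centroid(points), [0.5, 0.0])
# weight 3 on the second point pulls the centroid toward it: (0 + 3)/4 = 0.75
assert np.allclose(node_centroid(points, [1.0, 3.0]), [0.75, 0.0])
```

Keeping centroids weight-aware matters because the ball tree's distance bounds are computed from node centroids and radii.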