[MRG+1] Add sample weights support to kernel density estimation (fix #4394) #10803
jnothman merged 19 commits into scikit-learn:master from
Conversation
Thanks! Please add tests, for example to show the equivalence of weighting and repetition.
Thanks, I just added such tests.
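The equivalence being tested can be sketched without scikit-learn, using a toy one-dimensional Gaussian KDE written directly in NumPy (the function and variable names here are illustrative, not the actual test code): a density fitted on weighted points with positive integer weights should match the density fitted on the same points repeated weight-many times.

```python
import numpy as np

def gaussian_kde(X, query, h, weights=None):
    """Toy weighted Gaussian KDE on 1-D data, evaluated at `query`."""
    X = np.asarray(X, dtype=float)
    if weights is None:
        weights = np.ones_like(X)
    weights = np.asarray(weights, dtype=float)
    # kernel values for every (query point, sample) pair
    K = np.exp(-0.5 * ((query[:, None] - X[None, :]) / h) ** 2)
    return (K * weights).sum(axis=1) / (weights.sum() * h * np.sqrt(2 * np.pi))

rng = np.random.RandomState(0)
X = rng.rand(50)
w = 1 + rng.randint(5, size=50)   # positive integer weights
X_rep = np.repeat(X, w)           # each point repeated w times
query = rng.rand(10)

dens_weighted = gaussian_kde(X, query, h=0.1, weights=w)
dens_repeated = gaussian_kde(X_rep, query, h=0.1)
assert np.allclose(dens_weighted, dens_repeated)
```

Both estimates normalize by the total weight (which equals the repeated sample size), so the two densities agree term by term.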
I ended up raising a …
…tion of the index in the non-weighted case
I don't have enough time available to review at the moment, sorry. I do feel like tests along those lines are sufficient from a correctness perspective. I think you've got the middle point covered.
@jnothman no rush, I'll add more tests along the lines you suggested, thanks!
TomDLT left a comment
Thanks for this work.
I have some small comments, and I still need to go through the Cython code.
Also don't forget to add the suggested tests.
sklearn/neighbors/tests/test_kde.py
Outdated

def test_kde_sample_weights():
    N = 10000

Please make sure the test is not too long, reducing N and T if necessary.
You can check with pytest --durations=10 sklearn/neighbors/tests/test_kde.py::test_kde_sample_weights.
sklearn/neighbors/tests/test_kde.py
Outdated

def test_kde_sample_weights():
    N = 10000
    T = 100

Please avoid using single-letter variables (except X).
For example, change N to n_samples.
sklearn/neighbors/tests/test_kde.py
Outdated

for d in [1, 2, 10]:
    rng = np.random.RandomState(0)
    X = rng.rand(N, d)
    W = 1 + (10 * X.sum(axis=1)).astype(np.int8)

W = rng.randint(10, size=N) is probably clearer.
Indeed; however, the goal is not to have uniformly distributed weights, since that would be (asymptotically) equivalent to having uniform weights. So I picked weights that are entirely determined by the (L1) norm of the vector, to get a simple pattern where weights are positive integers.
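A minimal sketch of that weight construction (illustrative values, not the test itself): each weight is a positive integer that is a deterministic function of its sample's L1 norm.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, d = 8, 2
X = rng.rand(n_samples, d)
# weight = 1 + floor(10 * L1-norm of the row): positive, integer, non-uniform
W = 1 + (10 * X.sum(axis=1)).astype(np.int8)

assert (W >= 1).all()        # all weights are strictly positive integers
# identical rows always receive identical weights (weights are a function of X)
assert np.array_equal(W, 1 + (10 * X.sum(axis=1)).astype(np.int8))
```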
sklearn/neighbors/tests/test_kde.py
Outdated

for _ in range(w):
    repetitions.append(x.tolist())
X_repetitions = np.array(repetitions)
Y = rng.rand(T // d, d)

We want the first argument (the number of test points) to be an integer, both in Python 2 and Python 3.
Yes, but why would you divide T by d? It is just another integer that you could name n_samples_2, isn't it?

Yes, you're right, I'll rename this to make it more expressive. I do this mainly because I want to keep the number of operations under control while increasing the dimension of the space, so I decrease the cardinality of the test sample.
sklearn/neighbors/kde.py
Outdated

if sample_weight is not None:
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if len(sample_weight.shape) != 1:
sample_weight.ndim is more appropriate.
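The suggested change can be sketched as a small validation helper using `ndim` instead of `len(shape)` (a hypothetical stand-alone function, not scikit-learn's actual code):

```python
import numpy as np

def _check_sample_weight(sample_weight, n_samples):
    """Validate a 1-D sample_weight array of length n_samples (sketch)."""
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if sample_weight.ndim != 1:          # clearer than len(sample_weight.shape)
        raise ValueError("sample_weight must be 1-dimensional")
    if sample_weight.shape[0] != n_samples:
        raise ValueError("X and sample_weight have incompatible lengths")
    return sample_weight

w = _check_sample_weight([1.0, 2.0, 3.0], n_samples=3)
assert w.shape == (3,)
try:
    _check_sample_weight([[1.0], [2.0]], n_samples=2)
except ValueError:
    pass
else:
    raise AssertionError("2-D weights should be rejected")
```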
@TomDLT thanks for your comments! I just updated the branch accordingly.

@jnothman I added tests for "weighting has some effect" and "weight effect is scale invariant".

@TomDLT did you get any chance to look at the Cython?

Sorry for the slow review. I don't think the Cython needs comments.
sklearn/neighbors/binary_tree.pxi
Outdated

cdef ITYPE_t n_samples = self.data.shape[0]
cdef DTYPE_t Z
if self.sample_weight is not None:
    Z = self.sum_weight

Why do you need both Z and sum_weight?
Z is either sum_weight or n_samples, so in this context sum_weight is just as useful as n_samples, although much costlier to produce (hence better to compute it just once at __init__ time).

Yes, but why not store sum_weight = n_samples in __init__ and use it in all cases here, rather than introduce a new variable?

That would make sense indeed, thanks for the suggestion!
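The suggestion amounts to computing the normalization constant once at construction time, falling back to the sample count when no weights are given, so later code needs only one variable. A minimal Python sketch (hypothetical class and attribute names, not the actual BinaryTree):

```python
import numpy as np

class ToyTree:
    """Sketch: precompute the normalization constant at build time."""
    def __init__(self, data, sample_weight=None):
        self.data = np.asarray(data, dtype=float)
        n_samples = self.data.shape[0]
        if sample_weight is None:
            # unweighted case: the total weight is just the sample count,
            # so downstream code can use sum_weight unconditionally
            self.sum_weight = float(n_samples)
        else:
            self.sum_weight = float(np.sum(sample_weight))

tree = ToyTree(np.zeros((5, 2)))
assert tree.sum_weight == 5.0
tree_w = ToyTree(np.zeros((5, 2)), sample_weight=[1, 2, 3, 4, 5])
assert tree_w.sum_weight == 15.0
```

This removes the `Z = sum_weight or n_samples` branch from every query-time code path.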
sklearn/neighbors/binary_tree.pxi
Outdated

log_density = compute_log_kernel(dist_pt, h, kernel)
global_log_min_bound = logaddexp(global_log_min_bound,
                                 log_density)
if with_sample_weight:

Does all this branching provide much benefit?
I did not compare it to

for i in range(node_info.idx_start, node_info.idx_end):
    dist_pt = self.dist(pt, data + n_features * idx_array[i],
                        n_features)
    log_density = compute_log_kernel(dist_pt, h, kernel)
    if with_sample_weight:
        log_weight = np.log(sample_weight[idx_array[i]])
    else:
        log_weight = 0
    global_log_min_bound = logaddexp(global_log_min_bound,
                                     log_density + log_weight)

but that did not look much simpler to me. Do you have another suggestion that I am missing?
It looks more maintainable to me, which is my concern. Near-duplicate code is hard to see differences in.

I see your point, I'll do some benchmarking to see if this branching is actually improving anything.

Removing the branching slows down the fit method (both with and without sample weights, unfortunately) by about 10-15%, for 10^5 samples with 20 columns on my local machine.

I committed the simplified code anyway. Happy to revert the commit if we want to avoid the consequent slowdown.
sklearn/neighbors/kde.py
Outdated

                                 " but was {1}".format(X.shape[0],
                                                       sample_weight.shape))
if sample_weight.shape[0] != X.shape[0]:
    raise ValueError("X and sample_weight have incompatible "

Usually we use check_consistent_length
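For illustration, here is a minimal stand-in for what such a helper verifies: all given arrays must share the same number of samples (a simplified sketch, not the actual sklearn.utils implementation):

```python
def check_consistent_length(*arrays):
    """Raise ValueError unless all non-None arguments have equal length."""
    lengths = {len(a) for a in arrays if a is not None}
    if len(lengths) > 1:
        raise ValueError("Found input variables with inconsistent "
                         "numbers of samples: %r" % sorted(lengths))

check_consistent_length([1, 2, 3], [0.5, 1.0, 2.0])   # same length: no error
try:
    check_consistent_length([1, 2, 3], [0.5, 1.0])
except ValueError:
    pass
else:
    raise AssertionError("inconsistent lengths should raise")
```

Using a shared helper keeps the error message consistent across estimators, which is why the reviewer prefers it over an ad hoc shape comparison.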
…performance in the fit method (might want to revert)

I suppose 15% is quite substantial...

jnothman left a comment

Apart from nitpicks, the tests look good, so I trust the code (which also looks good, but I've only looked at it briefly so far).
sklearn/neighbors/tests/test_kde.py
Outdated

def test_kde_sample_weights():
    n_samples = 2500

How long does this test take to run? This seems larger than necessary to prove the point.

5 seconds on my laptop. Indeed, I can decrease this to a few hundred samples, which would take ~1 sec.
sklearn/neighbors/tests/test_kde.py
Outdated

X = rng.rand(n_samples, d)
weights = 1 + (10 * X.sum(axis=1)).astype(np.int8)
repetitions = []
for x, w in zip(X, weights):

n_samples_test = size_test // d
test_points = rng.rand(n_samples_test, d)
for algorithm in ['auto', 'ball_tree', 'kd_tree']:
    for metric in ['euclidean', 'minkowski', 'manhattan',

Any reason to believe minkowski without parameter would work differently?
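The reviewer's point is that the Minkowski metric with its default parameter p=2 reduces exactly to the Euclidean metric, so testing 'minkowski' without a parameter duplicates the 'euclidean' case. A quick numerical check (toy distance functions, not scikit-learn's metric classes):

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance: (sum |a_i - b_i|^p)^(1/p)."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())

rng = np.random.RandomState(0)
a, b = rng.rand(5), rng.rand(5)
assert np.isclose(minkowski(a, b), euclidean(a, b))           # p=2: identical
assert not np.isclose(minkowski(a, b, p=1), euclidean(a, b))  # p=1 differs
```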
sklearn/neighbors/kde.py
Outdated

X = check_array(X, order='C', dtype=DTYPE)

if sample_weight is not None:
    if not hasattr(sample_weight, 'shape'):
        sample_weight = np.array(sample_weight)
    if sample_weight.ndim != 1:
        raise ValueError("the shape of sample_weight must be ({0},),"
sklearn/neighbors/binary_tree.pxi
Outdated

global_log_min_bound[0] = logaddexp(global_log_min_bound[0],
                                    log_dens_contribution +
                                    log_weight)
I think we'd usually not want this extra indentation. You could do:
(log_dens_contribution +
log_weight)
sklearn/neighbors/binary_tree.pxi
Outdated

N2 = node_data[i2].idx_end - node_data[i2].idx_start
if with_sample_weight:
    N1 = 0.0
    for i in range(node_data[i1].idx_start, node_data[i1].idx_end):
Please create a cdef inline to avoid repetition of this idiom.
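The repeated idiom is "effective number of points in a node": a plain count without weights, the sum of the node's weights with them. A Python sketch of the helper the reviewer asks for (the real version would be a `cdef inline` in binary_tree.pxi; all names here are illustrative):

```python
import numpy as np

def node_weight(idx_start, idx_end, idx_array, sample_weight=None):
    """Effective size of the node covering idx_array[idx_start:idx_end]."""
    if sample_weight is None:
        return float(idx_end - idx_start)   # unweighted: just the point count
    # weighted: sum the weights of the samples the node indexes
    return float(sample_weight[idx_array[idx_start:idx_end]].sum())

idx_array = np.array([3, 0, 2, 1])          # tree's permutation of samples
weights = np.array([1.0, 2.0, 3.0, 4.0])
assert node_weight(0, 2, idx_array) == 2.0           # two points, no weights
assert node_weight(0, 2, idx_array, weights) == 5.0  # w[3] + w[0] = 4 + 1
```

Factoring this out removes the `if with_sample_weight:` loop from every call site that needs N1 or N2.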
So would it be better to revert commit 8ff81e5?

If the 15% was also more than a few seconds, then yes, I think so.
…On 4 June 2018 at 19:16, Samuel O. Ronsin wrote:
I suppose 15% is quite substantial... So would it be better to revert commit 8ff81e5?
On a 10^5 x 20 sample, the difference is ~100ms.

So a few seconds on 1e6 might be no big deal. But others' opinions may differ.

Is there something else I can do? Should the title be [MRG+1] now?

It can be MRG+1 if you'd like... but we're sort of phasing that out in preference for GitHub's Approved thing.

OK, I didn't know that, thanks!
TomDLT left a comment

LGTM
Please also add an entry in doc/whats_new/v0.20.rst
                         sample_weight.shape))
check_consistent_length(X, sample_weight)
if sample_weight.min() <= 0:
    raise ValueError("sample_weight must have positive values")

This is not covered by the tests.
Well spotted! Added a test for that as well.
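A test for that check can be sketched with a stand-in validator, so the snippet stays self-contained rather than fitting KernelDensity (hypothetical function name, not the actual scikit-learn test):

```python
def validate_sample_weight(sample_weight):
    """Reject zero or negative weights, mirroring the check under review."""
    if min(sample_weight) <= 0:
        raise ValueError("sample_weight must have positive values")
    return sample_weight

# strictly positive weights pass through unchanged
assert validate_sample_weight([0.5, 1.0, 2.0]) == [0.5, 1.0, 2.0]

# both a negative weight and a zero weight must raise
for bad in ([1.0, -1.0], [0.0, 2.0]):
    try:
        validate_sample_weight(bad)
    except ValueError:
        pass
    else:
        raise AssertionError("non-positive weights should raise ValueError")
```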
Thank you for a very nice contribution!

Hi! Is this feature already available? I don't see it in the documentation of sklearn.

It is not in the stable doc, but it is in the dev doc.
Fix #4394
This PR adds sample weights support to kernel density estimation by adding a sample_weight array to the BinaryTree object and updating the traversal of the tree whenever sample weights are detected, all in the binary_tree.pxi Cython code.
This PR also updates the centroids of the BallTree to take the sample weights into account, in the ball_tree.pyx code.
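The centroid change described above amounts to replacing a node's unweighted mean with a weighted mean of its points. An illustrative NumPy version (not the actual ball_tree.pyx code):

```python
import numpy as np

def node_centroid(points, sample_weight=None):
    """Centroid of a ball-tree node: weighted mean of its points."""
    return np.average(points, axis=0, weights=sample_weight)

points = np.array([[0.0, 0.0],
                   [1.0, 0.0]])
# unweighted: plain midpoint
assert np.allclose(node_centroid(points), [0.5, 0.0])
# weight 3 on the second point pulls the centroid toward it: (0 + 3)/4 = 0.75
assert np.allclose(node_centroid(points, [1.0, 3.0]), [0.75, 0.0])
```

Keeping centroids weight-aware matters because the ball tree's distance bounds are computed from node centroids and radii.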