ENH Sample weights for median_absolute_error #17225
glemaitre merged 49 commits into scikit-learn:master
Conversation
ping @glemaitre :p
Thanks @lucyleeow , made a quick (and incomplete) pass.
I'm wondering whether it's worth extending _weighted_percentile, or if it'd be easier and faster to just have our own weighted_median. I would assume that computing a weighted median is simpler and faster than a general algorithm that works for every percentile, though I haven't looked into the details.
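For illustration, a dedicated weighted median can be quite short: sort by value, accumulate the weights, and stop at half the total weight. This is only a rough sketch of the idea being discussed, not scikit-learn code:

```python
import numpy as np

def weighted_median(values, weights):
    # Sort by value, accumulate the weights, and return the first value
    # whose cumulative weight reaches half of the total weight.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)
    idx = np.searchsorted(cum_weights, 0.5 * cum_weights[-1])
    return values[idx]
```

With unit weights this reduces to the ordinary (lower) median, and a single heavy weight pulls the result toward that sample.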
sklearn/metrics/_regression.py

    output_errors = _weighted_percentile(np.abs(y_pred - y_true),
                                         sample_weight=sample_weight)
I think we should avoid calling _weighted_percentile if sample_weight is None and still rely on np.median. The reason being that np.median is probably much faster than our more general _weighted_percentile (could be wrong on that, but I'd be surprised).
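The dispatch being suggested could look roughly like this (a hypothetical 1-D sketch with a stand-in weighted median, not the merged scikit-learn code):

```python
import numpy as np

def median_abs_error(y_true, y_pred, sample_weight=None):
    """1-D sketch: dispatch between np.median and a weighted median."""
    abs_err = np.abs(np.asarray(y_pred, dtype=float)
                     - np.asarray(y_true, dtype=float))
    if sample_weight is None:
        # unweighted case: np.median is simpler and likely faster than
        # the general weighted-percentile code path
        return np.median(abs_err)
    # stand-in for the weighted 50th percentile
    order = np.argsort(abs_err)
    sorted_err = abs_err[order]
    cum = np.cumsum(np.asarray(sample_weight, dtype=float)[order])
    return sorted_err[np.searchsorted(cum, 0.5 * cum[-1])]
```

The `sample_weight is None` branch is the fast path the reviewer is asking to preserve.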
| err_msg="scorer {0} behaves differently when " | ||
| "ignoring samples and setting sample_weight to" | ||
| " 0: {1} vs {2}".format(name, weighted, | ||
| if name != 'neg_median_absolute_error': |
There was a problem hiding this comment.
Can you help me understand why you'd need this?
Also, if we need to ignore neg_median_absolute_error, it would be preferable to filter it out before the loop: for name, scorer in SCORERS.items()
Since our data is binary classification data, with targets of 1 or 0, there are only 2 possible values of np.abs(y_pred - y_true): 1 or 0. With only 2 possible values, it's likely that the median after 'randomly' removing (or zero-weighting) 10 of the 25 elements is the same as the median of all 25 elements.
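A tiny illustration of that point (hypothetical numbers, not the test's actual data):

```python
import numpy as np

# With binary classification targets, np.abs(y_pred - y_true) can only be
# 0 or 1, so the median barely moves when a few samples are dropped or
# given zero weight.
errors = np.array([0] * 10 + [1] * 15)   # 25 binary "errors"; median is 1
subset = errors[5:]                      # drop 5 of the zero-error samples
full_median = np.median(errors)
subset_median = np.median(subset)
```

Both medians are 1 here, which is why the scorer comparison in the test cannot detect whether sample_weight is actually used.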
To filter before the loop, would you make a copy of the SCORERS dict and remove neg_median_absolute_error before the loop? Is there a better way?
I could instead use make_regression to create a y just for neg_median_absolute_error?
I think using regression targets here for regression metrics seems reasonable...?
TBH, I think this test is weird. Why are we testing the metric properties of scorers when really the unit that this file should be testing is the wrapper around the metric? We should only be testing here that sample_weight is correctly passed to the metric, regardless of the parameters to make_scorer.
Fair point. Though I would like to add more tests for _weighted_percentile with respect to the weighted median. Specifically, that _weighted_percentile(percentile=50) with equal weights is the same as np.median, and also that the difference between the sums of weights to the left and right of the weighted median is the smallest possible. Reference: https://en.wikipedia.org/wiki/Weighted_median#Properties
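The second property could be checked with a small predicate like this (my own sketch of the Wikipedia property, with a hypothetical helper name, not scikit-learn test code):

```python
import numpy as np

def is_weighted_median(values, weights, candidate):
    # Weighted-median property: the total weight strictly below the
    # candidate and the total weight strictly above it are each at most
    # half of the total weight.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    half = weights.sum() / 2
    return (weights[values < candidate].sum() <= half
            and weights[values > candidate].sum() <= half)
```

With equal weights, np.median always satisfies this predicate, which connects the two proposed tests.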
If you agree, should I add the tests to this PR or a different one?
Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.
With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?
> With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?

If it works for all, that would be more elegant.
> Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Not sklearn/metrics/tests/test_regression.py?
> Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Whoops, I meant the tests for _weighted_percentile, which are in test_gradient_boosting_loss_functions.py. I would also like to add to the tests for _weighted_percentile.
starting here:
ping @jnothman
sklearn/utils/stats.py

    import numpy as np

    from .extmath import stable_cumsum
    from sklearn.utils.fixes import _take_along_axis
It should be a relative import here:

-   from sklearn.utils.fixes import _take_along_axis
+   from .fixes import _take_along_axis
sklearn/metrics/_regression.py

    @_deprecate_positional_args
-   def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average'):
+   def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average',
+                             sample_weight=None,):
Either remove the last comma, or do:

    def median_absolute_error(
        y_true, y_pred, *, multioutput='uniform_average',
        sample_weight=None,
    ):
sklearn/utils/fixes.py

    def _take_along_axis(arr, indices, axis):
        """Implements a simplified version of numpy.take_along_axis if numpy
        version < 1.15"""
        import numpy

Numpy is already imported as np.
sklearn/utils/fixes.py

        version < 1.15"""
        import numpy

        if numpy.__version__ >= LooseVersion('1.15'):

np_version already contains the parsed version:

-   if numpy.__version__ >= LooseVersion('1.15'):
+   if np_version > (1, 14):
sklearn/utils/tests/test_stats.py

    w_median = _weighted_percentile(x_2d, w_2d)

    for i, value in enumerate(w_median):
        assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))

assert should not take parentheses:

-       assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))
+       p = _weighted_percentile(x_2d[:, i], w_2d[:, i])
+       assert value == pytest.approx(p)
Or:

    p_axis_0 = [
        _weighted_percentile(x_2d[:, i], w_2d[:, i])
        for i in range(len(w_median))
    ]
    assert_allclose(w_median, p_axis_0)
sklearn/utils/tests/test_stats.py

    def test_weighted_percentile_2d():
        # Check for when array is 2D
I think that we should test the case -> array 2-D and sample_weight 1-D
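The 2-D array / 1-D sample_weight case amounts to broadcasting one shared weight vector across all columns. A rough stand-in for that behaviour (hypothetical function, not the PR's _weighted_percentile):

```python
import numpy as np

def weighted_percentile_2d(array, sample_weight, percentile=50):
    # Weighted percentile along axis=0, broadcasting a shared 1-D weight
    # vector across all columns when needed.
    array = np.asarray(array, dtype=float)
    sample_weight = np.asarray(sample_weight, dtype=float)
    if sample_weight.ndim == 1:
        # tile the shared weights so each column gets the same vector
        sample_weight = np.repeat(
            sample_weight[:, np.newaxis], array.shape[1], axis=1)
    result = np.empty(array.shape[1])
    for j in range(array.shape[1]):
        order = np.argsort(array[:, j])
        cum = np.cumsum(sample_weight[order, j])
        idx = np.searchsorted(cum, percentile / 100 * cum[-1])
        result[j] = array[order, j][idx]
    return result
```

With unit weights and an odd number of rows, each column's result matches the per-column unweighted median, which is the equivalence the proposed test would assert.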
sklearn/utils/tests/test_stats.py

    sw = np.ones(102, dtype=np.float64)
    sw[-1] = 0.0
    score = _weighted_percentile(y, sw, 50)
    assert score == 1
If we have a floating point number, just use pytest.approx(...) here and below.
    # some of the regressions scorers require strictly positive input.
    sensible_regr.fit(X_train, y_train + 1)
    if y_reg_train is None:
        sensible_regr.fit(X_train, y_train + 1)
y_train + 1 is to get only positive values? It seems that _require_positive_y would be much better, no?
I think so; not sure exactly what the purpose is of sometimes using values >= 0 and other times values >= 1. Will amend. Maybe this way is faster?
    X, y = make_classification(n_samples=101, random_state=0)
    _, y_ml = make_multilabel_classification(n_samples=X.shape[0],
                                             random_state=0)
    _, y_reg = make_regression(n_samples=X.shape[0], n_features=X.shape[1],
It seems that test_scorer_sample_weight was designed to test classification scorers.
I think it would be best to create two test functions, one dedicated to classification and one dedicated to regression. It will be easier to follow what is going on.
Yes it would be best

On Tue, 26 May 2020 at 11:08, Lucy Liu commented on this pull request, in sklearn/utils/stats.py:

        """
-   sorted_idx = np.argsort(array)
+   n_dim = array.ndim
+   if n_dim == 0:
+       return array[()]
+   if array.ndim == 1:
+       array = array.reshape((-1, 1))
+   if array.shape != sample_weight.shape:
+       sample_weight = sample_weight.reshape(array.shape)

    Thanks I didn't know this. Should I add a test for this case?

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
Thanks for the thorough review @glemaitre. I've added the test for a 2D array with 1D sample weight, and split test_scorer_sample_weight into classification and regression tests.
| err_msg="scorer {0} behaves differently when " | ||
| "ignoring samples and setting sample_weight to" | ||
| " 0: {1} vs {2}".format(name, weighted, | ||
| if name not in REGRESSION_SCORERS: |
There was a problem hiding this comment.
Wasn't sure how to do this outside of the for loop without copying the SCORERS dict (@glemaitre)
I think that this is fine. Another way could have been:

    if name in CLF_SCORERS:
Annoyingly CLF_SCORERS does not overlap with CLUSTER_SCORERS and MULTILABEL_ONLY_SCORERS.
(though REGRESSION_SCORERS overlaps with REQUIRE_POSITIVE_Y_SCORERS so I could adjust the classification test)
Ah ok, so it is good.
I might have done the following to avoid a level of indent and put a nice comment:

    if name in REGRESSION_SCORERS:
        # skip the regression scores since we evaluate the classification scores
        continue
glemaitre left a comment:

Apart from style, it looks good to me.
| err_msg="scorer {0} behaves differently when " | ||
| "ignoring samples and setting sample_weight to" | ||
| " 0: {1} vs {2}".format(name, weighted, | ||
| if name not in REGRESSION_SCORERS: |
There was a problem hiding this comment.
I think that this is fine. Another way could have been
if name in CLF_SCORERS:|
The last changes: I think we do not strictly check this part of the string, so the tests did not fail, but we would not print the right error message (it would be missing the variable values).
Reference Issues/PRs

Follows from #6217
Addresses last item in #3450

What does this implement/fix? Explain your changes.

Add sample_weight to median_absolute_error.
Use sklearn.utils.stats._weighted_percentile to calculate the weighted median, as suggested here: #6217 (comment)
Amended _weighted_percentile to calculate the weighted percentile along axis=0 (for each column) if the input array is 2D, to allow multioutput. This does not change the behaviour of _weighted_percentile when the array is 1D. Not sure if this is the best way to implement it, so I will wait for comments before fixing the tests.

Any other comments?