ENH Consistent checks for sample weights in linear models #15530
rth merged 7 commits into scikit-learn:master from
Conversation
          number.
-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
Passing a float seems like a bug.
It is treated as `float * np.ones_like(y)`. So not really a bug. Any further thoughts?
Yes, that's how we treat it in _check_sample_weight. Not sure there is a use-case for it, if not we could deprecate, but that's an issue orthogonal to this PR
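To illustrate the behavior being discussed, here is a minimal NumPy-only sketch of what `_check_sample_weight` effectively does for a scalar weight (simplified; `check_sample_weight_sketch` is a hypothetical helper, not the real scikit-learn implementation):

```python
import numpy as np

def check_sample_weight_sketch(sample_weight, X, dtype=np.float64):
    """Simplified sketch of _check_sample_weight's scalar handling."""
    n_samples = X.shape[0]
    if sample_weight is None:
        # default: every sample gets unit weight
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, (int, float)):
        # a scalar is broadcast to a constant weight vector,
        # i.e. equivalent to float * np.ones_like(y)
        return np.full(n_samples, sample_weight, dtype=dtype)
    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight has wrong shape")
    return sample_weight

X = np.zeros((4, 2))
print(check_sample_weight_sketch(2.0, X))  # [2. 2. 2. 2.]
```

Since a constant weight vector leaves most estimators' solutions unchanged, this is why the use-case for a float is questionable.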
          Target values
-    sample_weight : float or numpy array of shape [n_samples]
+    sample_weight : float or array-like of shape (n_samples,), default=None
Thanks @lorentzenchr ! LGTM, aside from the 2nd comment above. We merged other such PRs without additional tests, as there are already common tests for sample weights, and this is mostly a refactoring (with a few additional checks).
    sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
    X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=[np.float64],
                     multi_output=True, y_numeric=True)
What's the motivation for moving this here? I would revert to avoid cosmetic changes. In this case it would allow failing early for wrong alphas, but that shouldn't be a very common occurrence to optimize for.
_check_sample_weight uses X.dtype, therefore, as in most other places, I'd like to first check X and y and then sample_weight.
Yes, but both check_X_y and _check_sample_weight were already used here before, in that order. At least that's what the diff on GitHub says... All this does is skip the sample weight test if sample_weight is not None, so I was wondering why we needed to move it around.
I think my motivation was to have both checks together. So either move `_check_sample_weight` up or `check_X_y` down. As `if np.any(self.alpha <= 0)` seems to be the cheaper check, I decided for the latter. I'll revert that and do the former.
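The check ordering discussed above can be sketched as follows (a hypothetical `fit_sketch` helper using plain NumPy casts in place of the real `check_X_y` / `_check_sample_weight`, for illustration only):

```python
import numpy as np

def fit_sketch(X, y, alpha, sample_weight=None):
    """Illustrative ordering of input checks; not the actual Ridge.fit."""
    # cheap parameter check first: fail early on invalid alpha
    if np.any(np.asarray(alpha) <= 0):
        raise ValueError("alpha must be positive")
    # validate X and y next, so that X.dtype is fixed...
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # ...because the sample weights are then cast to X.dtype
    if sample_weight is None:
        sample_weight = np.ones(X.shape[0], dtype=X.dtype)
    else:
        sample_weight = np.asarray(sample_weight, dtype=X.dtype)
    return X, y, sample_weight
```

This keeps the X/y validation and the sample-weight validation adjacent, with the sample-weight cast depending on the already-validated `X.dtype`.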
-    check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
-              multi_output=True)
+    X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+                     multi_output=True, y_numeric=False)
Not really, but it is intricate. RidgeClassifierCV.fit calls _BaseRidgeCV.fit, which calls either Ridge.fit or _RidgeGCV.fit. The last two also do the usual `X, y = check_X_y`. Nevertheless, I prefer to have it explicitly there.
I don't mind either way, but so far this PR does mostly refactoring and style improvements. If you are going to address #15438, I imagine that might require a change of behavior? That might be easier to review in a separate bug fix PR with tests. In that case, removing the WIP tag might help getting a second review. BTW, if you are planning to add generic sample weight tests in the future, it might be worthwhile to add some of those directly as a
@agramfort Any other comments on this PR?
agramfort left a comment:
Besides my nitpick about having consistent docstrings for sample_weight, LGTM.
-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
         Individual weights for each sample. If sample_weight is not None and
         solver='auto', the solver will be set to 'cholesky'.
The docstring should say what sample_weight as a float means. It's not "individual weights" for each sample.
I hope you approve my short explanation on floats.
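To make the docstring point concrete: a float gives every sample the *same* weight, so it behaves like a uniformly scaled weight vector, not individual per-sample weights. A small NumPy illustration (using `np.average` as a stand-in for any weighted fit):

```python
import numpy as np

y = np.array([1.0, 3.0, 5.0, 7.0])

# a float sample_weight corresponds to w = c * np.ones_like(y);
# for a weighted mean the constant c cancels out entirely
w_uniform = 2.0 * np.ones_like(y)
w_varied = np.array([1.0, 1.0, 1.0, 5.0])  # genuinely individual weights

print(np.average(y, weights=w_uniform))  # 4.0, same as the plain mean
print(np.average(y, weights=w_varied))   # 5.5, pulled toward the heavy sample
```

This is why the docstring should describe the float case as a uniform weight rather than "individual weights for each sample".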
sklearn/linear_model/_ridge.py
Outdated
-    sample_weight : float or numpy array of shape [n_samples]
+    sample_weight : float or array-like of shape (n_samples,), default=None
         Individual weights for each sample
Please clarify the docstring here too.
Reference Issues/PRs
Fixes #15358 for linear_model.

What is done in this PR:
Use _check_sample_weight throughout linear models.