MRG fix Normalize for linear models when used with sample_weight #19426
ogrisel merged 63 commits into scikit-learn:main from maikia:normalize_fix_for_sample_weight
Conversation
---
this one should be good to go
---
@ogrisel let's see if it helps to understand the problem. I wrote from scratch a minimal ridge implementation; the line `X_var *= X.shape[0]` basically means you take a scale that is given by `n_samples`:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model._base import _preprocess_data

np.random.seed(0)
n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Check that setting the sw to 0 for the first 10 samples
# is equivalent to removing them:
sw = np.ones(X_train.shape[0])
sw[:10] = 0.

params = dict(
    fit_intercept=True,
    normalize=True,
    alpha=1,
    solver='lsqr',
)
ridge_sw_zero = Ridge(**params).fit(X_train, y_train, sample_weight=sw)
ridge_trimmed = Ridge(**params).fit(X_train[10:], y_train[10:])


def ridge(X, y, sample_weight, alpha, X_test, y_test):
    # Weighted mean and (biased) weighted variance of each feature:
    X_mean = np.average(X, weights=sample_weight, axis=0)
    X_var = (np.average(X ** 2, weights=sample_weight, axis=0) -
             np.average(X, weights=sample_weight, axis=0) ** 2)
    X_var *= X.shape[0]  # difference between normalize and standardize
    X_scale = np.sqrt(X_var)
    X = (X - X_mean) / X_scale
    y_mean = np.average(y, weights=sample_weight, axis=0)
    y = y - y_mean
    # Fold the sample weights into the least-squares problem:
    X *= np.sqrt(sample_weight)[:, None]
    y *= np.sqrt(sample_weight)
    coef = np.linalg.solve(
        X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y
    )
    # Map the coefficients back to the original feature scale:
    coef = coef / X_scale
    intercept = y_mean - np.dot(X_mean, coef)
    y_pred = np.dot(X_test, coef) + intercept
    print(r2_score(y_test, y_pred))
    return y_pred


assert_allclose(
    ridge_sw_zero.predict(X_test),
    ridge(X_train, y_train, sw, params['alpha'], X_test, y_test),
    rtol=1e-1,
)
assert_allclose(
    ridge_trimmed.predict(X_test),
    ridge(X_train[10:], y_train[10:], sw[10:], params['alpha'], X_test, y_test),
    rtol=1e-1,
)
assert_allclose(
    ridge(X_train, y_train, sw, params['alpha'], X_test, y_test),
    ridge(X_train[10:], y_train[10:], sw[10:], params['alpha'], X_test, y_test),
    rtol=1e-1,
)
```
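A side note on the `X_var *= X.shape[0]` line above: with `normalize=True`, each centered column is divided by its L2 norm, which equals `sqrt(n_samples * variance)`, whereas standardization divides by `sqrt(variance)` alone. A small unweighted sanity check of that identity (my own illustration, not part of the original snippet):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100)
x_centered = x - x.mean()
# L2 norm of a centered column == sqrt(n_samples * biased variance):
np.testing.assert_allclose(
    np.linalg.norm(x_centered),
    np.sqrt(len(x) * x_centered.var()),
)
```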
---
Here is what the new code after killing `normalize` should do, just solving the problem with L-BFGS:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from scipy.optimize import fmin_l_bfgs_b, check_grad

np.random.seed(0)
n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Check that setting the sw to 0 for the first 10 samples
# is equivalent to removing them:
sw = 10 * np.ones(X_train.shape[0])
sw[:10] = 0.
alpha = 1


def ridge_bfgs(X, y, sample_weight, alpha, X_test):
    def f(w):
        # Weighted squared loss plus L2 penalty on the coefficients
        # (the intercept is not penalized):
        coef, intercept = w[:-1], w[-1]
        return (
            0.5 * np.sum(sample_weight * (y - X @ coef - intercept) ** 2)
            + 0.5 * alpha * np.sum(coef ** 2)
        )

    def fprime(w):
        coef, intercept = w[:-1], w[-1]
        residual = y - X @ coef - intercept
        grad = np.empty(X.shape[1] + 1)
        grad[:-1] = -X.T @ (sample_weight * residual) + alpha * coef
        grad[-1] = -np.sum(sample_weight * residual)
        return grad

    dim = X.shape[1] + 1
    print(check_grad(f, fprime, np.random.RandomState(42).randn(dim)))
    w0 = np.zeros(dim)
    w0[-1] = np.average(y, weights=sample_weight)
    w = fmin_l_bfgs_b(f, w0, fprime)[0]
    coef, intercept = w[:-1], w[-1]
    y_pred = X_test @ coef + intercept
    return y_pred


y_pred = ridge_bfgs(X_train, y_train, sw, alpha, X_test)
y_pred_trimmed = ridge_bfgs(X_train[10:], y_train[10:], sw[10:], alpha, X_test)
assert_allclose(y_pred, y_pred_trimmed, rtol=1e-5)
```
---
Thanks @agramfort. I don't understand the need for the init of the intercept coef in the l-bfgs solver: `w0[-1] = np.average(y, weights=sample_weight)`. Since the problem is convex, initializing the intercept at 0 should work, no? I tried and it seems to work on my machine. @maikia I think it would be good to include Alex's last minimal snippet as a new test and compare that we can reproduce the same results with our …
---
Yes, `w0[-1] = np.average(y, weights=sample_weight)` is not necessary. It's just a better init, but it should not change anything.
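To double-check the convexity argument, here is a minimal sketch (my own, reusing `X_train`, `y_train`, `sw`, and `alpha` from the snippet above, and using approximate gradients for brevity) that the two inits reach the same solution:

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy.optimize import fmin_l_bfgs_b


def solve(w0, X=X_train, y=y_train, sample_weight=sw):
    # Same weighted ridge objective as in ridge_bfgs above.
    def f(w):
        coef, intercept = w[:-1], w[-1]
        return (
            0.5 * np.sum(sample_weight * (y - X @ coef - intercept) ** 2)
            + 0.5 * alpha * np.sum(coef ** 2)
        )
    return fmin_l_bfgs_b(f, w0, approx_grad=True)[0]


dim = X_train.shape[1] + 1
w_from_zero = solve(np.zeros(dim))
w0 = np.zeros(dim)
w0[-1] = np.average(y_train, weights=sw)  # the "better init"
w_from_avg = solve(w0)
# Convex problem: both inits converge to the same minimizer.
assert_allclose(w_from_zero, w_from_avg, rtol=1e-3)
```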
---
@agramfort in the l-bfgs snippet the output of `check_grad` is surprisingly large, but the gradient expression seems correct to me. Do you have any idea why that is the case? `dim` is 4 in this case, so I would expect this value to be much smaller.
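One likely explanation (my inference, not stated in the thread): `check_grad` returns the absolute 2-norm of the difference between the analytical gradient and a finite-difference estimate, so its output scales with the magnitude of `f`, which is large here (the weighted squared loss sums many terms on the order of 1e4 and more). A scale-free check, assuming `f`, `fprime`, and `dim` from inside `ridge_bfgs` are in scope:

```python
import numpy as np
from scipy.optimize import approx_fprime

w = np.random.RandomState(42).randn(dim)
eps = np.sqrt(np.finfo(float).eps)  # the default step used by check_grad
g_num = approx_fprime(w, f, eps)   # finite-difference gradient
g_ana = fprime(w)                  # analytical gradient
# Relative rather than absolute error; a correct gradient should give a
# small value here even when check_grad's absolute output looks large.
print(np.linalg.norm(g_ana - g_num) / np.linalg.norm(g_ana))
```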
---
I have extended the l-bfgs code snippet to compare with the scikit-learn implementation: https://gist.github.com/ogrisel/ea98add276dc9624fea8e41a5bb6c989 I have also extended it (in the same gist) to add a check for the equivalence between fitting with …
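The gist is only linked, not inlined; for reference, a typical equivalence check of this kind (my assumption about its shape, not a quote from the gist) is that fitting with integer sample weights matches fitting on correspondingly repeated rows, which is exactly what `normalize=True` breaks without this PR:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=50, n_features=3, random_state=0)
rng = np.random.RandomState(0)
sw = rng.randint(0, 3, size=X.shape[0])  # integer weights, some zero

# Fitting with integer sample weights ...
ridge_w = Ridge(alpha=1.0).fit(X, y, sample_weight=sw)
# ... should match fitting on the correspondingly repeated rows,
# since both define the exact same penalized objective.
X_rep = np.repeat(X, sw, axis=0)
y_rep = np.repeat(y, sw)
ridge_rep = Ridge(alpha=1.0).fit(X_rep, y_rep)

assert_allclose(ridge_w.coef_, ridge_rep.coef_, rtol=1e-6)
assert_allclose(ridge_w.intercept_, ridge_rep.intercept_, rtol=1e-6)
```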
---
The remaining failures happen in the collapsed details of the CI log. Note that in #19503 I extended the test to also cover the case where we pass … I fail to see a pattern in the failures: it can fail on either float32 or float64 (but not always for either); sometimes it's all the solvers for a given task that are not consistent together with the SVD solver, sometimes it's only the sparse-cg solver...
---
Reading the error message more carefully, it seems that the PR causes some coefs to have …
---
Ok, I think I fixed the problem with the handling of near-constant features.
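The fix itself is not quoted in the thread. For illustration only (my sketch, not necessarily the PR's exact change, and with an illustrative threshold), a common guard is to leave near-zero feature scales at 1 so that scaling never divides by roughly zero:

```python
import numpy as np


def safe_scale(X, sample_weight):
    # Weighted mean and (biased) weighted variance per feature;
    # near-constant features have a variance close to 0.
    X_mean = np.average(X, weights=sample_weight, axis=0)
    X_var = np.average((X - X_mean) ** 2, weights=sample_weight, axis=0)
    X_scale = np.sqrt(X_var)
    # Dividing by a near-zero scale inflates coefficients (or yields
    # nan/inf); replacing such scales by 1 leaves those features
    # effectively untouched. The threshold here is an arbitrary choice.
    X_scale[X_scale < 10 * np.finfo(X.dtype).eps] = 1.0
    return (X - X_mean) / X_scale, X_mean, X_scale
```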
---
I re-enabled the test for the failing edge case (non-zero constant feature on sparse data). I think we need to understand what's going on in this case. I am still not sure, but at least the error is simpler now. Then we will need to understand the discrepancy with @agramfort's minimal ridge implementations.
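To make the edge case concrete, a hypothetical reproduction (illustrative data and parameters, not the actual test): sparse input cannot be centered without densifying, so a non-zero constant column exercises a different code path than the dense case, and the two may disagree:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
X[:, 0] = 1.0  # non-zero constant feature
X_sp = sparse.csr_matrix(X)
y = rng.randn(20)

dense_coef = Ridge(alpha=1.0).fit(X, y).coef_
sparse_coef = Ridge(alpha=1.0, solver='sparse_cg').fit(X_sp, y).coef_
# Whether these agree depends on how the constant feature is handled:
print(dense_coef)
print(sparse_coef)
```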
---
Ok, I put more details in #19450 (comment). I can solve the problem here, but @maikia is right and we also have it in …
---
ogrisel left a comment:
LGTM, let's merge this one. The remaining problems are properly documented in the linked issues.
---
Thanks @ogrisel!

---
`_preprocess_data` in `_base` for linear models, when invoked with `normalize=True` and `sample_weight`, does not weight the standard deviation and therefore does not behave as expected. This PR updates it.

Currently we are working on the deprecation of `normalize`: #3020

As pointed out by @ogrisel in #17743 (comment), when testing for the same result between running lasso or ridge with `normalize` set to True and running them in a pipeline with StandardScaler with `normalize` set to False, the test fails. After thorough investigation, @agramfort realized the above-mentioned issue. If not updated, we cannot expect the pipeline with a StandardScaler to give the same results as a linear model with `normalize` set to True.

This issue has also been noted previously by @jnothman in:
scikit-learn/sklearn/linear_model/tests/test_base.py, line 473 at 1ea7905

Possibly it might also be related to: #17444

cc @glemaitre