Horrible regression coefficients in LinearRegression with collinearity #9073
Closed
Description
Example:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
X = pd.DataFrame(np.random.random((1000, 3)))
X[3] = X[2] + 2
y = np.random.random(1000)
lr = LinearRegression()
lr.fit(X, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
print(lr.coef_)
# array([ -1.55090722e-02, -1.76455116e-02, -3.15745116e+11, 3.15745116e+11])
print(lr.intercept_)
# -631490232794.42896
The last two regression coefficients are technically valid solutions, because the problem is underdetermined, but their huge magnitudes make them useless for prediction and interpretation. It would be much nicer if they were set to something a bit more sane, and the intercept adjusted accordingly.
Perhaps it would make sense to check for perfect collinearity when the intercept is above some threshold, and warn about it.
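A collinearity check of the kind suggested above could be sketched as follows (hypothetical helper, not an existing scikit-learn API): warn when the design matrix, with a constant column appended for the intercept, is rank-deficient.

```python
import warnings
import numpy as np

def check_collinearity(X):
    """Hypothetical check: return True (and warn) if X plus an intercept
    column is rank-deficient, i.e. perfectly collinear."""
    X = np.asarray(X, dtype=float)
    # Append a constant column so collinearity with the intercept is caught too.
    Xc = np.hstack([X, np.ones((X.shape[0], 1))])
    if np.linalg.matrix_rank(Xc) < Xc.shape[1]:
        warnings.warn("Design matrix is rank-deficient; "
                      "coefficients are not uniquely determined.")
        return True
    return False

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.hstack([X, X[:, [2]] + 2])  # column 3 = column 2 + 2, perfectly collinear
print(check_collinearity(X))       # True
```

The rank test catches exact collinearity; near-collinearity would need a tolerance on the singular values instead.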
Then perhaps it would make sense to re-regress with one of the variables removed, and set both of their coefficients to half of the single fitted coefficient. This would at least produce sane predictions in most cases, but perhaps there are cases where this would be bad, and it should be left up to the user.
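For the pair of collinear columns in the example above, the re-regress-and-split idea could look like this (a manual sketch of the proposal, not scikit-learn behaviour; the intercept correction assumes the known +2 offset between the two columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.hstack([X, X[:, [2]] + 2])  # column 3 = column 2 + 2 (perfect collinearity)
y = rng.random_sample(1000)

# Refit with the duplicate column dropped, then spread its coefficient
# evenly over both collinear columns.
lr = LinearRegression().fit(X[:, :3], y)
half = lr.coef_[2] / 2.0
coef = np.array([lr.coef_[0], lr.coef_[1], half, half])
# Compensate the intercept for the +2 shift carried by column 3.
intercept = lr.intercept_ - half * 2.0

# Predictions are unchanged, but the coefficients are now sane.
pred = X @ coef + intercept
print(np.allclose(pred, lr.predict(X[:, :3])))  # True
```

This only works when the relationship between the collinear columns is known; in general a minimum-norm solver (e.g. `np.linalg.lstsq`) achieves a similar effect automatically.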