
Horrible regression coefficients in LinearRegression with collinearity #9073

@naught101

Description


Example:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


X = pd.DataFrame(np.random.random((1000, 3)))
X[3] = X[2] + 2

y = np.random.random(1000)

lr = LinearRegression()
lr.fit(X, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

print(lr.coef_)
# array([ -1.55090722e-02,  -1.76455116e-02,  -3.15745116e+11, 3.15745116e+11])

print(lr.intercept_)
# -631490232794.42896

These last two regression coefficients are technically correct, because the problem is underdetermined (the fourth column is an exact linear combination of the third column and the intercept), but they are useless for prediction. It would be much nicer if they were set to something a bit more sane, with the intercept adjusted accordingly.
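For reference (not part of the original report): when the design matrix is rank-deficient, `np.linalg.lstsq` returns the minimum-norm least-squares solution, which keeps the coefficients small. A sketch of the same setup with an explicit intercept column:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.column_stack([X, X[:, 2] + 2])   # column 3 is perfectly collinear with column 2

y = rng.random_sample(1000)

# Add an explicit intercept column and solve with np.linalg.lstsq, which
# returns the minimum-norm solution for rank-deficient systems.
A = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(rank)   # 4, not 5: the system is rank-deficient
print(coef)   # all coefficients stay small
```

The reported rank makes the deficiency detectable, and the minimum-norm solution gives the same predictions without the exploding coefficient pair.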

Perhaps it would make sense to check for perfect collinearity when the intercept exceeds some threshold, and warn about it.
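A rough sketch of what such a check could look like (the helper name is illustrative, not scikit-learn API): a rank test on the design matrix, including the intercept column, is enough to detect perfect collinearity.

```python
import warnings

import numpy as np


def warn_if_rank_deficient(X, fit_intercept=True):
    """Warn when the design matrix is perfectly collinear (illustrative only)."""
    A = np.column_stack([np.ones(len(X)), X]) if fit_intercept else np.asarray(X)
    rank = np.linalg.matrix_rank(A)
    if rank < A.shape[1]:
        warnings.warn(
            "Design matrix is rank-deficient (rank %d < %d columns); "
            "coefficients are not uniquely determined." % (rank, A.shape[1])
        )


rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.column_stack([X, X[:, 2] + 2])
warn_if_rank_deficient(X)  # triggers the warning
```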

Then perhaps it would make sense to re-regress with one of the collinear variables removed, and set both of their coefficients to half of the single fitted coefficient. This would at least produce sane predictions in most cases, but perhaps there are cases where this would be bad and it should be left up to the user.
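The drop-and-split suggestion above can be sketched directly (a toy illustration, not a proposed implementation): refit without the redundant column, halve its coefficient across the two collinear columns, and shift the intercept by the constant offset between them.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.column_stack([X, X[:, 2] + 2])   # column 3 = column 2 + 2
y = rng.random_sample(1000)

# Refit ordinary least squares with the redundant column dropped.
X_reduced = X[:, :3]
A = np.column_stack([np.ones(len(X_reduced)), X_reduced])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, betas = coef[0], coef[1:]

# Split the shared coefficient between the two collinear columns and
# adjust the intercept for the constant offset of the duplicate.
shared = betas[2] / 2.0
full_coefs = np.array([betas[0], betas[1], shared, shared])
full_intercept = intercept - shared * 2.0

pred_reduced = A @ coef
pred_full = full_intercept + X @ full_coefs
print(np.allclose(pred_reduced, pred_full))
```

Because the split is exact for this perfectly collinear pair, both parameterizations give identical predictions; the ambiguity only moves into how the shared weight is allocated.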
