
Horrible regression coefficients in LinearRegression with collinearity #9073

@naught101

Description


Example:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


X = pd.DataFrame(np.random.random((1000, 3)))
X[3] = X[2] + 2

y = np.random.random(1000)

lr = LinearRegression()
lr.fit(X, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

print(lr.coef_)
# array([ -1.55090722e-02,  -1.76455116e-02,  -3.15745116e+11, 3.15745116e+11])

print(lr.intercept_)
# -631490232794.42896

These last two regression coefficients are technically correct, because the problem is underdetermined (the fourth column is an exact linear combination of the third column and the intercept), but they are useless for prediction. It would be much nicer if they were set to something a bit more sane, with the intercept adjusted accordingly.
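For reference (not part of the original report): when the design matrix is rank-deficient, `np.linalg.lstsq` returns the minimum-norm least-squares solution, which keeps the coefficients small. A sketch of the same setup with an explicit intercept column:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.column_stack([X, X[:, 2] + 2])   # column 3 is perfectly collinear with column 2

y = rng.random_sample(1000)

# Add an explicit intercept column and solve with np.linalg.lstsq, which
# returns the minimum-norm solution for rank-deficient systems.
A = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(rank)   # 4, not 5: the system is rank-deficient
print(coef)   # all coefficients stay small
```

The reported rank makes the deficiency detectable, and the minimum-norm solution gives the same predictions without the exploding coefficient pair.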

Perhaps it would make sense to check for perfect collinearity when the intercept exceeds some threshold, and warn about it.
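A rough sketch of what such a check could look like (the helper name is illustrative, not scikit-learn API): a rank test on the design matrix, including the intercept column, is enough to detect perfect collinearity.

```python
import warnings

import numpy as np


def warn_if_rank_deficient(X, fit_intercept=True):
    """Warn when the design matrix is perfectly collinear (illustrative only)."""
    A = np.column_stack([np.ones(len(X)), X]) if fit_intercept else np.asarray(X)
    rank = np.linalg.matrix_rank(A)
    if rank < A.shape[1]:
        warnings.warn(
            "Design matrix is rank-deficient (rank %d < %d columns); "
            "coefficients are not uniquely determined." % (rank, A.shape[1])
        )


rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.column_stack([X, X[:, 2] + 2])
warn_if_rank_deficient(X)  # triggers the warning
```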

Then perhaps it would make sense to re-regress with one of the collinear variables removed, and set both of their coefficients to half of the single fitted coefficient. This would at least produce sane predictions in most cases, but perhaps there are cases where this would be bad and it should be left up to the user.
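The drop-and-split suggestion above can be sketched directly (a toy illustration, not a proposed implementation): refit without the redundant column, halve its coefficient across the two collinear columns, and shift the intercept by the constant offset between them.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 3))
X = np.column_stack([X, X[:, 2] + 2])   # column 3 = column 2 + 2
y = rng.random_sample(1000)

# Refit ordinary least squares with the redundant column dropped.
X_reduced = X[:, :3]
A = np.column_stack([np.ones(len(X_reduced)), X_reduced])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, betas = coef[0], coef[1:]

# Split the shared coefficient between the two collinear columns and
# adjust the intercept for the constant offset of the duplicate.
shared = betas[2] / 2.0
full_coefs = np.array([betas[0], betas[1], shared, shared])
full_intercept = intercept - shared * 2.0

pred_reduced = A @ coef
pred_full = full_intercept + X @ full_coefs
print(np.allclose(pred_reduced, pred_full))
```

Because the split is exact for this perfectly collinear pair, both parameterizations give identical predictions; the ambiguity only moves into how the shared weight is allocated.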
