[MRG+2] Ridge linear model dtype consistency (all solvers but sag)#9033
jnothman merged 7 commits into scikit-learn:master
Conversation
All credit goes to @ncordier
Oops, there is an assertion error!
Should we also add a test for _preprocess_data, directly in test_base.py, to check the dtypes of X, X_scale and X_offset? It would check them with all possible options, including a sparse input X.
You also need to:
- update _solve_cholesky_kernel, and test it (needs n_features > n_samples)
- update check_X_y in _RidgeGCV and test RidgeCV
- test directly the function ridge_regression, which is also public
- test dtype consistency with a sparse X input matrix
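For illustration, here is a minimal numpy-only sketch of the kind of dtype-consistency check these tests perform. The helper ridge_normal_eq is a hypothetical stand-in for scikit-learn's ridge_regression (it solves the regularized normal equations directly), used only so the check runs without scikit-learn:

```python
import numpy as np

def ridge_normal_eq(X, y, alpha):
    # Hypothetical stand-in for sklearn's ridge_regression:
    # solve (X^T X + alpha I) w = X^T y in the dtype of X
    A = X.T @ X + alpha * np.eye(X.shape[1], dtype=X.dtype)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.RandomState(0)
X_64 = rng.randn(20, 5)
y_64 = rng.randn(20)
X_32, y_32 = X_64.astype(np.float32), y_64.astype(np.float32)

coef_64 = ridge_normal_eq(X_64, y_64, alpha=1.0)
coef_32 = ridge_normal_eq(X_32, y_32, alpha=1.0)

# The dtype of the coefficients should follow the dtype of the input
assert coef_64.dtype == np.float64
assert coef_32.dtype == np.float32
```

The real tests in test_ridge.py additionally compare the float32 and float64 coefficient values for closeness; the sketch above only shows the dtype contract.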
@@ -111,7 +111,7 @@ def _solve_cholesky(X, y, alpha):
    return linalg.solve(A, Xy, sym_pos=True,
What happens in this case? Does linalg.solve return the correct dtype? Is it tested?
Isn't this handled here? If the right test is not passed, eventually it would get propagated up and caught there.
I double-checked it; it does pass through. A and Xy are float32 and float64 (as expected) and linalg.solve preserves the dtype.
So, all good. And both branches are tested.
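A quick check of the dtype-preservation claim, using np.linalg.solve as a stand-in for scipy.linalg.solve (both dispatch to LAPACK, which has separate single- and double-precision routines):

```python
import numpy as np

# A small symmetric positive-definite system in both precisions
A = np.array([[2.0, 0.5], [0.5, 1.0]], dtype=np.float32)
b = np.array([1.0, 2.0], dtype=np.float32)

x32 = np.linalg.solve(A, b)
x64 = np.linalg.solve(A.astype(np.float64), b.astype(np.float64))

# The solver keeps the precision of its inputs
print(x32.dtype, x64.dtype)  # float32 float64
```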
sklearn/linear_model/ridge.py
Outdated
else:
    X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                   dtype=np.float64)
+                   dtype='numeric')
any reason you use 'numeric' here and [np.float64, np.float32] below?
Ignore this change. I don't know how it made it through. SAGA is not touched in this PR
I went with numeric because I thought it was more generic. I think you are right though and we should go for [np.float64, np.float32] instead. However, the dtype consistency test should fail if you stick with np.float64.
Are you sure the test currently fails when you stick with np.float64?
It seems to me that it only tests self.coef_. Not sure how to test it though...
I can test it again. However, as I understand it, the goal of this PR is:
- to ensure dtype consistency between the input (X) and the output (self.coef_),
- to make sure the computations use X with its original dtype (to reduce memory footprint, as asked in LogisticRegression convert to float64 #8769).
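The memory argument behind the second bullet is easy to see with a small numpy example (array sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

X_64 = np.ones((1000, 100), dtype=np.float64)
X_32 = X_64.astype(np.float32)

# float32 halves the memory footprint of the data matrix,
# which is what aggressive upcasting to float64 would throw away
print(X_64.nbytes)  # 800000 bytes
print(X_32.nbytes)  # 400000 bytes
```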
Wouldn't sticking with np.float64 cancel the second bullet point?
Just for reference: test_ridge.py does not appear to make use of unit tests, so the way I test the code is by adding the following lines at the end of my local version of test_ridge.py:
if __name__ == "__main__":
test_dtype_match()
Wouldn't sticking with np.float64 cancel the second bullet point?
Yes it would, my point is that we don't have a test to prove it. I am not sure it is easy to test it though.
the way I test the code is by adding the following lines at the end of my local version
You can use nosetests sklearn/linear_model/tests/test_ridge.py -v or pytest sklearn/linear_model/tests/test_ridge.py -v
To run one particular test, you can use nosetests sklearn/linear_model/tests/test_ridge.py:test_dtype_match or pytest sklearn/linear_model/tests/test_ridge.py::test_dtype_match
sklearn/linear_model/ridge.py
Outdated
    X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                   dtype=np.float64)
+                   dtype=_dtype)
    y = check_array(y, dtype='numeric', ensure_2d=False)
In order to have X and y with the exact same dtype (to take advantage of BLAS), what do you guys think about the following line?
y = check_array(y, dtype=X.dtype, ensure_2d=False)
We let X be whatever is in _dtype, and then force the dtypes of X and y to be equal.
This is related to #8976 .
y = check_array(y, dtype=X.dtype, ensure_2d=False)
+1
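The effect of the suggested line, sketched with plain numpy (np.asarray stands in here for the dtype coercion that check_array performs):

```python
import numpy as np

X = np.ones((4, 2), dtype=np.float32)
y = np.array([0, 1, 0, 1])  # integer targets, as often passed by users

# check_array(y, dtype=X.dtype, ...) boils down to a cast like this,
# so X and y share one dtype and BLAS can run entirely in single precision
y = np.asarray(y, dtype=X.dtype)

print(y.dtype)  # float32
```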
@TomDLT any other comments?
sklearn/linear_model/base.py
Outdated
    X_offset, X_var = mean_variance_axis(X, axis=0)
    if not return_mean:
-       X_offset = np.zeros(X.shape[1])
+       X_offset = np.zeros(X.shape[1], dtype=X.dtype)
Would it be better to do X_offset[:] = 0 instead?
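Both variants preserve the dtype; the in-place fill just reuses the already-computed array instead of allocating a new one. A small sketch of the two options:

```python
import numpy as np

X = np.empty((10, 3), dtype=np.float32)

# Variant 1 (as in the diff): allocate a fresh zero array with an explicit dtype
X_offset = np.zeros(X.shape[1], dtype=X.dtype)

# Variant 2 (the suggestion): zero an existing array in place,
# keeping its buffer and dtype
X_offset2 = np.empty(X.shape[1], dtype=X.dtype)
X_offset2[:] = 0

print(X_offset.dtype, X_offset2.dtype)  # float32 float32
```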
    # Do the actual checks at once for easier debug
    assert_equal(coef_32.dtype, X_32.dtype)
    assert_equal(coef_64.dtype, X_64.dtype)
You need one more empty line here to be PEP8 compliant.
    assert_equal(coef_32.dtype, X_32.dtype)
    assert_equal(coef_64.dtype, X_64.dtype)


def test_dtype_match_cholesky():
I think it would be useful to have a comment here saying that this test differs from the one above because it tests with multiple targets.
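The multi-target case exercises a different code path (the Cholesky solver handles a 2-D y), but the dtype contract is the same. A numpy-only sketch of the idea, using lstsq as a stand-in solver:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 4).astype(np.float32)
Y = rng.randn(10, 3).astype(np.float32)  # three targets solved at once

# Multi-target least squares also keeps the input dtype:
# the coefficient matrix has one column per target
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(coef.shape, coef.dtype)  # (4, 3) float32
```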
A few details need to be addressed (including getting the travis flake8 check to like your PR), but I am giving my +1 conditional on addressing these.
besides that, +1 for MRG when Travis is happy
@agramfort Travis is happy; AppVeyor is not, due to master.
LGTM! +1 for MRG after updating what's new
Thanks @massich!
    if self.solver in ['svd', 'sparse_cg', 'cholesky', 'lsqr']:
        _dtype = [np.float64, np.float32]
    else:
        _dtype = np.float64
@massich I forgot to comment here, it would be better to change the condition to assume that by default all solvers should accept both dtypes unless there is a known exception (preferably tracked by an issue):
if self.solver == 'lbfgs':
    # scipy lbfgs does not support float32 yet:
    # https://github.com/scipy/scipy/issues/4873
    _dtype = np.float64
else:
    # all other solvers work at both float precision levels
    _dtype = [np.float64, np.float32]
I have not tested the above change. Please feel free to do a PR if it works as expected.
I made a mistake, I thought this was LogisticRegression. For Ridge the only unsupported solvers remaining are sag and saga.
I think this comment should be in #8769, so that everyone can make sure that the default policy is to support both dtypes.
I'll post it there.
    # Do the actual checks at once for easier debug
    assert_equal(coef_32.dtype, X_32.dtype)
    assert_equal(coef_64.dtype, X_64.dtype)
You should also add a check for the dtypes of ridge_64.predict(X_64) and ridge_32.predict(X_32).
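Checking the prediction dtype matters because predict is essentially a matrix product with the fitted coefficients. A numpy-only sketch of why the check should pass once coef_ has the right dtype (names here are illustrative, not sklearn attributes):

```python
import numpy as np

rng = np.random.RandomState(0)
X_32 = rng.randn(8, 3).astype(np.float32)
coef_32 = rng.randn(3).astype(np.float32)
intercept_32 = np.float32(0.5)

# A linear model's predict boils down to X @ coef_ + intercept_,
# so the prediction dtype follows the coefficient/input dtype
pred_32 = X_32 @ coef_32 + intercept_32
print(pred_32.dtype)  # float32
```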
Reference Issue
Works on #8769 for the Ridge case.
What does this implement/fix? Explain your changes.
Avoids Ridge aggressively casting the data to np.float64 when np.float32 is supplied.

Any other comments?