[MRG+2] Ridge linear model dtype consistency (all solvers but sag)#9033
jnothman merged 7 commits into scikit-learn:master
Conversation
All credit goes to @ncordier
Oops, there is an assertion error!
Should we also add a test for _preprocess_data, directly in test_base.py, to check the dtypes of X, X_scale and X_offset? It would check them with all possible options, including a sparse input X.
You also need to:
- update _solve_cholesky_kernel, and test it (needs n_features > n_samples)
- update check_X_y in _RidgeGCV and test RidgeCV
- test directly the function ridge_regression, which is also public
- test dtype consistency with a sparse X input matrix
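For illustration, here is a minimal numpy-only sketch of the kind of dtype-consistency check these tests perform. The helper ridge_normal_eq is a hypothetical stand-in for scikit-learn's ridge_regression (it solves the regularized normal equations directly), used only so the check runs without scikit-learn:

```python
import numpy as np

def ridge_normal_eq(X, y, alpha):
    # Hypothetical stand-in for sklearn's ridge_regression:
    # solve (X^T X + alpha I) w = X^T y in the dtype of X
    A = X.T @ X + alpha * np.eye(X.shape[1], dtype=X.dtype)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.RandomState(0)
X_64 = rng.randn(20, 5)
y_64 = rng.randn(20)
X_32, y_32 = X_64.astype(np.float32), y_64.astype(np.float32)

coef_64 = ridge_normal_eq(X_64, y_64, alpha=1.0)
coef_32 = ridge_normal_eq(X_32, y_32, alpha=1.0)

# The dtype of the coefficients should follow the dtype of the input
assert coef_64.dtype == np.float64
assert coef_32.dtype == np.float32
```

The real tests in test_ridge.py additionally compare the float32 and float64 coefficient values for closeness; the sketch above only shows the dtype contract.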
@@ -111,7 +111,7 @@ def _solve_cholesky(X, y, alpha):
    return linalg.solve(A, Xy, sym_pos=True,
What happens in this case? Does linalg.solve return the correct dtype? Is it tested?
Isn't this handled here? If the right test is not passed, eventually it would get propagated up and caught there.
I double-checked it; it does pass through. A and Xy are float32 and float64 (as expected) and linalg.solve preserves the dtype.
So, all good. And both branches are tested.
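A quick check of the dtype-preservation claim, using np.linalg.solve as a stand-in for scipy.linalg.solve (both dispatch to LAPACK, which has separate single- and double-precision routines):

```python
import numpy as np

# A small symmetric positive-definite system in both precisions
A = np.array([[2.0, 0.5], [0.5, 1.0]], dtype=np.float32)
b = np.array([1.0, 2.0], dtype=np.float32)

x32 = np.linalg.solve(A, b)
x64 = np.linalg.solve(A.astype(np.float64), b.astype(np.float64))

# The solver keeps the precision of its inputs
print(x32.dtype, x64.dtype)  # float32 float64
```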
sklearn/linear_model/ridge.py
Outdated
else:
    X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                   dtype=np.float64)
+                   dtype='numeric')
any reason you use 'numeric' here and [np.float64, np.float32] below?
Ignore this change. I don't know how it made it through. SAGA is not touched in this PR
I went with numeric because I thought it was more generic. I think you are right though and we should go for [np.float64, np.float32] instead. However, the dtype consistency test should fail if you stick with np.float64.
Are you sure the test currently fails when you stick with np.float64?
It seems to me that it only tests self.coef_. Not sure how to test it though...
I can test it again. However, as I understand it, the goal of this PR is:
- to ensure dtype consistency between the input (X) and the output (self.coef_),
- to make sure the computations use X with its original dtype (to reduce memory footprint, as asked in LogisticRegression convert to float64 #8769).
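The memory argument behind the second bullet is easy to see with a small numpy example (array sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

X_64 = np.ones((1000, 100), dtype=np.float64)
X_32 = X_64.astype(np.float32)

# float32 halves the memory footprint of the data matrix,
# which is what aggressive upcasting to float64 would throw away
print(X_64.nbytes)  # 800000 bytes
print(X_32.nbytes)  # 400000 bytes
```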
Wouldn't sticking with np.float64 cancel the second bullet point?
Just for reference: test_ridge.py does not appear to make use of unit tests, so the way I test the code is by adding the following lines at the end of my local version of test_ridge.py:
if __name__ == "__main__":
test_dtype_match()
Wouldn't sticking with np.float64 cancel the second bullet point?
Yes it would, my point is that we don't have a test to prove it. I am not sure it is easy to test it though.
the way I test the code is by adding the following lines at the end of my local version
You can use nosetests sklearn/linear_model/tests/test_ridge.py -v or pytest sklearn/linear_model/tests/test_ridge.py -v
To run one particular test, you can use nosetests sklearn/linear_model/tests/test_ridge.py:test_dtype_match or pytest sklearn/linear_model/tests/test_ridge.py::test_dtype_match
sklearn/linear_model/ridge.py
Outdated
    X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
-                   dtype=np.float64)
+                   dtype=_dtype)
    y = check_array(y, dtype='numeric', ensure_2d=False)
In order to have X and y with the exact same dtype (to take advantage of BLAS), what do you guys think about the following line?
y = check_array(y, dtype=X.dtype, ensure_2d=False)
We let X be whatever is in _dtype, and then force the dtypes of X and y to be equal.
This is related to #8976 .
y = check_array(y, dtype=X.dtype, ensure_2d=False)
+1
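The effect of the suggested line, sketched with plain numpy (np.asarray stands in here for the dtype coercion that check_array performs):

```python
import numpy as np

X = np.ones((4, 2), dtype=np.float32)
y = np.array([0, 1, 0, 1])  # integer targets, as often passed by users

# check_array(y, dtype=X.dtype, ...) boils down to a cast like this,
# so X and y share one dtype and BLAS can run entirely in single precision
y = np.asarray(y, dtype=X.dtype)

print(y.dtype)  # float32
```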
@TomDLT any other comments?
sklearn/linear_model/base.py
Outdated
    X_offset, X_var = mean_variance_axis(X, axis=0)
    if not return_mean:
-       X_offset = np.zeros(X.shape[1])
+       X_offset = np.zeros(X.shape[1], dtype=X.dtype)
Would it be better to do X_offset[:] = 0 instead?
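Both variants preserve the dtype; the in-place fill just reuses the already-computed array instead of allocating a new one. A small sketch of the two options:

```python
import numpy as np

X = np.empty((10, 3), dtype=np.float32)

# Variant 1 (as in the diff): allocate a fresh zero array with an explicit dtype
X_offset = np.zeros(X.shape[1], dtype=X.dtype)

# Variant 2 (the suggestion): zero an existing array in place,
# keeping its buffer and dtype
X_offset2 = np.empty(X.shape[1], dtype=X.dtype)
X_offset2[:] = 0

print(X_offset.dtype, X_offset2.dtype)  # float32 float32
```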
    # Do the actual checks at once for easier debug
    assert_equal(coef_32.dtype, X_32.dtype)
    assert_equal(coef_64.dtype, X_64.dtype)
You need one more empty line here to be PEP8 compliant.
    assert_equal(coef_32.dtype, X_32.dtype)
    assert_equal(coef_64.dtype, X_64.dtype)


def test_dtype_match_cholesky():
I think it would be useful to have a comment here saying that this test differs from the one above because it tests with multiple targets.
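The multi-target case exercises a different code path (the Cholesky solver handles a 2-D y), but the dtype contract is the same. A numpy-only sketch of the idea, using lstsq as a stand-in solver:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 4).astype(np.float32)
Y = rng.randn(10, 3).astype(np.float32)  # three targets solved at once

# Multi-target least squares also keeps the input dtype:
# the coefficient matrix has one column per target
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(coef.shape, coef.dtype)  # (4, 3) float32
```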
A few details need to be addressed (including getting the travis flake8 check to like your PR), but I am giving my +1 conditional on addressing these.
besides that, +1 for MRG when Travis is happy
@agramfort Travis is happy; AppVeyor is not, due to master.
LGTM! +1 for MRG after updating what's new
Thanks @massich!
    if self.solver in ['svd', 'sparse_cg', 'cholesky', 'lsqr']:
        _dtype = [np.float64, np.float32]
    else:
        _dtype = np.float64
@massich I forgot to comment here, it would be better to change the condition to assume that by default all solvers should accept both dtypes unless there is a known exception (preferably tracked by an issue):
if self.solver == 'lbfgs':
    # scipy lbfgs does not support float32 yet:
    # https://github.com/scipy/scipy/issues/4873
    _dtype = np.float64
else:
    # all other solvers work at both float precision levels
    _dtype = [np.float64, np.float32]
I have not tested the above change. Please feel free to do a PR if it works as expected.
I made a mistake, I thought this was LogisticRegression. For Ridge the only unsupported solvers remaining are sag and saga.
I think this comment should be in #8769, so that everyone can make sure that the default policy is to support both dtypes.
I'll post it there.
    # Do the actual checks at once for easier debug
    assert_equal(coef_32.dtype, X_32.dtype)
    assert_equal(coef_64.dtype, X_64.dtype)
You should also add a check for the dtypes of ridge_64.predict(X_64) and ridge_32.predict(X_32).
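Checking the prediction dtype matters because predict is essentially a matrix product with the fitted coefficients. A numpy-only sketch of why the check should pass once coef_ has the right dtype (names here are illustrative, not sklearn attributes):

```python
import numpy as np

rng = np.random.RandomState(0)
X_32 = rng.randn(8, 3).astype(np.float32)
coef_32 = rng.randn(3).astype(np.float32)
intercept_32 = np.float32(0.5)

# A linear model's predict boils down to X @ coef_ + intercept_,
# so the prediction dtype follows the coefficient/input dtype
pred_32 = X_32 @ coef_32 + intercept_32
print(pred_32.dtype)  # float32
```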
Reference Issue
Works on #8769 for the Ridge case.
What does this implement/fix? Explain your changes.
Avoids Ridge aggressively casting the data to np.float64 when np.float32 is supplied.

Any other comments?