[MRG] Large sparse matrix support#11327
Conversation
…sparse to check_array and new function sparse_indices_check
…into sparse_indices_check Merging changes between fork and master:
…arse validation function has been extended to COO and CSC too
…into work Aligning with the top
def test_check_array_accept_large_sparse_no_exception(X_64bit):
    # When large sparse are allowed
    if LARGE_SPARSE_SUPPORTED:
Maybe mark this with pytest.mark.skipif(not LARGE_SPARSE_SUPPORTED) instead? Otherwise we will be generating 64-bit sparse matrices in the fixture with a scipy that doesn't support it.
No, we want to do this: we want to test what happens if the user passes one that they could not have constructed directly with scipy.
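As context for that scenario: even on a scipy that would never produce one itself, a 64-bit-index sparse matrix can be forced by casting the index arrays by hand. A minimal sketch (variable names are illustrative, not from the PR):

```python
import numpy as np
import scipy.sparse as sp

# Build an ordinary CSR matrix, then force 64-bit index arrays onto it,
# mimicking what a user (or a very large dataset) might hand to check_array.
X = sp.csr_matrix(np.eye(3))
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(X.indices.dtype)  # int64
```

This is essentially what the X_64bit fixture has to do, which is why the test exercises it rather than skipping on old scipy.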
def test_check_array_accept_large_sparse_raise_exception(X_64bit):
    print(X_64bit)
And strangely, I think the forgotten print statement was causing the test failure on old scipy.
sklearn/utils/validation.py
Outdated
if isinstance(accept_sparse, six.string_types):
    accept_sparse = [accept_sparse]

# Indices Datatype regulation
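For illustration, the normalization in that hunk can be sketched in isolation (a simplified stand-in; the real code uses six.string_types for Python 2/3 compatibility):

```python
# Simplified sketch: a single format string is wrapped into a list so the
# rest of the validation code can treat both call styles uniformly.
def normalize_accept_sparse(accept_sparse):
    if isinstance(accept_sparse, str):
        accept_sparse = [accept_sparse]
    return accept_sparse

print(normalize_accept_sparse("csr"))           # ['csr']
print(normalize_accept_sparse(["csr", "csc"]))  # ['csr', 'csc']
```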
Tests are now passing.
ogrisel left a comment:
LGTM besides the following comments.
sklearn/decomposition/nmf.py
Outdated
- X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float)
+ X = check_array(X, accept_sparse=('csr', 'csc'),
+                 dtype=float)
cosmetics: this change looks useless.
sklearn/decomposition/nmf.py
Outdated
  """
- X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float)
+ X = check_array(X, accept_sparse=('csr', 'csc'),
+                 dtype=float)
accept_large_sparse : bool (default=True)
    If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by
    accept_sparse, accept_large_sparse will cause it to be accepted only
Write accept_large_sparse=False instead of just accept_large_sparse.
sklearn/utils/validation.py
Outdated
                " to 0.14.0 or above" % scipy_version)
raise TypeError("Only sparse matrices with 32-bit integer"
                " indices are accepted. Got %s indices."
                % indices_datatype)
Other scikit-learn input validation checks tend to raise ValueError when the type of the container is ok (array, sparse matrix, dataframe) but the dtype is invalid:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
Traceback (most recent call last):
File "<ipython-input-8-d601c7dabfdc>", line 1, in <module>
LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
order="C")
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 494, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'invalid'

Therefore we might want to raise ValueError here.
Hmm. We raise TypeError for wrong sparse format. I can change to ValueError...?
I can change this, but I feel like TypeError is more precise.
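For concreteness, the check under discussion roughly amounts to the following hedged sketch (the helper name and the exact attributes inspected are assumptions for illustration, not the PR's actual code), raising ValueError as suggested:

```python
import numpy as np
import scipy.sparse as sp

def check_large_sparse_indices(X, accept_large_sparse=True):
    # Hypothetical sketch: reject 64-bit index arrays unless explicitly allowed.
    if accept_large_sparse:
        return
    for attr in ("indices", "indptr", "row", "col"):
        idx = getattr(X, attr, None)
        if idx is not None and idx.dtype.itemsize > 4:
            raise ValueError("Only sparse matrices with 32-bit integer"
                             " indices are accepted. Got %s indices." % idx.dtype)

X = sp.csr_matrix(np.eye(2))
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
check_large_sparse_indices(X, accept_large_sparse=True)  # accepted
try:
    check_large_sparse_indices(X, accept_large_sparse=False)
except ValueError as exc:
    print("rejected:", exc)
```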
def _generate_sparse_matrix(X_csr):
    """Generate sparse matrices with {32,64}bit indices of diverse format
Side comment: once this is merged, I think there are some other existing tests where this function could be useful (cc @glemaitre )
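As a rough picture of what such a helper can look like (an assumed shape for illustration, not the PR's exact implementation):

```python
import numpy as np
import scipy.sparse as sp

def generate_sparse_matrices(X_csr):
    # Sketch: yield the input in several sparse formats, plus a 64-bit-index
    # CSR variant, so tests can loop over all of them.
    assert X_csr.format == "csr"
    yield "csr", X_csr
    for fmt in ("csc", "coo", "bsr"):
        yield fmt, X_csr.asformat(fmt)
    X64 = X_csr.copy()
    X64.indices = X64.indices.astype(np.int64)
    X64.indptr = X64.indptr.astype(np.int64)
    yield "csr_int64", X64

mats = dict(generate_sparse_matrices(sp.csr_matrix(np.eye(3))))
print(sorted(mats))  # ['bsr', 'coo', 'csc', 'csr', 'csr_int64']
```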
Want to give an opinion on …
I think I'm still in favor (+0) of …
One more data point: we explicitly reject complex dtyped arrays with a ValueError:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
Traceback (most recent call last):
File "<ipython-input-9-70badd21e619>", line 1, in <module>
LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
order="C")
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 497, in check_array
"{}\n".format(array))
ValueError: Complex data not supported
[[1.2+1.j]]
@rth does this have your +1?
rth left a comment:
I don't have a strong opinion on the TypeError vs ValueError. I feel that for complex arrays (and possibly 32/64 bit indices), a TypeError would have made more sense; at least scipy does raise it in case of dtype mismatch (see e.g. scipy/scipy#8360 (comment) or this line). But ValueError seems fine for consistency's sake.
LGTM. Thanks @jnothman !
Yay! Great to close some old issues!
Supersedes and closes #9678, given @kdhingra307's silence
Fixes #9545
Fixes #4149
Partially addresses #2969
TODO: