[MRG] Tests for no sparse y support in RandomForests#7996
[MRG] Tests for no sparse y support in RandomForests#7996dalmia wants to merge 8 commits intoscikit-learn:mainfrom
Conversation
|
We should be testing for a consistent error message text. use |
|
But note that the issue only really concerns classifiers. |
|
I.e. classifiers and not embeddings, regressors or outlier detectors. |
|
Just noticed something that the |
|
no, it is not. it's possible that a lot of things are missing from __all__
there.
…On 8 December 2016 at 14:32, Aman Dalmia ***@***.***> wrote:
Just noticed something that the assert_raise_message is not mentioned in
__all__ . Is that intentional?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#7996 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz69Huw92_SxnFV1LQX1RZLXEC1B1Zks5rF3pTgaJpZM4LGJ4_>
.
|
|
I found something weird. When I make a multilabel dataset: X_multilabel, y_multilabel = make_multilabel_classification(random_state=0, n_samples=100)
print(sp.issparse(X_multilabel))
But while testing the def check_no_sparse_y_support(name):
X, y = X_multilabel, sp.csr_matrix(y_multilabel)
TreeEstimator = ALL_TREES[name]
TreeEstimator(random_state=0).fit(X, y)
def test_no_sparse_y_support():
# Currently we don't support sparse y
for name in ALL_TREES:
check_no_sparse_y_support(name)I get the error that TypeError Traceback (most recent call last)
<ipython-input-37-296b45b2dfcc> in <module>()
----> 1 test_no_sparse_y_support()
<ipython-input-36-105d4a9cae8c> in test_no_sparse_y_support()
10 # Currently we don't support sparse y
11 for name in ALL_TREES:
---> 12 check_no_sparse_y_support(name)
<ipython-input-36-105d4a9cae8c> in check_no_sparse_y_support(name)
4 X, y = X_multilabel, sp.csr_matrix(y_multilabel)
5 TreeEstimator = ALL_TREES[name]
----> 6 TreeEstimator(random_state=0).fit(X, y)
7
8
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
1027 sample_weight=sample_weight,
1028 check_input=check_input,
-> 1029 X_idx_sorted=X_idx_sorted)
1030 return self
1031
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
121 if check_input:
122 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
--> 123 y = check_array(y, ensure_2d=False, dtype=None)
124 if issparse(X):
125 X.sort_indices()
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
378 if sp.issparse(array):
379 array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 380 force_all_finite)
381 else:
382 array = np.array(array, dtype=dtype, order=order, copy=copy)
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
241 """
242 if accept_sparse in [None, False]:
--> 243 raise TypeError('A sparse matrix was passed, but dense '
244 'data is required. Use X.toarray() to '
245 'convert to a dense numpy array.')
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.This doesn't seem right. Please let me know what you think. @jnothman |
|
no, it's talking about y, not about X. look at that traceback closely.
…On 8 December 2016 at 15:52, Aman Dalmia ***@***.***> wrote:
I found something weird. When I make a multilabel dataset:
X_multilabel, y_multilabel = make_multilabel_classification(random_state=0, n_samples=100)print(sp.issparse(X_multilabel))
False
But while testing the check_no_sparse_y_support (modified for
convenience):
def check_no_sparse_y_support(name):
X, y = X_multilabel, sp.csr_matrix(y_multilabel)
TreeEstimator = ALL_TREES[name]
TreeEstimator(random_state=0).fit(X, y)
def test_no_sparse_y_support():
# Currently we don't support sparse y
for name in ALL_TREES:
check_no_sparse_y_support(name)
I get the error that X is sparse:
TypeError Traceback (most recent call last)<ipython-input-37-296b45b2dfcc> in <module>()----> 1 test_no_sparse_y_support()
<ipython-input-36-105d4a9cae8c> in test_no_sparse_y_support()
10 # Currently we don't support sparse y
11 for name in ALL_TREES:---> 12 check_no_sparse_y_support(name)
<ipython-input-36-105d4a9cae8c> in check_no_sparse_y_support(name)
4 X, y = X_multilabel, sp.csr_matrix(y_multilabel)
5 TreeEstimator = ALL_TREES[name]----> 6 TreeEstimator(random_state=0).fit(X, y)
7
8
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
1027 sample_weight=sample_weight,
1028 check_input=check_input,-> 1029 X_idx_sorted=X_idx_sorted)
1030 return self
1031
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
121 if check_input:
122 X = check_array(X, dtype=DTYPE, accept_sparse="csc")--> 123 y = check_array(y, ensure_2d=False, dtype=None)
124 if issparse(X):
125 X.sort_indices()
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
378 if sp.issparse(array):
379 array = _ensure_sparse_format(array, accept_sparse, dtype, copy,--> 380 force_all_finite)
381 else:
382 array = np.array(array, dtype=dtype, order=order, copy=copy)
/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
241 """ 242 if accept_sparse in [None, False]:--> 243 raise TypeError('A sparse matrix was passed, but dense ' 244 'data is required. Use X.toarray() to ' 245 'convert to a dense numpy array.')TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
This doesn't seem right. Please let me know what you think. @jnothman
<https://github.com/jnothman>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7996 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz66cALuC3swYmn9pYnIm4QIbuCmIXks5rF40QgaJpZM4LGJ4_>
.
|
|
but yes, perhaps the error message could be improved
…On 8 December 2016 at 15:56, Joel Nothman ***@***.***> wrote:
no, it's talking about y, not about X. look at that traceback closely.
On 8 December 2016 at 15:52, Aman Dalmia ***@***.***> wrote:
> I found something weird. When I make a multilabel dataset:
>
> X_multilabel, y_multilabel = make_multilabel_classification(random_state=0, n_samples=100)print(sp.issparse(X_multilabel))
>
> False
>
> But while testing the check_no_sparse_y_support (modified for
> convenience):
>
> def check_no_sparse_y_support(name):
> X, y = X_multilabel, sp.csr_matrix(y_multilabel)
> TreeEstimator = ALL_TREES[name]
> TreeEstimator(random_state=0).fit(X, y)
>
> def test_no_sparse_y_support():
> # Currently we don't support sparse y
> for name in ALL_TREES:
> check_no_sparse_y_support(name)
>
> I get the error that X is sparse:
>
> TypeError Traceback (most recent call last)<ipython-input-37-296b45b2dfcc> in <module>()----> 1 test_no_sparse_y_support()
> <ipython-input-36-105d4a9cae8c> in test_no_sparse_y_support()
> 10 # Currently we don't support sparse y
> 11 for name in ALL_TREES:---> 12 check_no_sparse_y_support(name)
> <ipython-input-36-105d4a9cae8c> in check_no_sparse_y_support(name)
> 4 X, y = X_multilabel, sp.csr_matrix(y_multilabel)
> 5 TreeEstimator = ALL_TREES[name]----> 6 TreeEstimator(random_state=0).fit(X, y)
> 7
> 8
> /media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
> 1027 sample_weight=sample_weight,
> 1028 check_input=check_input,-> 1029 X_idx_sorted=X_idx_sorted)
> 1030 return self
> 1031
> /media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
> 121 if check_input:
> 122 X = check_array(X, dtype=DTYPE, accept_sparse="csc")--> 123 y = check_array(y, ensure_2d=False, dtype=None)
> 124 if issparse(X):
> 125 X.sort_indices()
> /media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
> 378 if sp.issparse(array):
> 379 array = _ensure_sparse_format(array, accept_sparse, dtype, copy,--> 380 force_all_finite)
> 381 else:
> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
> /media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
> 241 """ 242 if accept_sparse in [None, False]:--> 243 raise TypeError('A sparse matrix was passed, but dense ' 244 'data is required. Use X.toarray() to ' 245 'convert to a dense numpy array.')TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
>
> This doesn't seem right. Please let me know what you think. @jnothman
> <https://github.com/jnothman>
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#7996 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AAEz66cALuC3swYmn9pYnIm4QIbuCmIXks5rF40QgaJpZM4LGJ4_>
> .
>
|
|
Sorry, yes I meant to say that the error message includes |
|
Also, corrected a few PEP8 errors already existing in the files. |
|
It's fair enough to ping, but please wait a little more than two days!
there's far too much to get through
…On 10 December 2016 at 17:32, Aman Dalmia ***@***.***> wrote:
ping @jnothman <https://github.com/jnothman> @amueller
<https://github.com/amueller> @lesteve <https://github.com/lesteve>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7996 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6yP7qagQ0fojSFh_JKTcy134Q6lIks5rGkd4gaJpZM4LGJ4_>
.
|
|
I assure you: my silence is not because I intend to ignore this issue.
…On 11 December 2016 at 20:59, Joel Nothman ***@***.***> wrote:
It's fair enough to ping, but please wait a little more than two days!
there's far too much to get through
On 10 December 2016 at 17:32, Aman Dalmia ***@***.***>
wrote:
> ping @jnothman <https://github.com/jnothman> @amueller
> <https://github.com/amueller> @lesteve <https://github.com/lesteve>
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#7996 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AAEz6yP7qagQ0fojSFh_JKTcy134Q6lIks5rGkd4gaJpZM4LGJ4_>
> .
>
|
|
@jnothman I was not really sure as to when is it appropriate to ping. Thank you for clearing it up. |
| n_samples=50) | ||
| y_sparse = csr_matrix(y) | ||
| ForestClassifier = FOREST_CLASSIFIERS[name] | ||
| assert_raise_message(ValueError, "Unknown label type: 'unknown'", |
There was a problem hiding this comment.
I don't think that's a useful error message. For sparse X not supported we raise TypeError (not that I'm sure this decision was correct) with message "A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array."
One option is to use check_X_y and extend upon its multioutput parameter with an allow_sparse_y parameter to standardise checking and error message.
sklearn/tree/tree.py
Outdated
| X = check_array(X, dtype=DTYPE, accept_sparse="csc") | ||
| y = check_array(y, ensure_2d=False, dtype=None) | ||
| y = check_array(y, ensure_2d=False, dtype=None, | ||
| accept_sparse=accept_sparse) |
There was a problem hiding this comment.
But surely accept_sparse should be False?
There was a problem hiding this comment.
I added this to ensure a consistent error message for the classification case. Since we are to change the message, I'll revert this.
Conflicts: sklearn/ensemble/forest.py
| y_sparse = csr_matrix(y) | ||
| ForestClassifier = FOREST_CLASSIFIERS[name] | ||
| assert_raise_message(TypeError, 'A sparse matrix was passed, but ' | ||
| 'dense data is required. Use X.toarray() to ' |
There was a problem hiding this comment.
I think referring to X.toarray() will only confuse the user.
There was a problem hiding this comment.
Yes, I agree that is an issue. But is there any other alternative to this because at the end we are relying on check_array to raise the error message?
|
Adding separately a check for sparse multilabel y seems the only option to produce a custom message if the current one seems confusing. |
|
check_array isn't set in stone. You could add a parameter for a variable
name placeholder to be used in error messages. Or make a modification to
check_X_y. This issue is best solved by modifying something in
utils.validation so that then estimators can be consistent in their error
messages.
…On 7 January 2017 at 03:08, Aman Dalmia ***@***.***> wrote:
Adding separately a check for sparse multilabel y seems the only option to
produce a custom message if the current one seems confusing.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7996 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz66feC5G_X7YoVKathrHpSb2Iwbftks5rPmcMgaJpZM4LGJ4_>
.
|
|
Something like this? |
|
Oh my bad. Please check now. |
|
I'm thinking that perhaps the addition of |
|
Yes, I agree. This seems an independent feature addition and seems cleaner to keep it in a separate PR. If I may have an overview as to at what places it might be needed, I can make that PR with the suitable change. |
|
I think this needs a rebuild: ======================================================================
FAIL: sklearn.decomposition.tests.test_pca.test_singular_values
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Python35\lib\site-packages\nose\case.py", line 198, in runTest
self.test(*self.arg)
File "C:\Python35\lib\site-packages\sklearn\decomposition\tests\test_pca.py", line 266, in test_singular_values
np.linalg.norm(X_apca, "fro")**2.0, 12)
File "C:\Python35\lib\site-packages\numpy\testing\utils.py", line 842, in assert_array_almost_equal
precision=decimal)
File "C:\Python35\lib\site-packages\numpy\testing\utils.py", line 665, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 12 decimals
(mismatch 100.0%)
x: array(613.0583877392909)
y: array(613.0583877392893)
sklearn.decomposition.tests.test_dict_learning.test_dict_learning_reconstruction_parallel: 4.4997s |
|
Use git grep to find possible places for variable_name
…On 8 January 2017 at 20:02, Aman Dalmia ***@***.***> wrote:
I think this needs a rebuild:
======================================================================FAIL: sklearn.decomposition.tests.test_pca.test_singular_values----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Python35\lib\site-packages\nose\case.py", line 198, in runTest
self.test(*self.arg)
File "C:\Python35\lib\site-packages\sklearn\decomposition\tests\test_pca.py", line 266, in test_singular_values
np.linalg.norm(X_apca, "fro")**2.0, 12)
File "C:\Python35\lib\site-packages\numpy\testing\utils.py", line 842, in assert_array_almost_equal
precision=decimal)
File "C:\Python35\lib\site-packages\numpy\testing\utils.py", line 665, in assert_array_compare
raise AssertionError(msg)AssertionError:
Arrays are not almost equal to 12 decimals
(mismatch 100.0%)
x: array(613.0583877392909)
y: array(613.0583877392893)
sklearn.decomposition.tests.test_dict_learning.test_dict_learning_reconstruction_parallel: 4.4997s
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7996 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz67FI0xYsnKnRrIR2fbuFNBHiR50Vks5rQKYhgaJpZM4LGJ4_>
.
|
|
Would you prefer passing the |
|
if data is a parameter name within the api, then data too.
…On 9 Jan 2017 4:37 pm, "Aman Dalmia" ***@***.***> wrote:
Would you prefer passing the variable_name for checks like:
data = check_array(data, accept_sparse='csr') or should I restrict myself
to changing it wherever y appears?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7996 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6xBcBR-LDiA6CF1wT2RktnUo4ynlks5rQcesgaJpZM4LGJ4_>
.
|
|
I assume @dalmia, you'd like someone else to complete this? |
|
Since #9068, |
|
Closing due to lack of activity from the OP. Happy to reopen if you get to work on this further. |
Reference Issue
Fixes #7886
What does this implement/fix? Explain your changes.
This adds the test that we don't support sparse
yas of yet for multilabel estimators.Any other comments?
Since
RandomTreesEmbeddingdoesn't make use ofy, so I was not really clear as to how should it be included in the tests and hence, have left it out currently. Also, I did cross the maximum characters limit for the sake of clarity by a very small margin and noticed that there are other such lines (bigger) present, hence, chose to stick to it.