Skip to content

[MRG] Tests for no sparse y support in RandomForests#7996

Closed
dalmia wants to merge 8 commits intoscikit-learn:mainfrom
dalmia:7886
Closed

[MRG] Tests for no sparse y support in RandomForests#7996
dalmia wants to merge 8 commits intoscikit-learn:mainfrom
dalmia:7886

Conversation

@dalmia
Copy link
Copy Markdown
Contributor

@dalmia dalmia commented Dec 7, 2016

Reference Issue

Fixes #7886

What does this implement/fix? Explain your changes.

This adds the test that we don't support sparse y as of yet for multilabel estimators.

Any other comments?

Since RandomTreesEmbedding doesn't make use of y, so I was not really clear as to how should it be included in the tests and hence, have left it out currently. Also, I did cross the maximum characters limit for the sake of clarity by a very small margin and noticed that there are other such lines (bigger) present, hence, chose to stick to it.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 7, 2016

We should be testing for a consistent error message text. use assert_raises_message

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 7, 2016

But note that the issue only really concerns classifiers.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 7, 2016

I.e. classifiers and not embeddings, regressors or outlier detectors.

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Dec 8, 2016

Just noticed something that the assert_raise_message is not mentioned in __all__ in sklearn.utils.testing. Is that intentional?

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 8, 2016 via email

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Dec 8, 2016

I found something weird. When I make a multilabel dataset:

X_multilabel, y_multilabel = make_multilabel_classification(random_state=0, n_samples=100)
print(sp.issparse(X_multilabel))

False

But while testing the check_no_sparse_y_support (modified for convenience):

def check_no_sparse_y_support(name):
    X, y = X_multilabel, sp.csr_matrix(y_multilabel)
    TreeEstimator = ALL_TREES[name]
    TreeEstimator(random_state=0).fit(X, y)


def test_no_sparse_y_support():
    # Currently we don't support sparse y
    for name in ALL_TREES:
        check_no_sparse_y_support(name)

I get the error that X is sparse:

TypeError                                 Traceback (most recent call last)
<ipython-input-37-296b45b2dfcc> in <module>()
----> 1 test_no_sparse_y_support()

<ipython-input-36-105d4a9cae8c> in test_no_sparse_y_support()
     10     # Currently we don't support sparse y
     11     for name in ALL_TREES:
---> 12         check_no_sparse_y_support(name)

<ipython-input-36-105d4a9cae8c> in check_no_sparse_y_support(name)
      4     X, y = X_multilabel, sp.csr_matrix(y_multilabel)
      5     TreeEstimator = ALL_TREES[name]
----> 6     TreeEstimator(random_state=0).fit(X, y)
      7 
      8 

/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
   1027             sample_weight=sample_weight,
   1028             check_input=check_input,
-> 1029             X_idx_sorted=X_idx_sorted)
   1030         return self
   1031 

/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    121         if check_input:
    122             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
--> 123             y = check_array(y, ensure_2d=False, dtype=None)
    124             if issparse(X):
    125                 X.sort_indices()

/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    378     if sp.issparse(array):
    379         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
--> 380                                       force_all_finite)
    381     else:
    382         array = np.array(array, dtype=dtype, order=order, copy=copy)

/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn/sklearn/utils/validation.pyc in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
    241     """
    242     if accept_sparse in [None, False]:
--> 243         raise TypeError('A sparse matrix was passed, but dense '
    244                         'data is required. Use X.toarray() to '
    245                         'convert to a dense numpy array.')

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

This doesn't seem right. Please let me know what you think. @jnothman

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 8, 2016 via email

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 8, 2016 via email

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Dec 8, 2016

Sorry, yes I meant to say that the error message includes X which makes things a bit confusing. I found the error message for RandomForests as the most precise one. Because of sparse matrices not being allowed in y = check_array(y, ensure_2d=False, dtype=None), we weren't getting a consistent error message. Thought of a workaround, please review.

@dalmia dalmia changed the title [MRG] Tests for no sparse y support in RandomForests [WIP] Tests for no sparse y support in RandomForests Dec 8, 2016
@dalmia dalmia changed the title [WIP] Tests for no sparse y support in RandomForests [MRG] Tests for no sparse y support in RandomForests Dec 8, 2016
@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Dec 8, 2016

Also, corrected a few PEP8 errors already existing in the files.

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Dec 10, 2016

ping @jnothman @amueller @lesteve

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 11, 2016 via email

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Dec 11, 2016 via email

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Dec 11, 2016

@jnothman I was not really sure as to when is it appropriate to ping. Thank you for clearing it up.

n_samples=50)
y_sparse = csr_matrix(y)
ForestClassifier = FOREST_CLASSIFIERS[name]
assert_raise_message(ValueError, "Unknown label type: 'unknown'",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a useful error message. For sparse X not supported we raise TypeError (not that I'm sure this decision was correct) with message "A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array."

One option is to use check_X_y and extend upon its multioutput parameter with an allow_sparse_y parameter to standardise checking and error message.

X = check_array(X, dtype=DTYPE, accept_sparse="csc")
y = check_array(y, ensure_2d=False, dtype=None)
y = check_array(y, ensure_2d=False, dtype=None,
accept_sparse=accept_sparse)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But surely accept_sparse should be False?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to ensure a consistent error message for the classification case. Since we are to change the message, I'll revert this.

y_sparse = csr_matrix(y)
ForestClassifier = FOREST_CLASSIFIERS[name]
assert_raise_message(TypeError, 'A sparse matrix was passed, but '
'dense data is required. Use X.toarray() to '
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think referring to X.toarray() will only confuse the user.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that is an issue. But is there any other alternative to this because at the end we are relying on check_array to raise the error message?

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Jan 6, 2017

Adding separately a check for sparse multilabel y seems the only option to produce a custom message if the current one seems confusing.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jan 7, 2017 via email

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Jan 8, 2017

Something like this?

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Jan 8, 2017

Oh my bad. Please check now.

Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also use the variable_name in the ensure_2d check.

FWIW, git grep 'check_array([^X]' sklearn/ suggests there are many more places in the codebase that variable_name should be used.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jan 8, 2017

I'm thinking that perhaps the addition of variable_name to check_array might benefit from being its own separate PR. WDYT?

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Jan 8, 2017

Yes, I agree. This seems an independent feature addition and seems cleaner to keep it in a separate PR. If I may have an overview as to at what places it might be needed, I can make that PR with the suitable change.

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Jan 8, 2017

I think this needs a rebuild:

======================================================================
FAIL: sklearn.decomposition.tests.test_pca.test_singular_values
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Python35\lib\site-packages\sklearn\decomposition\tests\test_pca.py", line 266, in test_singular_values
    np.linalg.norm(X_apca, "fro")**2.0, 12)
  File "C:\Python35\lib\site-packages\numpy\testing\utils.py", line 842, in assert_array_almost_equal
    precision=decimal)
  File "C:\Python35\lib\site-packages\numpy\testing\utils.py", line 665, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 12 decimals
(mismatch 100.0%)
 x: array(613.0583877392909)
 y: array(613.0583877392893)
sklearn.decomposition.tests.test_dict_learning.test_dict_learning_reconstruction_parallel: 4.4997s

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jan 8, 2017 via email

@dalmia
Copy link
Copy Markdown
Contributor Author

dalmia commented Jan 9, 2017

Would you prefer passing the variable_name for checks like:
data = check_array(data, accept_sparse='csr') or should I restrict myself to changing it wherever y appears?

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jan 9, 2017 via email

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jun 8, 2017

I assume @dalmia, you'd like someone else to complete this?

@jnothman
Copy link
Copy Markdown
Member

Since #9068, check_array no longer mentions X

@adrinjalali
Copy link
Copy Markdown
Member

Closing due to lack of activity from the OP. Happy to reopen if you get to work on this further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test support for sparse multilabel format

5 participants