Skip to content

[MRG] Errors for pandas sparse arrays as target#14125

Merged
glemaitre merged 7 commits intoscikit-learn:masterfrom
thomasjpfan:type_of_target_sparse
Jul 19, 2019
Merged

[MRG] Errors for pandas sparse arrays as target#14125
glemaitre merged 7 commits intoscikit-learn:masterfrom
thomasjpfan:type_of_target_sparse

Conversation

@thomasjpfan
Copy link
Copy Markdown
Member

@thomasjpfan thomasjpfan commented Jun 19, 2019

Reference Issues/PRs

Resolves #14002
Resolves #14005

What does this implement/fix? Explain your changes.

With pandas >= 0.24, we can support pandas sparse arrays. #14005 (comment) BUT given the nature of how pandas uses np.nan as the zero value, this PR will continue to raise an error for pandas sparse arrays

a = pd.SparseArray([1, np.nan, 2, 1, np.nan])

np.array(a)                    
# array([ 1., nan,  2.,  1., nan])

np.array(pd.SparseSeries(a))                      
# array([ 1., nan,  2.,  1., nan])

np.array(pd.Series(a))         
# array([ 1., nan,  2.,  1., nan])

Copy link
Copy Markdown
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

raise ValueError("y cannot be class 'SparseSeries'.")
sparse_pandas = (y.__class__.__name__ in ['SparseSeries', 'SparseArray'])
if sparse_pandas:
with suppress(ImportError):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jun 19, 2019 via email

@thomasjpfan
Copy link
Copy Markdown
Member Author

thomasjpfan commented Jun 19, 2019

CI error is unrelated.

Should there be a sparse efficiency warning?

When a user calls type_of_target?

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jun 19, 2019 via email

@glemaitre
Copy link
Copy Markdown
Member

Should there be a sparse efficiency warning?

I am split with the idea. Ideally, we should but it might be noisy knowing that we convert only the target.

@thomasjpfan
Copy link
Copy Markdown
Member Author

thomasjpfan commented Jun 25, 2019

I am split with the idea. Ideally, we should but it might be noisy knowing that we convert only the target.

I am okay with an efficient warning in check_array. Because pandas sparse series was not supported, thus adding a warning would not introduce any new warnings.

@thomasjpfan
Copy link
Copy Markdown
Member Author

I can see three ways to handle this:

  1. Convert the pandas sparse array into a scipy sparse object. (This may not be worth it)
  2. Warn in check_array when a pandas sparse array is passed.
  3. Continue raising an exception for any pandas sparse array.

@thomasjpfan thomasjpfan changed the title [MRG] Only errors for pandas sparse arrays when pandas version < 0.24 [MRG] Errors for pandas sparse arrays as target Jul 4, 2019
@thomasjpfan
Copy link
Copy Markdown
Member Author

Given that pandas using np.nan as the zero value for their sparse array/series this may lead to confusion. This PR was updated to always raise an error when using pandas sparse arrays as a target.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Jul 4, 2019 via email

@thomasjpfan
Copy link
Copy Markdown
Member Author

Ah yes, it can be specified: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html using fill_value

At the moment this PR only updates a test to use SparseArray, because SparseSeries is deprecated, and checks for both in type_of_target.

@amueller
Copy link
Copy Markdown
Member

I haven't followed the whole discussion but this seems good?

@glemaitre glemaitre merged commit cc64397 into scikit-learn:master Jul 19, 2019
@glemaitre
Copy link
Copy Markdown
Member

Thanks @thomasjpfan

@jnothman jnothman mentioned this pull request Jul 28, 2019
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SparseSeries deprecated: scipy-dev failing on travis

5 participants