ENH Adds support for pandas dataframe with only sparse arrays #16728
rth merged 10 commits into scikit-learn:master from
Conversation
    # handles pandas sparse by checking for sparse attribute
    if hasattr(array, 'sparse') and array.ndim > 1:
        array = array.sparse.to_coo()
For now, .sparse only supports .to_coo().
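To illustrate the point above, here is a minimal sketch (assuming pandas >= 0.25 and scipy are installed) of the `.sparse` accessor: `to_coo()` is its only direct scipy conversion.

```python
# Sketch: the pandas .sparse accessor on an all-sparse DataFrame
import pandas as pd
import scipy.sparse as sp

sp_mat = sp.random(4, 3, density=0.5, format="csr", random_state=0)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)  # every column is a SparseArray

coo = sdf.sparse.to_coo()  # the only direct scipy conversion the accessor offers
assert sp.issparse(coo) and coo.format == "coo"
```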
    assert sp.issparse(result)
    assert result.format == 'coo'
    assert_allclose_dense_sparse(sp_mat, result)
Could we also add a check that, e.g.

    result = check_array(sdf, accept_sparse='csr')
    assert result.format == 'csr'
PR updated with a parametrized test.
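The parametrized test mentioned above could look roughly like the following sketch (the test name and parameter values here are illustrative, not the exact test added in the PR; it assumes scikit-learn >= 0.23, where this PR landed).

```python
# Hypothetical sketch of a parametrized test for sparse-DataFrame handling
import pandas as pd
import pytest
import scipy.sparse as sp
from sklearn.utils import check_array


@pytest.mark.parametrize("sp_format", [True, "csr", "csc", "coo"])
def test_check_array_sparse_dataframe(sp_format):
    sp_mat = sp.random(5, 4, density=0.4, format="csr", random_state=0)
    sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
    result = check_array(sdf, accept_sparse=sp_format)
    assert sp.issparse(result)
    if isinstance(sp_format, str):
        # a string value means check_array must return exactly that format
        assert result.format == sp_format
```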
adrinjalali left a comment
Nice, thanks @thomasjpfan
Codecov says this is always false in our CI?
Looks like codecov is more often wrong than right. Pandas 1.0 should be installed in pylatest_pip_openblas_pandas: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=14767&view=logs&j=78a0bf4f-79e5-5387-94ec-13e67d216d6e&t=75a90307-b084-59e7-ba24-7f7161804495&l=252
But that's pandas 1.0, not something older than 0.24.
NicolasHug left a comment
Thanks @thomasjpfan , a few comments but looks good
sklearn/utils/validation.py
Outdated
    if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
        # throw warning if pandas dataframe is sparse
        # throw warning if columns are sparse. If all columns are sparse, then
        # array.sparse exist and sparsity will be perserved.

Suggested change:

    -    # array.sparse exist and sparsity will be perserved.
    +    # array.sparse exists and sparsity will be preserved (later).
doc/whats_new/v0.23.rst
Outdated
    :pr:`16021` by :user:`Rushabh Vasani <rushabh-v>`.

    - |Enhancement| :func:`utils.validation.check_array` now constructs a sparse
      matrix from a pandas DataFrame with containing only `SparseArray`s.

Suggested change:

    -      matrix from a pandas DataFrame with containing only `SparseArray`s.
    +      matrix from a pandas DataFrame that contains only `SparseArray`s.
    sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
    result = check_array(sdf, accept_sparse=True)

    # by default pandas converts to coo
Is this the pandas default, or is it rather what we do in check_array?
> Is this the pandas default, or is it rather what we do in check_array?
Both -- pandas only has a direct conversion to COO at the moment, so it makes sense to keep that as the default. Conversion from COO to either CSC or CSR is fast.
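A quick sketch backing that claim: converting a scipy COO matrix to CSR or CSC is a single cheap pass over the stored entries, so using COO as the intermediate format costs little.

```python
# COO -> CSR/CSC conversion is inexpensive and lossless in scipy
import numpy as np
import scipy.sparse as sp

coo = sp.random(1000, 200, density=0.01, format="coo", random_state=0)
csr = coo.tocsr()  # cheap structural conversion
csc = coo.tocsc()

assert csr.format == "csr" and csc.format == "csc"
# the conversions preserve the values exactly
assert np.array_equal(csr.toarray(), coo.toarray())
```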
sklearn/utils/validation.py
Outdated
    estimator_name = "Estimator"
    context = " by %s" % estimator_name if estimator is not None else ""

    # handles pandas sparse by checking for sparse attribute

Suggested change:

    -    # handles pandas sparse by checking for sparse attribute
    +    # When all dataframe columns are sparse, convert to a sparse array
    assert_allclose_dense_sparse(sp_mat, result)

    @pytest.mark.parametrize('sp_format', ['csr', 'csc', 'coo', 'bsr'])
Maybe you can add the `True` option here to merge both tests?
Also, as usual, a short comment describing the test would help future us.
I wonder if the coverage error has something to do with this:

    Coverage.py warning: Data file '...' doesn't seem to be a coverage data file:
    Couldn't use data file '/home/vsts/work/tmp_folder/.coverage.fv-az762.6807.984702':
    no such table: coverage_schema
rth left a comment
Thanks @thomasjpfan , LGTM. I assume you need this for #16772 . Merging.
Indeed. Opened #16775 for a follow-up discussion.
    context = " by %s" % estimator_name if estimator is not None else ""

    # When all dataframe columns are sparse, convert to a sparse array
    if hasattr(array, 'sparse') and array.ndim > 1:
To be safe, I would also add a check that all columns are indeed sparse (the .sparse accessor now raises if not all columns are sparse, but I am not sure this is necessarily guaranteed to stay that way).
The explicit check would be something like:
    df.dtypes.apply(pd.api.types.is_sparse).all()
See also pandas-dev/pandas#26706
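The explicit all-columns-sparse check suggested above could be sketched like this (note that `pd.api.types.is_sparse` is deprecated in pandas 2.x in favor of checking for `pd.SparseDtype`; the helper name here is hypothetical).

```python
# Hypothetical helper: explicitly verify every column has a sparse dtype
import numpy as np
import pandas as pd
import scipy.sparse as sp

def has_only_sparse_columns(df):
    # is_sparse is True only for pandas SparseDtype columns
    return df.dtypes.apply(pd.api.types.is_sparse).all()

all_sparse = pd.DataFrame.sparse.from_spmatrix(sp.eye(3, format="csr"))
mixed = all_sparse.copy()
mixed[0] = np.arange(3)  # overwrite one column with a dense dtype

assert has_only_sparse_columns(all_sparse)
assert not has_only_sparse_columns(mixed)
```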
Reference Issues/PRs
Fixes #12800
What does this implement/fix? Explain your changes.
Does not convert a pandas DataFrame into a dense array if ALL its columns are sparse arrays.
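A minimal sketch of the resulting behavior, assuming scikit-learn >= 0.23 (where this PR landed): `check_array` keeps an all-sparse DataFrame sparse instead of densifying it.

```python
# check_array preserves sparsity for an all-sparse DataFrame
import pandas as pd
import scipy.sparse as sp
from sklearn.utils import check_array

sp_mat = sp.random(6, 3, density=0.3, format="csr", random_state=42)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)

result = check_array(sdf, accept_sparse=True)
assert sp.issparse(result)  # no conversion to a dense ndarray
```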
Any other comments?
Since we have pandas sparse support on our minds: CC @rth