ENH Adds support for pandas dataframe with only sparse arrays #16728
rth merged 10 commits into scikit-learn:master from
Conversation
    # handles pandas sparse by checking for sparse attribute
    if hasattr(array, 'sparse') and array.ndim > 1:
        array = array.sparse.to_coo()
For now, .sparse only supports .to_coo().
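To illustrate the point above, here is a minimal sketch (assuming pandas >= 0.25 and scipy are installed) of the `.sparse` accessor: `to_coo()` is its only direct scipy conversion.

```python
# Sketch: the pandas .sparse accessor on an all-sparse DataFrame
import pandas as pd
import scipy.sparse as sp

sp_mat = sp.random(4, 3, density=0.5, format="csr", random_state=0)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)  # every column is a SparseArray

coo = sdf.sparse.to_coo()  # the only direct scipy conversion the accessor offers
assert sp.issparse(coo) and coo.format == "coo"
```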
    assert sp.issparse(result)
    assert result.format == 'coo'
    assert_allclose_dense_sparse(sp_mat, result)
Could we also add a check that, e.g.

    result = check_array(sdf, accept_sparse='csr')
    assert result.format == 'csr'
PR updated with a parametrized test.
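The parametrized test mentioned above could look roughly like the following sketch (the test name and parameter values here are illustrative, not the exact test added in the PR; it assumes scikit-learn >= 0.23, where this PR landed).

```python
# Hypothetical sketch of a parametrized test for sparse-DataFrame handling
import pandas as pd
import pytest
import scipy.sparse as sp
from sklearn.utils import check_array


@pytest.mark.parametrize("sp_format", [True, "csr", "csc", "coo"])
def test_check_array_sparse_dataframe(sp_format):
    sp_mat = sp.random(5, 4, density=0.4, format="csr", random_state=0)
    sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
    result = check_array(sdf, accept_sparse=sp_format)
    assert sp.issparse(result)
    if isinstance(sp_format, str):
        # a string value means check_array must return exactly that format
        assert result.format == sp_format
```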
adrinjalali left a comment
Nice, thanks @thomasjpfan
Codecov says this is always false in our CI?
Looks like codecov is more often wrong than right. Pandas 1.0 should be installed in pylatest_pip_openblas_pandas: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=14767&view=logs&j=78a0bf4f-79e5-5387-94ec-13e67d216d6e&t=75a90307-b084-59e7-ba24-7f7161804495&l=252
But that's pandas 1.0, not something older than 0.24.
NicolasHug left a comment
Thanks @thomasjpfan , a few comments but looks good
sklearn/utils/validation.py
Outdated
    if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
        # throw warning if pandas dataframe is sparse
        # throw warning if columns are sparse. If all columns are sparse, then
        # array.sparse exist and sparsity will be perserved.

Suggested change:

    -    # array.sparse exist and sparsity will be perserved.
    +    # array.sparse exists and sparsity will be preserved (later).
doc/whats_new/v0.23.rst
Outdated
    :pr:`16021` by :user:`Rushabh Vasani <rushabh-v>`.

    - |Enhancement| :func:`utils.validation.check_array` now constructs a sparse
      matrix from a pandas DataFrame with containing only `SparseArray`s.

Suggested change:

    -      matrix from a pandas DataFrame with containing only `SparseArray`s.
    +      matrix from a pandas DataFrame that contains only `SparseArray`s.
    sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
    result = check_array(sdf, accept_sparse=True)

    # by default pandas converts to coo
Is this the pandas default, or is it rather what we do in check_array?
> Is this the pandas default, or is it rather what we do in check_array?
Both -- pandas only has a direct conversion to COO at the moment, so it makes sense to keep that as the default. Conversion from COO to either CSC or CSR is fast.
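A quick sketch backing that claim: converting a scipy COO matrix to CSR or CSC is a single cheap pass over the stored entries, so using COO as the intermediate format costs little.

```python
# COO -> CSR/CSC conversion is inexpensive and lossless in scipy
import numpy as np
import scipy.sparse as sp

coo = sp.random(1000, 200, density=0.01, format="coo", random_state=0)
csr = coo.tocsr()  # cheap structural conversion
csc = coo.tocsc()

assert csr.format == "csr" and csc.format == "csc"
# the conversions preserve the values exactly
assert np.array_equal(csr.toarray(), coo.toarray())
```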
sklearn/utils/validation.py
Outdated
    estimator_name = "Estimator"
    context = " by %s" % estimator_name if estimator is not None else ""

    # handles pandas sparse by checking for sparse attribute

Suggested change:

    -    # handles pandas sparse by checking for sparse attribute
    +    # When all dataframe columns are sparse, convert to a sparse array
    assert_allclose_dense_sparse(sp_mat, result)

    @pytest.mark.parametrize('sp_format', ['csr', 'csc', 'coo', 'bsr'])
Maybe you can add the `True` option here to merge both tests?
Also, as usual, a short comment describing the test would help future us.
I wonder if the coverage error has something to do with this:

    Coverage.py warning: Data file '...' doesn't seem to be a coverage data file:
    Couldn't use data file '/home/vsts/work/tmp_folder/.coverage.fv-az762.6807.984702':
    no such table: coverage_schema
rth left a comment
Thanks @thomasjpfan , LGTM. I assume you need this for #16772 . Merging.
Indeed. Opened #16775 for a follow-up discussion.
    context = " by %s" % estimator_name if estimator is not None else ""

    # When all dataframe columns are sparse, convert to a sparse array
    if hasattr(array, 'sparse') and array.ndim > 1:
To be safe, I would also add a check that all columns are indeed sparse (the .sparse accessor now raises if not all columns are sparse, but I am not sure this is necessarily guaranteed to stay that way).
The explicit check would be something like:
    df.dtypes.apply(pd.api.types.is_sparse).all()
See also pandas-dev/pandas#26706
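The explicit all-columns-sparse check suggested above could be sketched like this (note that `pd.api.types.is_sparse` is deprecated in pandas 2.x in favor of checking for `pd.SparseDtype`; the helper name here is hypothetical).

```python
# Hypothetical helper: explicitly verify every column has a sparse dtype
import numpy as np
import pandas as pd
import scipy.sparse as sp

def has_only_sparse_columns(df):
    # is_sparse is True only for pandas SparseDtype columns
    return df.dtypes.apply(pd.api.types.is_sparse).all()

all_sparse = pd.DataFrame.sparse.from_spmatrix(sp.eye(3, format="csr"))
mixed = all_sparse.copy()
mixed[0] = np.arange(3)  # overwrite one column with a dense dtype

assert has_only_sparse_columns(all_sparse)
assert not has_only_sparse_columns(mixed)
```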
Reference Issues/PRs
Fixes #12800
What does this implement/fix? Explain your changes.
Does not convert a pandas DataFrame into a dense array if ALL its columns are sparse arrays.
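A minimal sketch of the resulting behavior, assuming scikit-learn >= 0.23 (where this PR landed): `check_array` keeps an all-sparse DataFrame sparse instead of densifying it.

```python
# check_array preserves sparsity for an all-sparse DataFrame
import pandas as pd
import scipy.sparse as sp
from sklearn.utils import check_array

sp_mat = sp.random(6, 3, density=0.3, format="csr", random_state=42)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)

result = check_array(sdf, accept_sparse=True)
assert sp.issparse(result)  # no conversion to a dense ndarray
```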
Any other comments?
Since we have pandas sparse support on our minds: CC @rth