SparseArray is an ExtensionArray#22325
Conversation
|
The root problem is |
|
Yes, I understand that, but we can still manually convert the sparse boolean to a numpy boolean in places where we use it for indexing a numpy array? (although it would not be needed for newer numpy) |
jreback
left a comment
There was a problem hiding this comment.
some comments. you may want to address some here (as you have to rebase anyhow). or a followup ok.
| * Passing a scalar for ``indices`` is no longer allowed. | ||
| - The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``. | ||
| - ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer supports combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray. | ||
| - Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed. |
There was a problem hiding this comment.
i agree the fill type should match the dtype but since missing value support is allowed here it is prob ok.
| - Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`) | ||
| - Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`) | ||
| - Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`) | ||
| - :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`). |
There was a problem hiding this comment.
really? we allow this. I agree this would be ok, but is reasonably tested?
There was a problem hiding this comment.
Define reasonable :)
I'm reasonably sure there are places in pandas where we assume we have an ndarray, but may get an ExtensionArray instead.
The common case of .isna().any() is well tested though, I think.
There was a problem hiding this comment.
In any case, if you create the masks in other ways than .isna(), eg with a comparison like df['sparse_column'] == 1, you get exactly the same issue. Which means we basically have to support indexing with boolean sparse masks anyway I think.
| Sparse | ||
| ^^^^^^ | ||
|
|
||
| - Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`) |
There was a problem hiding this comment.
really, we have support for this? again i agree this is a nice feature, but we are decreasing support generally for sparse, so not anxious to advertise this
|
|
||
| This should return a 1-D array the same length as 'self'. | ||
| * ``na_values._is_boolean`` should be True | ||
| * `na_values` should implement :func:`ExtensionArray._reduce` |
There was a problem hiding this comment.
we should probably have an Indexing EA mixin that implementes these as NotImplemented (so once can subclass)
There was a problem hiding this comment.
Such an indexing mixing might be useful, but how is this related to the line above?
| else: | ||
| return g, None | ||
| try: | ||
| g = np.find_common_type(upcast_classes, []) |
There was a problem hiding this comment.
ok,, yeah would be nice to simplify this
| @@ -67,7 +67,11 @@ def _from_sequence(cls, scalars, dtype=None, copy=False): | |||
| return cls.from_scalars(scalars) | |||
|
|
|||
| def __getitem__(self, item): | |||
There was a problem hiding this comment.
see my comment above, we should create an Indexing EA mixin and use the interface here
Oh, I see what you're suggesting. I only see two of these FutureWarnings from numpy... Looking at quantile, right now it'd be fine to convert the ExtensionArray mask to an ndarray because we know that The other case is in SparseArray.take, where we create the ndarray, so that's safe. |
Did you do a change for this? (I only see something in the diff for percentile) |
Sorry forgot to post what I found. The LOC is return self.take(np.arange(len(key), dtype=np.int32)[key])We can't hit that with an SparseArray (we go to dense earlier). I see now that we're hitting it with something like In [4]: arr[[True, False, True]]
pandas/core/sparse/array.py:662: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
return self.take(np.arange(len(key), dtype=np.int32)[key])
Out[4]:
[2, 1, 2]
Fill: 0
IntIndex
Indices: array([0, 1, 2], dtype=int32)again this is only on older numpys. So we can do an asarray on key there. |
|
FYI, this boolean indexing behavior affects through numpy 1.11. |
|
All green, if anyone wants to take a look at the dtype raising commit before pushing the green button. |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Last changes look good, just a minor doc comment
pandas/core/sparse/dtype.py
Outdated
| @@ -173,9 +173,10 @@ def construct_from_string(cls, string): | |||
| 'Sparse[int, 1]' SparseDtype[np.int64, 0] | |||
There was a problem hiding this comment.
I think this one now needs to be updated as the above would raise an error. But make it 'Sparse[int, 0] to show that default fill value is OK?
|
Good catch. Here we go. |
|
Whoohoo! @TomAugspurger did you open an issue with your follow-up items list? |
And there was already one for deprecating SparseDataFrame. I'll add SparseSeries to that discussion. |
Makes SparseArray an ExtensionArray. * Fixed DataFrame.__setitem__ for updating to sparse. Closes pandas-dev#22367 * Fixed Series[sparse].to_sparse Closes pandas-dev#22389 Closes pandas-dev#21978 Closes pandas-dev#19506 Closes pandas-dev#22835
|
Hi! Due to these changes there is an issue #30316 |
| dtype='object')]) | ||
| assert result.dtype == np.object_ | ||
| assert result.ftype == 'object:dense' | ||
| # TODO: release-note: concat sparse dtype |
There was a problem hiding this comment.
@TomAugspurger these TODOs are still present (though now in a different file). Is this actionable?
There was a problem hiding this comment.
Probably not.
Closes #21978
Closes #19506
Closes #22835
High-level summary: SparseArray is an ExtensionArray. It's no longer an ndarray subclass. The actual data model hasn't changed at all, it's still an array and a
sparse_index. Only now the sparse values areself.sparse_values, rather thanself.This isn't really close to being ready yet. I'm going to go through and self-review a bunch of things right now, will call out for others' opinions in specific places.
API discussions:
.astype(np_dtype)be sparse or dense?SparseArray([]).dtype? NumPy defaults to float (Sparse[float64]), pandas typically uses object (Sparse[object])