Fix in vectorized item assignment #1746
Conversation
ind = Variable(['a'], [0, 1])
da[dict(x=ind, y=ind)] = 0
expected = DataArray([[0, 1], [1, 0], [1, 1]], dims=['x', 'y'])
self.assertDataArrayIdentical(expected, da)
For the future (no need to change this time): you can use `assert_identical` from the newer testing tools.
I always forget this, probably because I usually write tests based on the existing ones (I think that is what many contributors do).
I think we should update entire test cases at some point (similar to #1741), rather than migrating gradually.
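For readers unfamiliar with vectorized assignment, the behavior the test above checks has a direct NumPy analogue: passing the same index array for both dimensions assigns along paired positions. A minimal sketch in plain NumPy (an illustration, not the xarray test itself):

```python
import numpy as np

# Paired integer index arrays select elements (0, 0) and (1, 1),
# so the assignment hits exactly those positions.
a = np.ones((3, 2))
a[[0, 1], [0, 1]] = 0
# a is now [[0, 1], [1, 0], [1, 1]], matching `expected` in the test above
```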
xarray/core/nputils.py
Outdated
mixed_positions, vindex_positions = _advanced_indexer_subspaces(key)
self._array[key] = np.moveaxis(value, vindex_positions,
                               mixed_positions)
if np.isscalar(value):
I would suggest using `utils.is_scalar()` and `utils.to_0d_array` for coercing a scalar value to a NumPy array, then using the original code path here.
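A hypothetical sketch of that coercion idea (the helper below only mirrors what `utils.to_0d_array` is described as doing here; it is not xarray's implementation):

```python
import numpy as np

def to_0d_array(value):
    # Hypothetical mirror of the suggested helper: wrap a Python scalar
    # in a 0-dimensional ndarray so the ordinary array code path applies.
    if np.isscalar(value):
        return np.array(value)
    return np.asarray(value)
```

After such coercion, scalars and arrays can flow through the same moveaxis-based assignment path instead of needing a special case.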
I just noticed that we should support broadcasting here, rather than special-casing scalars.
value = Variable(dims[-value.ndim:], value).set_dims(dims).data
value = Variable(dims[-value.ndim:], value)
# broadcast to become assignable
value = value.set_dims(dims).data
I decided to revert `nputils.NumpyVindexAdapter` to the master version and to do the broadcasting here.

Tests are failing due to #1738.
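The broadcasting step in the diff above uses xarray's `Variable.set_dims`; in plain NumPy terms the same effect can be sketched as follows, under the assumption that the value's dimensions align with the trailing dims of the target:

```python
import numpy as np

# A 1-d value aligned to the trailing axis of a 2-d target:
# NumPy broadcasting prepends length-1 axes, so broadcast_to suffices.
target = np.zeros((2, 3))
value = np.array([1, 2, 3])
target[...] = np.broadcast_to(value, target.shape)
# every row of target now equals [1, 2, 3]
```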
else:
    # xarray-style array indexing
    # DataArray key -> Variable key
    key = {k: v.variable if isinstance(v, DataArray) else v
Do we want to enforce consistency for coordinates here? My inclination would be that we should support exactly the same keys in setitem as are valid in getitem. Ideally we should also reuse the same code. That means we should raise errors if there are multiple indexers with inconsistent alignment.
> My inclination would be that we should support exactly the same keys in setitem as are valid in getitem.

Reasonable. I will add a validation.
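A minimal sketch of the kind of validation promised here, assuming coordinates are represented as plain dicts of NumPy arrays (xarray's real `assert_coordinate_consistent` operates on Variables):

```python
import numpy as np

def assert_coordinate_consistent(coords, other):
    # Raise if the two mappings carry conflicting values for a shared name.
    for name in coords:
        if name in other and not np.array_equal(coords[name], other[name]):
            raise IndexError(
                "conflicting coordinate %r between indexing objects" % name)
```

With this in place, indexers whose coordinates disagree with the indexed subspace fail loudly instead of silently overwriting data.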
if not isinstance(value, Variable):
    value = as_compatible_data(value)
if value.ndim > len(dims):
    raise ValueError(
Do we still need this special case error message now that we call set_dims below?
I think `value = Variable(dims[-value.ndim:], value)` fails if `value.ndim > len(dims)`.
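To see why: a trailing slice like `dims[-value.ndim:]` can never yield more names than `dims` contains, so a value with extra dimensions gets too few dimension names, and either the constructor or an explicit check like the `ValueError` above has to reject it. A quick illustration in plain Python:

```python
dims = ('x', 'y')
value_ndim = 3  # a 3-d value against a 2-d target

# The trailing slice silently returns all of `dims`, i.e. only 2 names
# for a 3-dimensional value -- a mismatch that must raise somewhere.
trailing = dims[-value_ndim:]
```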
xarray/core/dataarray.py
Outdated
# Coordinates in key, value and self[key] should be consistent.
obj = self[key]
if isinstance(value, DataArray):
    assert_coordinate_consistent(value, obj.coords)
I was actually thinking of checking the consistency of coords on each DataArray argument in key instead. I guess we should probably check both!
I think if you use `obj.coords.variables` here we can skip the awkward `getattr(coords[k], 'variable', coords[k])` above.
xarray/core/coordinates.py
Outdated
for k in obj.dims:
    # make sure there are no conflict in dimension coordinates
    if (k in coords and k in obj.coords):
        coord = getattr(coords[k], 'variable', coords[k])  # Variable
It would be better to insist that coords always has the same type (e.g., a dict with Variable values).
That's really reasonable. Updated.

I will update the docs, as this is a backward-incompatible change.
shoyer left a comment
Do we also need to update assignment for .loc, or does this already handle it? It would be good to at least add tests. We can leave that for another PR if you prefer.
doc/indexing.rst
Outdated
.. note::

  If an indexer is a :py:meth:`~xarray.DataArray`, its coordinates should not
  conflict with the selected subpart of the target array.
We should probably note the exception for explicitly indexed dimensions with .loc/.sel?
xarray/core/coordinates.py
Outdated
"""
for k in obj.dims:
    # make sure there are no conflict in dimension coordinates
    if (k in coords and k in obj.coords):
Nit: you can drop the extra parentheses inside the `if` here.
assert_coordinate_consistent(value, obj.coords.variables)
# DataArray key -> Variable key
key = {k: v.variable if isinstance(v, DataArray) else v
       for k, v in self._item_key_to_dict(key).items()}
Do we need to check that coordinates are consistent on the key?
This is done a few lines above, in `obj = self[key]`, where `.isel` checks the coordinates in the key.
But I am concerned about this otherwise-unnecessary indexing, though I think this implementation is the simplest.
Hmm. This could indeed be a significant performance hit. That said, I'm OK leaving this for now, with a TODO note to optimize it later.
@shoyer Thanks for the review.
Done. As I commented above, coordinate validation requires a (practically) unused indexing operation.
xarray/core/dataset.py
Outdated
pos_indexers[k] = DataArray(pos_indexers[k],
                            coords=coords, dims=v.dims)
pos_indexers, new_indexes = remap_label_indexers(
    self, method, tolerance, **indexers)
only indent by one additional tab for continuation lines.
Looks good to me. @fujiisoup please hit the big green button when you're ready!
Ensure only pandas objects (Series, DataFrame) have their `.values` extracted during assignment, to avoid incorrect coercion of arbitrary objects with a `values` property (see pydata#1746). Adds regression tests verifying that objects with a `values` property (e.g. custom classes) are stored as-is in object-dtype arrays.
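The distinction matters because NumPy object arrays can hold arbitrary Python objects. A sketch of the regression being guarded against (the class name is illustrative, not taken from the xarray test suite):

```python
import numpy as np

class HasValues:
    # An arbitrary object that happens to expose a `values` attribute;
    # it is *not* a pandas object, so `.values` must not be extracted.
    values = 5

# Storing it in an object-dtype array should keep the object itself,
# not silently replace it with the integer 5.
arr = np.empty((1,), dtype=object)
arr[0] = HasValues()
```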
- Passes `git diff upstream/master **/*py | flake8 --diff`
- `whats-new.rst` for all changes and `api.rst` for new API

Found bugs in `nputils.NumpyVindexAdapter.__setitem__` and `DataArray.__setitem__`.
I will add more tests later.
Test case suggestions would be appreciated.