ENH: Make it possible to call .view on object arrays#8514
ENH: Make it possible to call .view on object arrays#8514eric-wieser wants to merge 6 commits intonumpy:mainfrom
Conversation
|
Not sure what exactly is in it, but you might want to check #5508 before continuing. |
|
Ah, sorry, nvm. I remembered there was sometihng about relaxing view, but that one is not about objects. |
|
This is simply aiming to change |
35f82ef to
f0ecabb
Compare
numpy/core/_internal.py
Outdated
There was a problem hiding this comment.
This import feels kinda hacky, but it can't go globally. Should I put this function somewhere else?
numpy/core/_internal.py
Outdated
There was a problem hiding this comment.
Is there a better way of doing this check? Converting to list?
7bc42b9 to
a558e1a
Compare
Useful for finding the location of custom dtype objects, or np.object_ members to improve the behaviour of .view
a558e1a to
93ba54b
Compare
…n objects This makes unique work on object arrays
93ba54b to
abff3ab
Compare
|
You can also look at #5548, which was an attempt to do something similar. However it was largely reverted in #6562 because it turned out to be too big a performance hit to look up all the object positions, the way I implemented it. I welcome an attempt to fix this, btw. In #5548, the methods There are also tests there I wrote which may be useful here. |
What even is a "performance hit" when the result previous to that patch was an exception? Is it simply the lack of the |
|
The problem was due to "hidden" objects, as I discussed with an example in the first comment in #5548. The possibility of hidden objects means I needed to do a computationally expensive overlap check even if the As I vaguely recall, when the performance problem became a big enough issue I realized that the other changes I had made over the course of those PRs actually fixed most of the issues I had originally indended to fix, without needing object views. Therefore I was perfectly happy to simply disable object views and remove the safety checks. I think it would be great to re-enable object views, though. I think it would fix a number of open issues. It's very likely there are fixes to the hidden object problems I encountered which I didn't explore properly, and some fresh eyes on the problem are very welcome. Maybe a solution is to make the HASOBJECTS flag better reflect whether the array contains objects....? I'm not sure, because that would probably also mess with the reference counting. |
|
So right now, this is a somewhat more restrictive version of your PR, as it doesn't allow object members to be hidden in a view, whereas yours just stopped members being reinserted - but as a result, presumably comes at less of a performance cost. Your issue was the extra time in executing |
|
That's true, a more restrictive version could avoid the problems. And yes, the performance cost was in Also, I think a perfomance hit is acceptable as long as it only affects object arrays. My problem was it affected non-object structured arrays too. |
|
@ahaldane I'll go ahead and copy the tests from that pull request then. Would you prefer me to make |
Such as find_dtype_offsets( (float, (1024, 1024, 1024), object )
8678575 to
eb6d0e5
Compare
Adds back tests for object array views. Marks the ones that deliberately fail with knownfailif, for now
|
@ahaldane: Ok, your tests are copied across |
|
I don't have a strong opinion about where I needed the non-object fields to do byte-overlap checks; if you don't need them it sounds good to optimize them out. |
|
@ahaldane: So do you think this needs further changes? |
|
Oh I didn't realize it was finished.. I'm taking a look now. |
| return | ||
| # no object members is fine | ||
| if not newtype.hasobject and not oldtype.hasobject: | ||
| return |
There was a problem hiding this comment.
This is the part I am still thinking about, becuase of the problem of "hidden objects".
It's true that currently, I don't think there is a way to create hidden objects because most operations involving objects are disallowed or make copies, but we want to relax those restrictions.
We will have a problem when trying to solve #5994, as I attempt in my PR #6053. That is, once we allow multi-field views like:
>>> a = np.zeros(5, dtype='O,O,O')
>>> b = a[['f0','f2']]
>>> b.dtype
dtype({'names':['f0','f2'], 'formats':['O','O'], 'offsets':[0,16], 'itemsize':24})then b will have a hidden object at byte 8. That will be true of any implementation of #5994, besides #6053. (Currently the second line returns a copy instead of a view, thus avoiding hidden objects, but that is what we want to change).
I am trying to figure out the easiest way to get both this PR and multi-field views to interact safely. (this may involve setting limitations on the plans in #5994).
Note how this PR would allow this view, using b from my example above:
>>> b.view('O,p,O')
>>> b['f1'] = 0 # segfault?Perhaps the solution is to disallow multi-field views if it would create hidden objects? That also seems a little wonky. I need to think a bit more about it.
There was a problem hiding this comment.
I don't think you meant to comment on this line, because it matches the old behaviour here.
I think I see the problem now. It doesn't exist yet, right, but would be a consequence of allowing partial views.
But as it stands, this pr doesn't allow the creation of those partial views in the first place. I believe this pr allows a strict subset of the things a complete partial-view implementation would allow
There was a problem hiding this comment.
Although I guess the result would be that all this gets thrown out when partial views do make it
There was a problem hiding this comment.
Perhaps the best solution is just to store a reference to the original dtype for partial views, or simply a list of hidden object offsets?
There was a problem hiding this comment.
What about this:
First, I think we only need some relatively minor updates (in a future PR, when multi-field views are enabled) so that the hasobject field is guaranteed to be True if the dtype has objects including hidden objects. Thus, whenever taking a view of an array with objects, the view dtype will have hasobject=true even if the objects are hidden.
This is actually often the case as implemented in #6053 - with c = np.zeros(4, dtype="p,O,p")[['f0','f2']], I see that c.dtype.hasobject is True. However, I'm pretty sure I saw other cases where I can get hasobject to be false if there are hidden objects. Those would need to be fixed.
Then, this PR also needs a modification: The view should only be allowed if all the fields (not just object fields) are present at the right position in the view. This way b.view('O,p,O') is disallowed since the p is missing in a. (And if hasobject=False for both inputs, the view is allowed).
I think that change to this PR still allows most of the use-cases you've pointed out.
|
@eric-wieser Needs rebase. What do you want to do with this? |
| return (base_offsets[:,None] + sub_offsets).ravel() | ||
|
|
||
| # record type - combine the fields | ||
| if dtype.fields: |
There was a problem hiding this comment.
for when i get back to this: Needs is not None, and a test for np.dtype([])
Right now you can do this with primitive types:
This adds
Which would previously error
This makes it possible to use
np.uniqueon 2d object arrays