ENH: Make it possible to call .view on object arrays by eric-wieser · Pull Request #8514 · numpy/numpy

eric-wieser · 2017-01-21T14:36:35Z

Right now you can do this with primitive types:

>>> np.zeros((2, 3), dtype=np.int).view([('a', np.int,3)])
array([[([0, 0, 0],)],
       [([0, 0, 0],)]], 
      dtype=[('a', '<i4', (3,))])

This adds

>>> np.zeros((2, 3), dtype=object).view([('a', object,3)])
array([[([0, 0, 0],)],
       [([0, 0, 0],)]], 
      dtype=[('a', 'O', (3,))])

Which would previously error

This makes it possible to use np.unique on 2d object arrays

seberg · 2017-01-21T14:39:18Z

Not sure what exactly is in it, but you might want to check #5508 before continuing.

seberg · 2017-01-21T14:41:19Z

Ah, sorry, nvm. I remembered there was sometihng about relaxing view, but that one is not about objects.

eric-wieser · 2017-01-21T14:42:34Z

This is simply aiming to change _view_is_safe, which is coming in a future commit

eric-wieser · 2017-01-21T15:03:24Z

numpy/core/_internal.py

This import feels kinda hacky, but it can't go globally. Should I put this function somewhere else?

eric-wieser · 2017-01-21T15:04:02Z

numpy/core/_internal.py

Is there a better way of doing this check? Converting to list?

Useful for finding the location of custom dtype objects, or np.object_ members to improve the behaviour of .view

…n objects This makes unique work on object arrays

ahaldane · 2017-01-22T19:03:44Z

You can also look at #5548, which was an attempt to do something similar. However it was largely reverted in #6562 because it turned out to be too big a performance hit to look up all the object positions, the way I implemented it. I welcome an attempt to fix this, btw.

In #5548, the methods _get_all_field_offsets is probably similar to your find_dtype_offsets, and I also implemented _view_is_safe quite similarly to how you do here.

There are also tests there I wrote which may be useful here.

eric-wieser · 2017-01-23T02:26:26Z

However it was largely reverted in #6562 because it turned out to be too big a performance hit to look up all the object positions

What even is a "performance hit" when the result previous to that patch was an exception? Is it simply the lack of the hasobject short circuit present here that reverted it?

ahaldane · 2017-01-23T07:36:11Z

The problem was due to "hidden" objects, as I discussed with an example in the first comment in #5548. The possibility of hidden objects means I needed to do a computationally expensive overlap check even if the HASOBJECTS flag is false. In #6208 I tried to limit the cases in which the overlap needs to be checked, but it wasn't enough, see the function _may_have_objects there.

As I vaguely recall, when the performance problem became a big enough issue I realized that the other changes I had made over the course of those PRs actually fixed most of the issues I had originally indended to fix, without needing object views. Therefore I was perfectly happy to simply disable object views and remove the safety checks.

I think it would be great to re-enable object views, though. I think it would fix a number of open issues. It's very likely there are fixes to the hidden object problems I encountered which I didn't explore properly, and some fresh eyes on the problem are very welcome.

Maybe a solution is to make the HASOBJECTS flag better reflect whether the array contains objects....? I'm not sure, because that would probably also mess with the reference counting.

eric-wieser · 2017-01-23T11:08:15Z

So right now, this is a somewhat more restrictive version of your PR, as it doesn't allow object members to be hidden in a view, whereas yours just stopped members being reinserted - but as a result, presumably comes at less of a performance cost.

Your issue was the extra time in executing _getfield_is_safe, I assume? Am I right in thinking I am OK to leave that untouched if aiming to simply fix views?

ahaldane · 2017-01-23T16:05:04Z

That's true, a more restrictive version could avoid the problems.

And yes, the performance cost was in _getfield_is_safe, but particularly in the bye-by-byte checks I did which you might not need in the restricted version.

Also, I think a perfomance hit is acceptable as long as it only affects object arrays. My problem was it affected non-object structured arrays too.

eric-wieser · 2017-01-23T16:19:53Z

@ahaldane I'll go ahead and copy the tests from that pull request then.

Would you prefer me to make find_dtype_offsets not part of the public API, as you did? Or change it back to your variant that returns unfiltered fields? There's definitely a performance boost for not wasting time finding the offsets of (np.float, (64, 64, 64)) when all you care about is object arrays

Such as find_dtype_offsets( (float, (1024, 1024, 1024), object )

Adds back tests for object array views. Marks the ones that deliberately fail with knownfailif, for now

eric-wieser · 2017-01-23T17:07:00Z

@ahaldane: Ok, your tests are copied across

ahaldane · 2017-01-23T17:33:46Z

I don't have a strong opinion about where find_dtype_offsets should go, if you think it should be public that's fine with me.

I needed the non-object fields to do byte-overlap checks; if you don't need them it sounds good to optimize them out.

eric-wieser · 2017-02-19T16:18:51Z

@ahaldane: So do you think this needs further changes?

ahaldane · 2017-02-19T17:29:44Z

Oh I didn't realize it was finished.. I'm taking a look now.

ahaldane · 2017-02-19T19:06:13Z

numpy/core/_internal.py

-    return
+    # no object members is fine
+    if not newtype.hasobject and not oldtype.hasobject:
+        return


This is the part I am still thinking about, becuase of the problem of "hidden objects".

It's true that currently, I don't think there is a way to create hidden objects because most operations involving objects are disallowed or make copies, but we want to relax those restrictions.

We will have a problem when trying to solve #5994, as I attempt in my PR #6053. That is, once we allow multi-field views like:

>>> a = np.zeros(5, dtype='O,O,O') >>> b = a[['f0','f2']] >>> b.dtype dtype({'names':['f0','f2'], 'formats':['O','O'], 'offsets':[0,16], 'itemsize':24})

then b will have a hidden object at byte 8. That will be true of any implementation of #5994, besides #6053. (Currently the second line returns a copy instead of a view, thus avoiding hidden objects, but that is what we want to change).

I am trying to figure out the easiest way to get both this PR and multi-field views to interact safely. (this may involve setting limitations on the plans in #5994).

Note how this PR would allow this view, using b from my example above:

>>> b.view('O,p,O') >>> b['f1'] = 0 # segfault?

Perhaps the solution is to disallow multi-field views if it would create hidden objects? That also seems a little wonky. I need to think a bit more about it.

I don't think you meant to comment on this line, because it matches the old behaviour here.

I think I see the problem now. It doesn't exist yet, right, but would be a consequence of allowing partial views.

But as it stands, this pr doesn't allow the creation of those partial views in the first place. I believe this pr allows a strict subset of the things a complete partial-view implementation would allow

Although I guess the result would be that all this gets thrown out when partial views do make it

Perhaps the best solution is just to store a reference to the original dtype for partial views, or simply a list of hidden object offsets?

What about this:

First, I think we only need some relatively minor updates (in a future PR, when multi-field views are enabled) so that the hasobject field is guaranteed to be True if the dtype has objects including hidden objects. Thus, whenever taking a view of an array with objects, the view dtype will have hasobject=true even if the objects are hidden.

This is actually often the case as implemented in #6053 - with c = np.zeros(4, dtype="p,O,p")[['f0','f2']], I see that c.dtype.hasobject is True. However, I'm pretty sure I saw other cases where I can get hasobject to be false if there are hidden objects. Those would need to be fixed.

Then, this PR also needs a modification: The view should only be allowed if all the fields (not just object fields) are present at the right position in the view. This way b.view('O,p,O') is disallowed since the p is missing in a. (And if hasobject=False for both inputs, the view is allowed).

I think that change to this PR still allows most of the use-cases you've pointed out.

charris · 2018-09-25T15:42:33Z

@eric-wieser Needs rebase. What do you want to do with this?

eric-wieser · 2019-05-26T18:47:55Z

numpy/lib/type_check.py

+            return (base_offsets[:,None] + sub_offsets).ravel()
+
+    # record type - combine the fields
+    if dtype.fields:


for when i get back to this: Needs is not None, and a test for np.dtype([])

eric-wieser force-pushed the view-compound-object branch from 35f82ef to f0ecabb Compare January 21, 2017 14:42

eric-wieser commented Jan 21, 2017

View reviewed changes

eric-wieser force-pushed the view-compound-object branch 2 times, most recently from 7bc42b9 to a558e1a Compare January 21, 2017 15:17

ENH: add np.find_dtype_offsets

0e6f9ee

Useful for finding the location of custom dtype objects, or np.object_ members to improve the behaviour of .view

eric-wieser changed the title ~~Make it possible to call .view on object arrays~~ ENH: Make it possible to call .view on object arrays Jan 21, 2017

eric-wieser force-pushed the view-compound-object branch from a558e1a to 93ba54b Compare January 21, 2017 15:28

ENH: Allow object arrays to be .view'd, providing object fields remai…

abff3ab

…n objects This makes unique work on object arrays

eric-wieser force-pushed the view-compound-object branch from 93ba54b to abff3ab Compare January 21, 2017 16:24

MAINT: Remove unnecessary try/catch

99ce4b1

eric-wieser mentioned this pull request Jan 21, 2017

Add axis argument to numpy.unique #7742

Merged

charris added 01 - Enhancement component: numpy._core labels Jan 21, 2017

DOC: add find_type_offsets to list of dtype routines

31793a0

MAINT: Optimization for unwanted subdtypes

eb6d0e5

Such as find_dtype_offsets( (float, (1024, 1024, 1024), object )

eric-wieser force-pushed the view-compound-object branch from 8678575 to eb6d0e5 Compare January 23, 2017 16:28

REV, TST: Partially revert 086d42d

38ecaf3

Adds back tests for object array views. Marks the ones that deliberately fail with knownfailif, for now

ahaldane reviewed Feb 19, 2017

View reviewed changes

eric-wieser mentioned this pull request Jul 4, 2017

recfunctions.append_fields fails on arrays containing objects (Trac #1751) #2346

Closed

eric-wieser mentioned this pull request Apr 5, 2019

BUG: Change dtype comparison in _view_is_safe to be independent of name fields. #13261

Closed

eric-wieser mentioned this pull request May 26, 2019

rfn.rename_fields fails if arrays contains objects #13213

Open

eric-wieser commented May 26, 2019

View reviewed changes

Base automatically changed from master to main March 4, 2021 02:03

anntzer mentioned this pull request Aug 17, 2021

PERF: Rely on C-level str conversions in loadtxt for up to 2x speedup #19687

Closed

charris added the 52 - Inactive Pending author response label Apr 6, 2022

charris closed this Apr 6, 2022

Uh oh!

Conversation

eric-wieser commented Jan 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seberg commented Jan 21, 2017

Uh oh!

seberg commented Jan 21, 2017

Uh oh!

eric-wieser commented Jan 21, 2017

Uh oh!

eric-wieser Jan 21, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser Jan 21, 2017

Choose a reason for hiding this comment

Uh oh!

ahaldane commented Jan 22, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eric-wieser commented Jan 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Jan 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eric-wieser commented Jan 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Jan 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eric-wieser commented Jan 23, 2017

Uh oh!

eric-wieser commented Jan 23, 2017

Uh oh!

ahaldane commented Jan 23, 2017

Uh oh!

eric-wieser commented Feb 19, 2017

Uh oh!

ahaldane commented Feb 19, 2017

Uh oh!

ahaldane Feb 19, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser Feb 19, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser Feb 19, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser Feb 19, 2017

Choose a reason for hiding this comment

Uh oh!

ahaldane Feb 20, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charris commented Sep 25, 2018

Uh oh!

eric-wieser May 26, 2019

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eric-wieser commented Jan 21, 2017 •

edited

Loading

ahaldane commented Jan 22, 2017 •

edited

Loading

eric-wieser commented Jan 23, 2017 •

edited

Loading

ahaldane commented Jan 23, 2017 •

edited

Loading

eric-wieser commented Jan 23, 2017 •

edited

Loading

ahaldane commented Jan 23, 2017 •

edited

Loading

ahaldane Feb 19, 2017 •

edited

Loading

ahaldane Feb 20, 2017 •

edited

Loading