Skip to content

Clean up diagonal#280

Merged
njsmith merged 15 commits intonumpy:masterfrom
njsmith:clean-up-diagonal
Jun 6, 2012
Merged

Clean up diagonal#280
njsmith merged 15 commits intonumpy:masterfrom
njsmith:clean-up-diagonal

Conversation

@njsmith
Copy link
Member

@njsmith njsmith commented May 16, 2012

This pull request implements a transition scheme so that someday PyArray_Diagonal (and its interfaces numpy.diagonal, ndarray.diagonal, etc.) can return a view rather than a copy.

The proposed scheme is:

  • Numpy 1.7: PyArray_Diagonal returns a copy (as before), but writes to this copy produce a DeprecationWarning.
  • Numpy 1.8: PyArray_Diagonal returns a view, but writes to this view produce an error.
  • Numpy 1.9: PyArray_Diagonal returns an ordinary view, transition complete.

The 1.7 DeprecationWarning is implemented by this pull request, which adds a new hidden flag called NPY_ARRAY_WARN_ON_WRITE ('hidden' in the sense that it isn't exposed to Python by the flagsobject, though you can still see it from Python if you want by checking arr.flags.num & 0x10000). This flag is propagated through views, and, if set, causes a warning to be issued on the first write to an array.

I recommend looking at the individual commit diffs rather than the overall pull-request diff, since each commit is a self-contained unit. In particular the first 3 are generally useful cleanups independently of whatever we do with PyArray_Diagonal, the 4th implements the actual change, and the 5th documents it.

njsmith added 5 commits May 15, 2012 15:02
…teable

This is mostly a code cleanup, but it does have a user-visible effect
in that attempting to write to a unwriteable array now consistently
raises ValueError. (It used to randomly raise either ValueError or
RuntimeError.)

Passes numpy.test("full").
This patch removes all direct assignments of non-NULL values to the
'base' field on PyArrayObjects. A new utility function was created to
set up UPDATEIFCOPY arrays, and all assignments to 'base' were
adjusted to use either PyArray_SetBaseObject or the new
PyArray_SetUpdateIfCopyBase.

One advantage of this is that it produces more consistent error
handling. This error handling revealed a bug in the nditer code,
which this patch does *not* yet fix.

The bug is that npyiter_new_temp_array sometimes (when dealing with
reversed axes) creates an array and then returns a view onto it. But,
npyiter_allocate_arrays assumes that the array returned by
npyiter_new_temp_array can have UPDATEIFCOPY set, which requires
reassigning its 'base' field. Previously, this meant that the
temporary array leaked. Now, it produces a ValueError.

This code path doesn't seem to actually be hit very often; only one of
the nditer tests fails because of the change. See
   numpy/core/tests/test_nditer.py:test_iter_array_cast_buggy
All 'expected' values were computed using numpy 1.5.1, so we can be
pretty sure that the recent rewrite didn't change any results.
PyArray_Diagonal is changed to return a copy of the diagonal (as in
numpy 1.6 and earlier), but with a new (hidden) WARN_ON_WRITE flag
set. Writes to this array (or views thereof) will continue to work as
normal, but the first write will trigger a DeprecationWarning.

We also issue this warning if someone extracts a non-numpy writeable
view of the array (e.g., by accessing the Python-level .data
attribute). There are likely still places where the data buffer is
exposed that I've missed -- review welcome!

New known-fail test: eye() for maskna arrays was only implemented by
exploiting ndarray.diagonal's view-ness, so it is now unimplemented
again, and the corresponding test is marked known-fail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"not known to be True"

Thanks to Travis for catching it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variables must only be declared at the beginning of blocks in the version of C Numpy supports.

@mwiebe
Copy link
Member

mwiebe commented May 18, 2012

After renaming PyArray_RequireWriteable to PyArray_EnsureWriteable, fixing the C compile error, and reenabling the nditer test, I'm ok with merging this.

@njsmith
Copy link
Member Author

njsmith commented May 19, 2012

I'm -0 on the "EnsureWriteable" name -- to me "ensure" means that it will take efforts to make it true. E.g. asarray "ensures" that the input is an ndarray. One can imagine a function that passes through writeable arrays and makes temporary copies of unwriteable ones, but that isn't this function.

So I like RequireWriteable, or perhaps MustBeWriteable.

@njsmith
Copy link
Member Author

njsmith commented May 19, 2012

Okay, I think this is ready to merge (modulo bikeshedding about the function name).

@njsmith
Copy link
Member Author

njsmith commented May 19, 2012

Spoke too soon, but now it should be ready :-)

@teoliphant
Copy link
Member

If one more opinion helps, PyArray_RequireWriteable seems fine to me. I agree that EnsureWriteable seems to indicate that "something will be done to try and make it writeable" while in this case it's a check. I would be +0 on PyArray_CheckWriteable.

@njsmith
Copy link
Member Author

njsmith commented May 20, 2012

The Python C API convention is that "Check" is used for functions that return a boolean, though...

njsmith added 3 commits May 21, 2012 15:28
- NPY_ARRAY_WARN_ON_WRITE flag definition is protected by #ifdef
  _MULTIARRAYMODULE, to make totally sure that we can recycle the flag
  bit later.
- Improve docs for PyArray_SetUpdateIfCopyBase.
- Make the deprecation warning for writes to the diagonal array
  somewhat more terse.
- Use DEPRECATE macro instead of calling PyErr_Warn/PyErr_WarnEx
  directly.
- Comment formatting fixed.
@njsmith
Copy link
Member Author

njsmith commented May 21, 2012

Chuck: I wrote short responses to many of your comments as "edits" to the comments, so as to keep the discussion overview page readable (since it doesn't seem to do threading). But I guess you probably don't get notifications that way, so FYI.

Also, there were a few more Python version compatibility issues that I hadn't noticed, but this PR now passes tests on python 2.4, 2.5, 2.6, 2.7, 3.1, 3.2.

Bottom line: I think the main issue left is the question of whether PyArray_RequireWriteable and PyArray_SetUpdateIfCopyBase should be exposed in the API.

For RequireWriteable: Mark likes it; Chuck doesn't. I'm inclined to leave it public myself. Sure, it's simple enough that extension authors can open-code it everywhere they want it, but it fulfills the two main criteria I can think of for an abstraction: (1) it captures a conceptually coherent idea ("always call this before writing to an array"), (2) while the operation it performs is simple, it's not too simple to get wrong, as evidence by the number of random different exceptions that numpy used to throw depending on which code path discovered the non-writeability.

For UpdateIfCopyBase: I just made this public because it parallels PyArray_SetBaseObject. I'm not sure it's a great idea for user code to ever mess with UPDATEIFCOPY, but eh. UPDATEIFCOPY isn't going anywhere.

So I guess I'm +0 on both staying public, but I really don't care that much. If pulling one or both of them out of the API will let us move on then let's do it :-)

@charris
Copy link
Member

charris commented May 22, 2012

Hi Nathaniel,

The downside of the edits is that they don't get mailed out. I'd like another opinion on the API/flag issues before committing this. I really don't like the first function being in the API and would prefer the code as it is because it is more self documenting. I don't much like passing the error message to it either; it really looks to me like something that belongs in one of the include files.

@mwiebe Mark, could you give your opinion on the functions in the API and using a flag slot for a deprecation warning?

@mwiebe
Copy link
Member

mwiebe commented May 22, 2012

I like the RequireWriteable function because it provides a central place that anything affecting writeability can go. I still don't like the name, "if (RequireWriteable(a) < 0) ..." reads to me like it's asking whether 'a' requires something writeable up until the "< 0" part. Maybe one of "if (RaiseUnlessWriteable(a) < 0) ..." or "if (RaiseIfNotWriteable(a) < 0) ..." is more clear and not too clunky?

The change to UpdateIfCopyBase is good - the additional validity checks caught a clear bug in nditer. It's probably better if this one becomes private, though, as the way to get this set from third-party code is PyArray_FromAny with the NPY_ARRAY_UPDATEIFCOPY flag.

@mwiebe
Copy link
Member

mwiebe commented May 22, 2012

A flag seems to me like the most reasonable way to trigger this warning. Maybe the flag's #define could go in 'src/multiarray/arraytypes.h' instead of the public header 'ndarraytypes.h' with the include guard? It would be nice to avoid putting things people aren't supposed to use in the headers people will read to see what the API provides.

@charris
Copy link
Member

charris commented May 22, 2012

How about AssertWriteable? Assert also comes with an expectation of an error message.

@charris
Copy link
Member

charris commented May 22, 2012

Maybe the flags should be divided into public/private, with the private ones starting at the high end. It's a shame the flags slot is an integer instead of something of fixed size, but I think it will be at least 32 bits on all the architectures we support.

@teoliphant
Copy link
Member

+1 on public/private flag distinction and +1 on AssertWriteable (but I don't have a strong opinion on the API name).

@mwiebe
Copy link
Member

mwiebe commented May 22, 2012

+1 on AssertWriteable too

@teoliphant
Copy link
Member

@njsmith Doh! You are right. I gave a confusing explanation. It is the fact that a might have a base it's already pointing to that is the problem. ret should be pointed to 'a' as it's base without any compression of the base-object.

If we don't want to hard-code access to the ->base pointer, then we need the SetIfUpdateIfCopyBase (but, it should be called something more general because it's just basically arg1->base = arg2 with no base traversal....). UpdateIfCopy is one use case for not wanting base-traversal, but there may be others, so just call it SetImmediateBase or something.

@njsmith
Copy link
Member Author

njsmith commented May 23, 2012

@charris: I'm not putting anything on you; I'm just noting that we were both wrong about whether we had consensus. I'm also happy to compromise, like I said already, but I'm not going to just pretend an engineering problem doesn't exist in favor of making discussion easier; that would be irresponsible.

If you think that AssertWriteable reads sufficiently better to override API clarity then that's a pretty strong statement in favor of euphony, so maybe we should think harder about how to get both. InsistWriteable? DisallowUnwriteable? DemandWriteable?

@njsmith
Copy link
Member Author

njsmith commented May 24, 2012

@teoliphant: How about we just let NA_OutputArray call NA_IoArray (and thence PyArray_FromAny), and make SetUpdateIfCopyBase private like @mwiebe suggested (#280 (comment)). That way we can always change it later if it turns out to matter. Works?

@teoliphant
Copy link
Member

@njsmith I don't think that is the appropriate thing to do. There really does need to be an API for setting the base object that does not compress the bases. This should have been added when PyArray_SetBaseObject was added in the first place, but for some reason was not. Given that we need that API anyway, we should just use it in NA_OutputArray.

@njsmith
Copy link
Member Author

njsmith commented May 24, 2012

@teoliphant: I'm afraid I'm lost now. Why would we ever need such an API? Does it really make sense to be adding an API with no users just in case some come along later?

@teoliphant
Copy link
Member

@njsmith If we really don't want to extend the API, then just put the code in NA_OutputArray to set the base object directly: i.e. PyArray_BASE(ret)=(PyObject *)a and deprecate NA_OutputArray pointing people to PyArray_FromAny instead. But, I still think we need an API that is exactly equivalent to PyArray_BASE(ref)=(PyObject *)a. I'm still not sold that there is a problem with that spelling in the first place.

@teoliphant
Copy link
Member

@njsmith The PyArray_SetBaseObject API is the new API --- very recently added, I guess by Mark, but I'm not sure. It did not receive a lot of discussion. This API compresses the bases (i.e. finds the object which actually owns the data). In the past, the base always pointed to the most-recent object from which the array derived. Doing this compression has certain advantages, but it might not always be the right thing to do (like in this case). I'm not so sure that all the places where PyArrary_SetBaseObject(arg1, arg2) is now used, it should not still be PyArray_BASE(arg1) = (PyObject *)arg2. So, my point is that if you are going to add the 1 API, you have to either accept the spelling for the other use case as PyArray_BASE(arg1) = (PyObject *)arg2 or add another API call.

I would advocate just accepting the spelling PyArray_BASE(arg1) = (PyObject *)arg2. I still don't really understand what we are buying by "function calls" instead. I am not convinced that it is actually beneficial in the long run.

In summary: just change the PyArray_SetBaseObject(ret, a) to PyArray_BASE(ret) = (PyObject *)a;

@njsmith
Copy link
Member Author

njsmith commented May 24, 2012

@teoliphant: I see, right. It looks like Mark originally added it in b7cc20a (July 2011) as part of the direct field access deprecation stuff he did.

I'm also skeptical about the value of using function calls everywhere as a matter of course, but in this particular case I think it gives concrete advantages. The SetBaseObject API has better error checking, it gives us the base compression trick (which I guess saves memory usage in some cases), and -- the reason why we're talking about this at all as part of this PyArray_Diagonal change -- it's the only reliable way to detect views, so it gives a nice unobtrusive place to propagate the WARN_ON_WRITE flag to views. Can't do that with direct field access, whether or not it's wrapped in a macro like PyArray_BASE.

IIUC, there are exactly three cases where the ->base field is used:

  • It can pin a PyArrayObject in memory when making a view
  • It can pin an arbitrary non-ndarray PyObject in memory
  • It can designate the destination of an UPDATEIFCOPY writeback

We're currently handling all three of these cases correctly without using the new API you're suggesting. A quick skim through the existing calls to SetBaseObject seem to show that they all fall into the first two categories, which means that SetBaseObject is fine for them. I would definitely feel better about adding an API for this is you could come up with even one example of what it would be good for, besides UPDATEIFCOPY.

Well, anyway, there are a few possible ways forward:

  • Just use direct field assignment in NA_OutputArray. This means that it'll be buggy with respect to WARN_ON_WRITE and possibly other future changes to base object handling, but maybe we just don't care if NA_OutputArray is a bit buggy.
  • Remove base compression from SetBaseObject. This means that in a view chain, the intermediate PyArrayObject structs will stay pinned in memory so long as the outermost view object lives, which is wasteful of memory but probably unmeasurable in practice.
  • Add a new API function that acts like SetBaseObject but without base compression. This would be a pretty trivial and unused API function to support, though.
  • Use fa157f2255fee20cccb9a66a250d0c768f7d70fb to kick this down the road until someone comes up with another case where avoiding base compression is actually needed. In the mean time the PyArray_BASE(arr) = ... spelling will continue to exist...

Thoughts?

@teoliphant
Copy link
Member

@njsmith I am now confused. The API we are talking about is not mine, it is the one in this PR: SetUpdateIfCopyBase. The UPDATEIFCOPY facility is an important one. PyArray_SetBaseObject currently does not support that use-case which is why everywhere in current master that the use-case is needed, direct setting of the base member is used: including NA_OutputArray. That's why you introduced the API that we are talking about in order to intercept this use-case of setting the base object. You wanted to create the function-call internal to NumPy inorder to propagate your new WARN_ON_WRITE flag.

I presume you are going to keep this internal function call. If that is the case, then we must export it as well, because NA_OutputArray is just a simple use-case that mimics what other consumers of the NumPy API might be doing on their own: Create an empty array to hold data in a contiguous fashion (that you might be using for an output array when you call some C or Fortran code) and then attach another malformed NumPy Array to the "base" object in order to propagate any changes back to the malformed output array when the routine finishes.

PyArray_FromAny does not do this. It takes an array and returns one that satisfies certain requirements (allowing the UPDATEIFCOPY flag to be set if changes to the result should be copied back to the original). I suppose it could be used in a round-about way so that one could call PyArray_Empty and then PyArray_FromAny, but that is a very indirect way to do a very simple thing and much too confusing.

This is an important and documented use-case. If you want to intercept the simple operation of setting the base object for UPDATEIFCOPY functionality, then you will need to also introduce an API for consumers of NumPy to use if they want to help their code not be buggy.

This seems very cut and dry to me and I feel pretty strongly that if you are going to replace the simple setting of the base object in PyArray_FromAny and other places in the NumPy code base with a function call, then this function call must also be exported to other downstream consumers so they can do the "right" thing as well.

Now, another possibility is to simply add a flag to the PyArray_SetBaseObject API which toggles whether or not base chain compression happens. This flag could be called updateifcopy. But, I think Mark would prefer that there be another API because I know he doesn't like re-using the base object for two purposes (but I don't like having two structure members when one will do because the use-cases are orthogonal).

On a related note, I can't think of another case where setting the base-object does not really benefit from compression, but that doesn't mean they don't exist. There has not been base chain compression until 1.7 and so I'm not really sure if consumers have been relying on the fact that say a slice of the array has a base-object that points to the original array. They certainly could have been. I've done that from time to time in simple tests and examples (i.e. looked at the shape of the base-object which resulted from a slice operation). This will no longer be the case if the original array used memory from a buffer. There are work-arounds for all these use cases, but that doesn't mean we aren't going to create issues in people's code moving to 1.7.

So, this new behavior of PyArray_SetBaseObject has to be very well advertised to avoid downstream problems. If the flag exists to toggle compression of the bases, then it would be an easy thing to change this behavior. But, this also suggests that another API is useful to differentiate the updateifcopy use case from the chain compression use-case. Updateifcopy won't ever want to compress the chain, but it also has other desired features (like ensuring propagation of your WARN flag).

So, I am supportive of:

  • exporting the new API call you introduced
  • adding a flag or enumerated argument to PyArray_SetObjectBase that allows it to be called with different behaviors:
    * COMPRESS
    * NO_COMPRESS
    * UPDATEIFCOPY

My preference is the last one because it also allows for consumers to not compress the base argument if they do not want to. Of course, those consumers could continue to use PyArray_BASE(arr) directly which is why the first case is acceptable, but it seems cleaner to put all BaseSetting into a single function call.

@mwiebe
Copy link
Member

mwiebe commented May 24, 2012

For reference here's the bug report that the chain compression in PyArray_SetBaseObject fixed:

http://projects.scipy.org/numpy/ticket/466

@njsmith
Copy link
Member Author

njsmith commented May 24, 2012

@teoliphant: I see! I thought you were saying that we needed all three of (1) SetBaseObject, (2) SetUpdateIfCopyBase, (3) JustSetTheBaseWithoutCompression, as separate APIs.

I was all set to say screw it, let's just disable base compression across the board, but I guess that bug Mark points to is legitimate. Not sure anyone really needs to reverse an array a million times, but okay, presumably they ran into it somehow...

So. I actually agree with Mark (or at least your characterization of his opinion :-)), and think the overloading of "base" is a bit ugly (and it's the basic cause of the bug nditer bug this change discovered -- it was trying to have an array be a view and UPDATEIFCOPY simultaneously). I'm also generally opposed to adding interfaces "just in case" someone ever finds a use for them, even though we can't actually think of one. So on balance I prefer option (1), of making SetUpdateIfCopyBase into a real API. Sound good? (esp. a question for @mwiebe, since you raised the original objection to this being public?)

As for the question of what to do with base compression in general, I don't know, and it isn't really relevant to PyArray_Diagonal -- can we split that off into a separate bug or PR?

@teoliphant
Copy link
Member

@njsmith Great! I know I have been confusing in my part of the discussion by commenting on related API issues. I definitely did not want three API calls. I'm completely fine with just the two (and the fact that people can still call PyArray_BASE if they really, really want to to avoid compressing bases).

I won't argue that overloading "base" for two purposes is not ugly --- it just was better than introducing a new field just to support Numarray's UPDATEIFCOPY use-case.

I'm +1 on the public API for PyArray_SetUpdateIfCopyBase.

We can move the discussion of what will happen when PyArray_SetBaseObject compresses the base-chain on list.

@njsmith
Copy link
Member Author

njsmith commented May 24, 2012

Okay, so unless someone else objects, I'll plan to revert the NA_OutputArray change and leave SetUpdateIfCopyBase exposed in the API.

Anyone want to weigh in on the naming controversy?

@mwiebe
Copy link
Member

mwiebe commented May 25, 2012

I like having PyArray_SetUpdateIfCopyBase be a public API better than suggesting people use PyArray_BASE, because of its better error checking that caught the nditer bug. In general, I don't really like the UPDATEIFCOPY mechanism, as it relies on Py_DECREF being deterministic to work, and it's easy to imagine something outside of the control of a function using it caching a reference to an array somewhere and causing it to break. I suspect this required special handling for .NET in the numpy-refactor branch? I can understand it needing to stick around for backwards-compatibility reasons, though.

For the renaming of PyArray_RequireWriteable, I slightly favour PyArray_FailUnlessWriteable over PyArray_AssertWriteable, but both of them are ok. Avoiding the "assert" word because it has debug connotations makes sense to me.

@teoliphant
Copy link
Member

The UPDATEIFCOPY mechanism was introduced by Numarray and does solve a fundamental problem which you have to solve in another way if you don't do this. The problem is how to allow for an output array that could be arbitrarily mis-behaved while providing to a person wrapping C- or Fortran- code a simple well-behaved array to write to.

I'm not sure why it relies on the determinism of Py_DECREF. The copy-back will occur whenever the deallocation of the shadow occurs. It does rely on not messing with the READONLY flag of the array being shadowed. Of course, if someone keeps a reference to the shadow array, then the copy-back won't ever occur.

I suppose the same functionality could be handled in all cases with PyArray_FromAny and PyArray_CopyInto, though. So, instead of relying on Py_DECREF() of the UPDATEIFCOPY array, you explicitly call PyArray_CopyInto if needed before removing the array. I could see encouraging that style instead of using UPDATEIFCOPY.

Incidentally, deterministic memory management has some benefits (especially for codes that want predictable behavior without wondering whether the garbage collector will suddenly jump in). My enthusiasm for non-deterministic garbage collection strategies is not high. To me, it is a distinct advantage of the CPython run-time.

I like PyArray_FailUnlessWriteable just fine.

@njsmith
Copy link
Member Author

njsmith commented May 26, 2012

Okay, sounds like we're keeping SetUpdateIfCopyBase, so I "unpushed" that last change to NA_OutputArray (so fa157f2255fee20cccb9a66a250d0c768f7d70fb has disappeared from the pull request list of commits, though github doesn't seem to indicate this anywhere).

That leaves just the debate about AssertWriteable/FailUnlessWriteable/other. It sounds like each has ~3 votes; or maybe something like DemandWriteable or InsistWriteable would be even better. @charris -- thoughts?

@njsmith
Copy link
Member Author

njsmith commented Jun 1, 2012

Ping.

@njsmith
Copy link
Member Author

njsmith commented Jun 6, 2012

I guess what Chuck is saying here is that he's good with this as it stands? So I'll go ahead and merge this with FailUnlessWriteable, and we can always revisit the issue.

@njsmith njsmith merged commit 51616c9 into numpy:master Jun 6, 2012
WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this pull request Dec 23, 2021
The function that was added to the C API (in numpygh-280) is
`PyArray_FailUnlessWriteable`.
luyahan pushed a commit to plctlab/numpy that referenced this pull request Apr 25, 2024
feat: Add vrev16[q]_[s8|u8]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants